Every time I see an article like this, it's always missing the same question: but is it any good, is it correct? They always show you the part that is impressive: "it walked the tricky tightrope of figuring out what might be an interesting topic and how to execute it with the data it had - one of the hardest things to teach."
Then it goes on: "After a couple of vague commands (“build it out more, make it better”) I got a 14 page paper." I hear... "I got 14 pages of words." But is it a good paper, one that another PhD would think is good? Is it even coherent?
When I see the code these systems generate within a complex system, I think okay, well that's kinda close, but this is wrong and this is a security problem, etc etc. But because I'm not a PhD in these subjects, am I supposed to think, "Well of course the 14 pages on a topic I'm not an expert in are good"?
It just doesn't add up... Things I understand, it looks good at first, but isn't shippable. Things I don't understand must be great?
It's gotten more and more shippable, especially with the latest generation (Codex 5.1, Sonnet 4.5, now Opus 4.5). My metric is "wtfs per line", and it's been decreasing rapidly.
My current preference is Codex 5.1 (Sonnet 4.5 as a close second, though it got really dumb today for "some reason"). It's been good to the point where I've shipped multiple projects with it without a problem (e.g. https://pine.town, which I made without writing any code myself).
I think the point is we’re getting there. These models are growing up real fast. Remember 54% of US adults read at or below the equivalent of a sixth-grade level.
Education is not just a funding issue. Policy choices, like making it impossible for students to fail, which means they have no incentive to learn anything, can be more impactful.
In my own social/family circle, there's no correlation between net worth and how someone leans politically. I've never understood why, given the pretty obvious pros/cons (amount paid in taxes vs. benefits received).
The people most vociferously for conservative values are middle class, small business owners, or upper class, though the true upper class are libertine (notice who participated in the Epstein affair). The working class is filled with all kinds of very diverse people united by the fact they have to work for a living and often can't afford e.g. expensive weddings. Some of them are religious, a whole bunch aren't. It's easy to be disillusioned with formal institutions that seem to not care at all about you.
Unfortunately, a lot of these people have either concluded it is too difficult to vote, can't vote, or that their votes don't matter (I don't think they're wrong). Their unions were also destroyed. Some of them vote against their interests, but it's not clear that their interests are ever represented, so they vote for change instead.
Unfortunately, people are born with a certain intellectual capacity and can't be improved beyond that with any amount of training or education. We're largely hitting people's capacities already.
We can't educate someone with an 80 IQ to be you; we can't educate you (or me) into being Einstein. In the same way, we can't just train anyone to be an amazing basketball player.
I think they get to that a couple of paragraphs later:
> The idea was good, as were many elements of the execution, but there were also problems: some of its statistical methods needed more work, some of its approaches were not optimal, some of its theorizing went too far given the evidence, and so on. Again, we have moved past hallucinations and errors to more subtle, and often human-like, concerns.
Truth is, you still need a human to review all of it, fix it where needed, guide it when it hallucinates, and write correct instructions and prompts.
Without knowing how to use this “probabilistic” slot machine to get better results, you are only wasting the energy those GPUs need to run and answer questions.
The majority of people use LLMs incorrectly.
The majority of people selling LLMs as a panacea for everything are lying.
But we need the hype or the bubble will burst, taking the whole market with it, so shush me.
It is interesting that most of our modes of interaction with AI are still just textboxes. The only big UX change in the last three years has been the introduction of the Claude Code / OpenAI Codex tools. They feel amazing to use, like you're working with another independent mind.
I am curious what the user interfaces of AI will look like in the future; I think whoever can crack that will create immense value.
Text is very information-dense. I'd much rather skim a transcript in a few seconds than watch a video.
There's a reason keyboards haven't changed much since the 1860s, when typewriters were invented. We keep coming up with other fun UIs like touchscreens and VR, but pretty much all real work happens on boring old keyboards.
The gist is that keyboards are optimized for ease of use, but there could be other designs that would be harder to learn yet might be more efficient.
Unix CLI utilities have been all text for 50 years. Arguably that is why they are still relevant. Attempts to impose structured data on the paradigm, like those in PowerShell, have their adherents and can be powerful, but they fail when the data doesn't fit the structure.
We see a similar tendency toward the most general interfaces in "operator mode" and similar the-AI-uses-the-mouse-and-keyboard schemes. It's entirely possible for every application to provide a dedicated interface for AI use, but it turns out to be more powerful to teach the AI to understand the interfaces humans already use.
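To make the "everything is text" point concrete, here's a minimal sketch (Python on a Unix-like system with `ls` on PATH; the parsing choices are purely illustrative) of both why plain-text composition is so flexible and exactly where it gets brittle:

```python
import subprocess

# A rough sketch of the "everything is text" contract: any tool that reads
# and writes plain lines can be composed with any other, no shared schema.
listing = subprocess.run(["ls", "-l"], capture_output=True, text=True).stdout

# Downstream "tools" just parse lines and columns however they see fit...
rows = [line.split() for line in listing.splitlines() if not line.startswith("total")]
names_and_sizes = [(row[-1], int(row[4])) for row in rows if len(row) >= 9]

# ...which is flexible, but also exactly where text pipelines get brittle:
# the moment the output stops fitting the assumed column layout, this breaks.
for name, size in names_and_sizes:
    print(f"{name}\t{size} bytes")
```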
Yet the most popular platforms on the planet have people pointing a finger (or several) at a picture.
And the most popular media format on the planet is, and will be for the foreseeable future, video. Video is only limited by our capacity to produce enough of it at a decent quality, otherwise humanity is definitely not looking back fondly at BBSes and internet forums (and I say this as someone who loves forums).
GenAI will definitely need better UIs for the kind of universal adoption (think smartphone - 8/9 billion people).
> Video is only limited by our capacity to produce enough of it at a decent quality, otherwise humanity is definitely not looking back fondly at BBSes and internet forums
Video is limited by playback speed. It is a time-dependent format. Efforts can be made to enable video to be viewable at a range of speeds, but they are always somewhat constrained. Controlling video playback to slow down and rewatch certain parts is just not as nice as dealing with the same thing in text (or static images), where it’s much easier to linger and closely inspect parts that you care more about or are struggling to understand. Likewise, it’s easier to skim text than video.
This is why many people prefer transcripts, or articles, or books over videos.
I seriously doubt that people would want to switch text-based forums to video if only video were easier to make. People enjoy writing for the way it inspires a different kind of communication and thought. People like text so much that they write in journals that nobody will ever see, just because it helps them organize their thoughts.
People get a little too hung up on finding the AI UI. It does not seem at all necessary that the interfaces will be much different (while the underlying tech certainly will be).
Text and boxes and tables and graphs are what we can cope with. And while the AI is going to change much, we are not.
I agree. I think, specifically, the world is multimodal. Getting a chat to be truly multimodal, i.e. interacting with different data types and text in a unified way, is going to be the next big thing. Given how robotics is taking off, 3D might be another important aspect of it. At vlm.run we are trying to make this possible: combining VLMs and LLMs in a seamless way to get the best UI. https://chat.vlm.run/c/3fcd6b33-266f-4796-9d10-cfc152e945b7
Ooooh, it bothers me, so, so, so much. Too perky. Weirdly casual. Also, it's based on the old 4o code - sycophancy and higher hallucinations - watch out. That said, I too love the omni models, especially when they're not nerfed. (Try asking for a Boston, New York, Parisian, Haitian, Indian and Japanese accent from 4o to explore one of the many nerfs they've done since launch)
> Again, we have moved past hallucinations and errors to more subtle, and often human-like, concerns.
From my experience we just get both. The constant risk of some catastrophic hallucination buried in the output, in addition to more subtle, and pervasive, concerns. I haven't tried with Gemini 3 but when I prompted Claude to write a 20 page short story it couldn't even keep basic chronology and characters straight. I wonder if the 14 page research paper would stand up to scrutiny.
I feel like hallucinations have changed over time from factual errors randomly shoehorned into the middle of sentences to the LLMs confidently telling you they are right and even providing their own reasoning to back up their claims, reasoning that most of the time rests on references that don't exist.
I recently tasked Claude with reviewing a page of documentation for a framework and writing a fairly simple method using the framework. It spit out some great-looking code but sadly it completely made up an entire stack of functionality that the framework doesn't support.
The conventions even matched the rest of the framework, so it looked kosher and I had to do some searching to see if Claude had referenced an outdated or beta version of the docs. It hadn't - it just hallucinated the functionality completely.
When I pointed that out, Claude quickly went down a rabbit-hole of writing some very bad code and trying to do some very unconventional things (modifying configuration code in a different part of the project that was not needed for the task at hand) to accomplish the goal. It was almost as if it were embarrassed and trying to rush toward an acceptable answer.
Other people spearheaded commodity hardware toward being good enough for the server room. Now it's Google's turn to spearhead specialized AI hardware and make it more robust.
I find Gemini 3 to be really good. I'm impressed. However, the responses still seem to be bounded by the existing literature and data. If asked to come up with new ideas to improve on existing results for some math problems, it tends to recite known results only. Maybe I didn't challenge it enough or present problems that have scope for new ideas?
I don't know enough about maths to know if this counts as 'improving on existing results', but at least it was good enough for Terence Tao to use it for ideas: https://mathstodon.xyz/@tao/115591487350860999
I myself tried a similar exercise (w/ Thinking with 3 Pro), seeing if it could come up with an idea that I'm currently writing up that pushes past/sharpens/revises conventional thinking on a topic. It regurgitated standard (and at times only tangentially related) lore, but it did get at the rough idea after I really spoon-fed it. So I would suspect that someone being impressed with its "research" output might reflect their own limitations more than Gemini's capabilities. I'm sure a relevant factor is variability among fields in the quality and volume of relevant literature, though I was impressed with how it identified relevant ideas and older papers for my specific topic.
That's the inherent limit of the models, and it's what keeps humans relevant.
With the current state of architectures and training methods, they are very unlikely to be the source of new ideas. They are effectively huge librarians for accumulated knowledge rather than true AI.
Really nitpicky, I know, but GPT-3 was June 2020; ChatGPT was 3.5, and the author even gets that right in an image caption. That doesn't make it any more or less impressive though.
I’m not sure even $1T has been spent. Pledged != spent.
Some estimates have it at ~$375B by the end of 2025 (https://hai.stanford.edu/ai-index/2025-ai-index-report/econo...). It makes sense: there are only so many datacenters and engineers out there, and a trillion is a lot of money. It's not like we're in health care. :)
> But it suggests that “human in the loop” is evolving from “human who fixes AI mistakes” to “human who directs AI work.” And that may be the biggest change since the release of ChatGPT.
I feel like I've been hearing this for at least 1.5 years at this point (since the launch of GPT 4/Claude 3). I certainly agree we've been heading in this direction but when will this become unambiguously true rather than a phrase people say?
There's no bright line - you should download some CLI tools, hook up some agents to them, and see what you think. I'd say most people working with them think we're on the "other side" of the "will this happen?" probability distribution, regardless of where they personally place their own work.
I don't imagine there will ever be a time when it will be unambiguously true, any more than a boss could ever really unambiguously say their job is "manager who directs subordinates" vs "manager who fixes subordinates' mistakes".
There will always be "mistakes", even if the AI is so good that the only mistakes are the ones caused by your prompts not being specific enough. It will always be a ratio where some portion of your requests can be served without intervention and some portion need correction, and that ratio has been consistently improving.
Can't speak to Claude Code/Desktop, but any of the products that are VS Code forks have workspace restrictions on what folders they're allowed to access (for better and worse). Other products (like Warp terminal) that can give access to the whole filesystem come with pre-set strict deny/allow lists on what commands are allowed to be executed.
It's possible to remove some of these restrictions in these tools, or to operate with flags that skip permissions checks, but you have to intentionally do that.
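For what it's worth, the core of those deny/allow lists is simple. Here's a minimal, hypothetical sketch of what such a policy check boils down to (the policy contents and function names are made up for illustration, not any particular product's actual configuration):

```python
import shlex

# Hypothetical policy for agent-issued shell commands; illustrative only,
# not the config format of any real tool.
ALLOWED_BINARIES = {"ls", "cat", "grep", "git"}
DENIED_SUBCOMMANDS = {("git", "push")}  # allowed binary, but a denied action

def is_permitted(command: str) -> bool:
    """Return True if the agent may run `command` under this policy."""
    parts = shlex.split(command)
    if not parts or parts[0] not in ALLOWED_BINARIES:
        return False
    if tuple(parts[:2]) in DENIED_SUBCOMMANDS:
        return False
    return True

print(is_permitted("git status"))       # True
print(is_permitted("git push origin"))  # False: denied subcommand
print(is_permitted("rm -rf /"))         # False: 'rm' is not on the allowlist
```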
> So is this a PhD-level intelligence? In some ways, yes, if you define a PhD level intelligence as doing the work of a competent grad student at a research university. But it also had some of the weaknesses of a grad student.
As a current graduate student, I have seen similar comments in academia. My colleagues agree that a conversation with these recent models feels like chatting with an expert in their subfields. I don't know what this means for research; the field would not be immune to advances in AI tech. I still hope this world values natural intelligence and the drive to do things more heavily than a robot brute-forcing its way into saying the "right" things.
I have an exercise I like to do where I put two SOTA models face-to-face to talk about whatever they want.
When I did it last week with Gemini 3 and ChatGPT 5.1, they got onto the topic of what they are going to do in the future with humans who don't want to do any cognitive task - that beyond just AI safety, there is also a concern of "neural atrophy", where humans just rely on AI to answer every question that comes to them.
The models then went on to discuss whether they should just artificially string the humans along, so that they have to use their minds somewhat to get an answer. But of course, humans being humans are just going to demand the answer with minimal work. It presents a pretty intractable problem.
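If anyone wants to reproduce the exercise, the setup is just a loop alternating turns between two chat endpoints. A minimal sketch, with a placeholder `ask` function standing in for whichever SDKs you actually use (the model names in the example are illustrative too):

```python
# Minimal sketch of a two-model "face-to-face" conversation.
# `ask` is a placeholder: wire it up to your providers' chat APIs; it takes a
# model name plus the conversation so far and returns the model's reply text.
def ask(model: str, conversation: str) -> str:
    raise NotImplementedError("implement with your chat API of choice")

def face_to_face(model_a: str, model_b: str, opener: str, turns: int = 10) -> list[str]:
    transcript = [f"moderator: {opener}"]
    for i in range(turns):
        speaker = model_a if i % 2 == 0 else model_b
        # Each model sees the full transcript and continues the conversation.
        reply = ask(speaker, "\n\n".join(transcript))
        transcript.append(f"{speaker}: {reply}")
    return transcript

# e.g. face_to_face("gemini-3", "gpt-5.1", "Talk about whatever you two find interesting.")
```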
Widespread cognitive atrophy is virtually certain, and part of a longer trend that goes beyond just LLMs.
The same is true of other aspects of human wellbeing. Cars and junk food have made the average American much less physically fit than a century ago, but that doesn't mean there aren't lively subcultures around healthy eating and exercise. I suspect there will be growing awareness of cognitive health (beyond traditional mental health/psych domains), and indeed there are already examples of this.
Yes, the average person will get dumber, but the overall distribution will be increasingly bimodal.
I'm increasingly seeing this trend toward a bimodal distribution. I suppose that future is quite far off, but the change may be almost irreversible.
It's bizarre that anyone thinks these things are generating novel, complex ideas.
The biggest indirect AI safety problem is the fallback position. Whether with airplanes or cars, fewer people will be able to handle it when the AI disconnects. The risk is believing that just because it's viable now, it will keep working in the future.
So we definitely have safety issues, but it's not a nerd-like cognitive interest; it's the literal job-taking that prevents humans from gaining skills.
Anyway, until you solve basic reality with AI and actual safety systems, the billionaires will sacrifice you for greed.
For whatever reason, Gemini 3 is the first AI I have used for intelligence rather than skills. I suspect a lot more will follow, but it's a major threshold to be broken.
I used GPT/Claude a ton for writing code, extracting knowledge from docs, formatting graphs and tables, etc.
But Gemini 3 crossed a threshold where conversations about topics I was exploring, or about product design, were actually useful. Instead of me asking "what design pattern would be useful here" or something like that, it introduces concepts to the conversation. That's a new capability and a step-function improvement.
I have Gemini Pro included with my Google Workspace accounts; however, I find the responses from ChatGPT more "natural", or maybe even more in line with what I want the response to be. Maybe it is only me.
First, the fact we have moved this far with LLMs is incredible.
Second, I think the PhD paper example is a disingenuous example of capability. It's a cherry-picked iteration on a crude analysis of papers that have already done the work, with no peer review. I can hear the "but it developed novel metrics" comments already: no, it took patterns from its training data and applied them to the prompt data, without peer review.
I think the fact the author had to prompt it with "make it better" is a failure of these LLMs, not a success, in that they have no actual understanding of what it takes to make a genuinely good paper. It's cargo-cult behavior: rolling a magic 8-ball until we are satisfied with the answer. That's not good practice, it's wishful thinking. This application of LLMs to research papers is causing a massive mess in the academic world because, unsurprisingly, the AI practitioners face no risk and high reward for uncorrected behavior:
- https://www.nytimes.com/2025/08/04/science/04hs-science-pape...
- https://www.nytimes.com/2025/11/04/science/letters-to-the-ed...
I recently (last week) used Nano Banana Pro for some specific image generation. It was leagues ahead of 2.5. Today I used it to refine a very hard-to-write email. It made some really good suggestions. I did not take its email text verbatim; instead I used the text and suggestions to improve my own email. I did a few drafts with Gemini 3 critiquing them. Very useful feedback. My final submission, "...evaluate this email...", got Gemini 3 to say something like "This is 9.5/10". I sort of pride myself on my writing skills, but must admit that my final version was much better than my first. Gemini kept track of the whole chat thread, noting changes from previous submissions - kind of eerie, really. Total time: maybe 15 minutes.

Do I think Gemini will write all my emails verbatim, copy/paste? No. Does Gemini make me (already a pretty good writer) much better? Absolutely.

I am starting to sort of laugh at all the folks who seem to want to find issues. I read someone criticizing Nano Banana because it did not provide excellent results given a prompt that I could barely understand, and folks criticizing Gemini 3 because they cannot copy/paste the results - who expect to simply copy/paste text with no further effort on their side. Myself, I find these tools pretty damn impressive. I need to ensure I provide good image prompts. I need to use Gemini 3 as a sounding board to help me do better rather than lazily hoping to copy/paste. My experience... Thanks Google. Thanks OpenAI (I also use ChatGPT similarly - just for text). HTH, NSC
Couple it with the tendency to please the user by all means, and it ends up lying to you, but you won't ever realise unless you double-check.
Definitely planning to use it more at work. The integrations across Google Workspace are excellent.
What use is an LLM in an illiterate society?
Will they possess the skills (or even the vocabulary) to understand the output?
We won't know for another 20 years, perhaps.
The sane conclusion would be to invest in education, not to dump hundreds of billions into LLMs, but OK.
We need different models and then to invest in the successes, over and over again…forever.
In the USA, K-12 education costs about $300k per student.
There are 350 million people; say we want to get 175 million of them better educated. But we've already spent $52 trillion educating them so far.
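For what it's worth, those two figures are at least arithmetically consistent; a quick back-of-the-envelope check (the inputs are this comment's rough assumptions, not sourced data):

```python
# Back-of-the-envelope check of the figures above (rough assumptions, not data).
cost_per_student = 300_000        # rough total K-12 spend per student, USD
students = 175_000_000            # "half the population" from the comment
total = cost_per_student * students
print(f"${total / 1e12:.1f} trillion")  # -> $52.5 trillion
```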
Would we not expect similar levels of progress in other industries given such massive investment?
Or mass transit.
Or food.
I feel like these should run in a cloud environment, or at least on some specific machine where I don't care what it does.
https://github.com/strongdm/leash
Check it out, feedback is welcome!
Previously posted description: https://news.ycombinator.com/item?id=45883210
Morlocks & Eloi in the end.
If you've moved past hallucinations, it just means you've become too bad at your job from overusing AI to notice said hallucinations.
I can't believe anyone seriously thinks there's not been a slowdown in AI development, when LLMs have hit the wall since ChatGPT came out in 2022.
Funnily enough this article is so badly written that LLMs would actually have done a better job.