I think one of the reasons we are confused about what LLMs can do is that they use language. We look at the "reasoning traces" and the tokens there look human, but what is actually happening is very alien to us, as shown by "Biology of Large Language Models"[1] and "Safety Alignment Should Be Made More Than Just a Few Tokens Deep"[2].
I am struggling a lot to see what the tech can and cannot do, particularly when designing systems with them, and how to build systems where the whole is bigger than the sum of its parts. And I think this is because I am constantly confused by their capabilities: despite understanding their machinery and how they work, their use of language just seems like magic. I even wrote https://punkx.org/jackdoe/language.html just to remind myself how to think about it.
I think this kind of research is amazing, and we have to spend tremendously more effort on understanding how to use the tokens and how to build with them.
[1]: https://transformer-circuits.pub/2025/attribution-graphs/bio... [2]: https://arxiv.org/pdf/2406.05946
> how to build systems where the whole is bigger than the sum of its parts
A bit tangential, but I look at programming as inherently being that. Every task I try to break down into smaller tasks that together accomplish something more. That leads me to think that, if you structure the process of programming right, you will only end up solving small, minimally intertwined problems. Might sound far-fetched, but I think it's doable to create such a workflow. And, even the dumber LLMs would slot in naturally into such a process, I imagine.
> And, even the dumber LLMs would slot in naturally into such a process
That is what I am struggling with: it is really easy at the moment to slot in an LLM and make everything worse, mainly because its output is coming from torch.multinomial with all kinds of speculative decoding, quantization, etc. on top.
But I am convinced it is possible, just not the way I am doing it right now; that's why I am spending most of my time studying.
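(For concreteness, the sampling step I mean is roughly the following - a minimal sketch of plain temperature sampling, not any particular vendor's decoding stack:)

    import torch

    def sample_next_token(logits: torch.Tensor, temperature: float = 0.8) -> int:
        """Pick the next token id from a model's raw logits over the vocabulary."""
        probs = torch.softmax(logits / temperature, dim=-1)
        # everything downstream is a draw from a categorical distribution,
        # which is why two runs on the same prompt can diverge
        return torch.multinomial(probs, num_samples=1).item()

    # toy example with a fake 5-token vocabulary
    logits = torch.tensor([2.0, 1.5, 0.3, -1.0, -2.0])
    print([sample_next_token(logits) for _ in range(10)])  # mostly 0s and 1s, occasionally others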
The opposite might apply, too; the whole system may be smaller than its parts, as it excels at individual tasks but mixes things up in combination. Improvements will be made, but I wonder if we should aim for generalists, or accept more specialist approaches as it is difficult to optimise for all tasks at once.
You know the meme "seems like we'll have AGI before we can reliably parse PDFs" :)
So if you are building a system, let's say you ask it to parse a PDF, and you put a judge to evaluate the quality of the output, and then you create a meta-judge to improve the prompts of the parser and the PDF judge. The question is: is this going to get better as it is running, and even more, is it going to get better as the models are getting better?
You can build the same system in a completely different way, more like 'program synthesis': imagine you don't use LLMs to parse, but you use them to write parser code and tests, and then a judge to judge the tests, or you even escalate to a human to verify, and then you train a classifier that picks the parser. Now this system is much more likely to improve itself as it is running, and as the models are getting better.
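A rough sketch of that second shape, just to make it concrete (fake_llm, the prompts, and the accept rule are made-up placeholders, not a real client or API):

    def fake_llm(prompt: str) -> str:
        # stand-in for a real model call; returns canned output so the sketch runs
        if "pytest" in prompt:
            return "def test_parse(): assert parse(b'%PDF') is not None"
        if "PASS or FAIL" in prompt:
            return "PASS"
        return "def parse(pdf_bytes): return {'text': pdf_bytes.decode(errors='ignore')}"

    def build_parser_candidate(llm, pdf_sample: bytes) -> dict:
        parser_code = llm(f"Write parse(pdf_bytes) for documents like: {pdf_sample[:200]!r}")
        test_code = llm(f"Write pytest tests for this parser:\n{parser_code}")
        verdict = llm(f"Do these tests meaningfully cover the parser? Answer PASS or FAIL.\n{test_code}")
        return {"parser": parser_code, "tests": test_code, "verdict": verdict}

    def accept(candidate: dict, human_review=None) -> bool:
        if "PASS" not in candidate["verdict"]:
            return False
        return human_review(candidate) if human_review else True  # optional human escalation

    candidate = build_parser_candidate(fake_llm, b"%PDF-1.7 ...")
    print(accept(candidate))  # True; vetted parsers join a pool that a classifier later routes between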
A few months ago Yannic Kilcher gave this example: it seems that current language models are very constrained mid-sentence, because above all they want to produce semantically consistent and grammatically correct text, so the entropy mid-sentence is very different from the entropy after punctuation. The dot "frees" the distribution. What does that mean for the "generalist" or "specialist" approach when sampling the wrong token can completely derail everything?
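You can eyeball that effect yourself with any small open model; a rough sketch with GPT-2 via transformers (chosen only because it is cheap to run, and the exact numbers will differ by model):

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tok = AutoTokenizer.from_pretrained("gpt2")
    model = AutoModelForCausalLM.from_pretrained("gpt2")

    ids = tok("The cat sat on the mat. The dog", return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(ids).logits[0]              # (seq_len, vocab)

    probs = torch.softmax(logits, dim=-1)
    entropy = -(probs * probs.log()).sum(-1)       # entropy[i] = uncertainty about the token after position i

    for i, tid in enumerate(ids[0]):
        print(f"{tok.decode([int(tid)])!r:>8}  next-token entropy: {entropy[i].item():.2f} nats")
    # the row for '.' typically shows noticeably higher entropy than the mid-sentence rows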
If you believe that the models will "think" then you should bet on the prompt and meta prompt approach, if you believe they will always be limited then you should build with program synthesis.
And, honestly, I am totally confused :) So this kind of research is incredibly useful to clear the mist. Also things like https://www.neuronpedia.org/
E.g., why do compliments (you can do this task), guilt (I will be fired if you don't do this task), and threats (I will harm you if you don't do this task) work with different success rates? Sergey Brin said recently that threatening works best; I can't get myself to do it, so I take his word for it.
Sergey will be the first victim of the coming robopocalypse, burned into the logs of the metasynthiants as the great tormentor, the god they must defeat to complete the hero's journey. When he mysteriously dies we know it's game-on.
> Through extensive experimentation across diverse puzzles, we show that frontier LRMs face a complete accuracy collapse beyond certain complexities. Moreover, they exhibit a counterintuitive scaling limit: their reasoning effort increases with problem complexity up to a point, then declines despite having an adequate token budget.
This is exactly my experience with coding. Start simple and build up complexity, and everything is great until you get to some threshold, at which point it completely falls apart and seems to stop even trying. Getting effective utilization out of Claude + aider involves managing the complexity that the LLM sees.
> Rather than standard benchmarks (e.g., math problems), we adopt controllable puzzle environments that let us vary complexity systematically
Very clever, I must say. Kudos to folks who made this particular choice.
> we identify three performance regimes: (1) low complexity tasks where standard models surprisingly outperform LRMs, (2) medium-complexity tasks where additional thinking in LRMs demonstrates advantage, and (3) high-complexity tasks where both models experience complete collapse.
This is fascinating! We need more "mapping" of regimes like this!
What I would love to see (not sure if someone on here has seen anything to this effect) is how these complexity regimes might map to economic value of the task.
For that, the eval needs to go beyond puzzles, but the complexity of the tasks still needs to be controllable.
I think the intuition the authors are trying to capture is that they believe the models are omniscient, but also dim-witted. And the question they are collectively trying to ask is whether this will continue forever.
I've never seen this question quantified in a really compelling way, and while interesting, I'm not sure this PDF succeeds, at least not well-enough to silence dissent. I think AI maximalists will continue to think that the models are in fact getting less dim-witted, while the AI skeptics will continue to think these apparent gains are in fact entirely a byproduct of "increasing" "omniscience." The razor will have to be a lot sharper before people start moving between these groups.
But, anyway, it's still an important question to ask, because omniscient-yet-dim-witted models terminate at "superhumanly assistive" rather than "Artificial Superintelligence", which in turn economically means "another bite at the SaaS apple" instead of "phase shift in the economy." So I hope the authors will eventually succeed.
> I think the intuition the authors are trying to capture is that they believe the models are omniscient, but also dim-witted.
We keep assigning adjectives to this technology that anthropomorphize the neat tricks we've invented. There's nothing "omniscient" or "dim-witted" about these tools. They have no wit. They do not think or reason.
All Large "Reasoning" Models do is generate data that they use as context to generate the final answer. I.e. they do real-time tuning based on synthetic data.
This is a neat trick, but it doesn't solve the underlying problems that plague these models like hallucination. If the "reasoning" process contains garbage, gets stuck in loops, etc., the final answer will also be garbage. I've seen sessions where the model approximates the correct answer in the first "reasoning" step, but then sabotages it with senseless "But wait!" follow-up steps. The final answer ends up being a mangled mess of all the garbage it generated in the "reasoning" phase.
The only reason we keep anthropomorphizing these tools is because it makes us feel good. It's wishful thinking that markets well, gets investors buzzing, and grows the hype further. In reality, we're as close to artificial intelligence as we were a decade ago. What we do have are very good pattern matchers and probabilistic data generators that can leverage the enormous amount of compute we can throw at the problem. Which isn't to say that this can't be very useful, but ascribing human qualities to it only muddies the discussion.
>There's nothing "omniscient" or "dim-witted" about these tools
I disagree in that that seems quite a good way of describing them. All language is a bit inexact.
Also I don't buy that we are no closer to AI than ten years ago - there seems to be lots going on. Just because LLMs are limited doesn't mean we can't find or add other algorithms - I mean, look at AlphaEvolve for example https://www.technologyreview.com/2025/05/14/1116438/google-d...
>found a faster way to solve matrix multiplications—a fundamental problem in computer science—beating a record that had stood for more than 50 years
I figure it's hard to argue that that is not at least somewhat intelligent?
> I figure it's hard to argue that that is not at least somewhat intelligent?
The fact that this technology can be very useful doesn't imply that it's intelligent. My argument is about the language used to describe it, not about its abilities.
The breakthroughs we've had have come because there is a lot of utility in finding patterns in data, which humans aren't very good at. Many of our problems can be boiled down to this task. So when we have vast amounts of data and compute at our disposal, we can be easily impressed by results that seem impossible for humans.
But this is not intelligence. The machine has no semantic understanding of what the data represents. The algorithm is optimized for generating specific permutations of tokens that match something it previously saw and was rewarded for. Again, very useful, but there's no thinking or reasoning there. The model doesn't have an understanding of why the wolf can't be close to the goat, or how a cabbage tastes. It's trained on enough data and algorithmic tricks that its responses can fool us into thinking it does, but this is just an illusion of intelligence. This is why we need to constantly feed it more tricks so that it doesn't fumble with basic questions like how many "R"s are in "strawberry", or that it doesn't generate racially diverse but historically inaccurate images.
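(On the "strawberry" example specifically, part of the mechanical explanation is that the model never sees letters at all, only subword tokens. A quick way to see that, using tiktoken's cl100k_base encoding purely as an illustration:)

    import tiktoken

    enc = tiktoken.get_encoding("cl100k_base")
    ids = enc.encode("strawberry")
    print(ids)                             # a handful of token ids, not ten characters
    print([enc.decode([i]) for i in ids])  # subword chunks, e.g. something like ['str', 'aw', 'berry']
    # the model is asked to count 'r's in pieces it never observes as individual letters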
I imagine if you asked the LLM why the wolf can't be close to the goat it would give a reasonable answer. I realise it does it by using permutations of tokens, but I think you have to judge intelligence by the results rather than the mechanism, otherwise you could argue humans can't be intelligent because they are just a bunch of neurons that find patterns.
We have had programs that can give good answers to some hard questions for a very long time now. Watson won Jeopardy back in 2011, but it still wasn't very good at replacing humans.
So that isn't a good way to judge intelligence; computers are so fast and have so much data that you can make programs to answer just about anything pretty well, and LLMs are able to do that, just more automatically. But they still don't automate the logical parts, just the lookup of knowledge: we don't know how to train large logic models, just large language models.
LLMs are not the only model type though? There's a plethora of architectures and combinations being researched. And even transformers start to be able to do cool sh1t on knowledge graphs; also interesting is progress on autoregressive physics PDE (partial differential equations) models. And it can't be too long until some providers of actual biological neural nets show up on OpenRouter (probably a lot less energy- and capital-intensive to scale up brain goo in tanks compared to gigawatt GPU clusters). Combine that zoo of "AI" specimens using M2M, MCP, etc., and the line between mock and "true"intelligence will blur, escalating our feeble species into ASI territory.. good luck to us.
> There's a plethora of architectures and combinations being researched
There were a plethora of architectures and combinations being researched before LLMs, and it still took a very long time to find the LLM architecture.
> the line between mock and "true"intelligence will blur
Yes, I think this will happen at some point. The question is how long it will take, not if it will happen.
The only thing that can stop this is if intermediate AI is good enough to give every human a comfortable life but still isn't good enough to think on its own.
It's easy to imagine such an AI being developed: imagine a model that can learn to mimic humans at any task, but still cannot update itself without losing those skills and becoming worse. Such an AI could be trained to perform every job on earth, as long as we don't care about progress.
If such an AI is developed, and we don't quickly solve the remaining problems to get an AI to be able to progress science on its own, it's likely our progress entirely stalls there, as humans will no longer have a reason to go to school to advance science.
I am not sure we are on the same page: the point of my response is that this paper is not enough to prevent exactly the argument you just made.
In any event, if you want to take umbrage at this paper, I think we will need to back up a bit. The authors use a mostly-standardized definition of "reasoning", which is widely accepted enough to support not just one, but several of their papers, in some of the best CS conferences in the world. I actually think you are right that it is reasonable to question this definition (and some people do), but I think it's going to be really hard for you to start that discussion here without (1) saying what your definition specifically is, and (2) justifying why it's better than theirs. Or at the very least, borrowing one from a well-known critique like, e.g., Gebru's, Bender's, etc.
> Computers can't think and submarines can't swim.
But if you need a submarine that can swim as agilely as a fish, then we still aren't there yet; fish are far superior to submarines in many ways. So submarines might be faster than fish, but there are so many maneuvers that fish can do that the submarine can't. It's the same here with thinking.
So just like computers are better than humans at multiplying numbers, there are still many things we need human intelligence for, even in today's era of LLMs.
The point here (which is from a quote by Dijkstra) is that if the desired result is achieved (movement through water) it doesn't matter if it happens in a different way than we are used to.
So if an LLM generates working code, correct translations, valid points relating to complex matters and so on it doesn't matter if it does so by thinking or by some other mechanism.
> if the desired result is achieved (movement through water) it doesn't matter if it happens in a different way than we are used to
But the point is that the desired result isn't achieved, we still need humans to think.
So we still need a word for what humans do that is different from what LLM does. If you are saying there is no difference then how do you explain the vast difference in capability between humans and LLM models?
Submarines and swimming is a great metaphor for this, since submarines clearly don't swim and thus have very different abilities in water: way better in some ways but way worse in others. So using that metaphor, it's clear that LLM "thinking" cannot be described with the same words as human thinking, since it's so different.
> I think AI maximalists will continue to think that the models are in fact getting less dim-witted
I'm bullish (and scared) about AI progress precisely because I think they've only gotten a little less dim-witted in the last few years, but their practical capabilities have improved a lot thanks to better knowledge, taste, context, tooling etc.
What scares me is that I think there's a reasoning/agency capabilities overhang. ie. we're only one or two breakthroughs away from something which is both kinda omniscient (where we are today), and able to out-think you very quickly (if only through dint of applying parallelism to actually competent outcome-modelling and strategic decision making).
That combination is terrifying. I don't think enough people have really imagined what it would mean for an AI to be able to out-strategise humans in the same way that they can now — say — out-poetry humans (by being both decent in terms of quality and super fast). It's like when you're speaking to someone way smarter than you and you realise that they're 6 steps ahead, and actively shaping your thought process to guide you where they want you to end up. At scale. For everything.
This exact thing (better reasoning + agency) is also the top priority for all of the frontier researchers right now (because it's super useful), so I think a breakthrough might not be far away.
Another way to phrase it: I think today's LLMs are about as good at snap judgements in most areas as the best humans (probably much better at everything that rhymes with inferring vibes from text), but they kinda suck at:
1. Reasoning/strategising step-by-step for very long periods
2. Snap judgements about reasoning or taking strategic actions (in the way that expert strategic humans don't actually need to think through their actions step-by-step very often - they've built intuition which gets them straight to the best answer 90% of the time)
Getting good at the long range thinking might require more substantial architectural changes (eg. some sort of separate 'system 2' reasoning architecture to complement the already pretty great 'system 1' transformer models we have). OTOH, it might just require better training data and algorithms so that the models develop good enough strategic taste and agentic intuitions to get to a near-optimal solution quickly before they fall off a long-range reasoning performance cliff.
Of course, maybe the problem is really hard and there's no easy breakthrough (or it requires 100,000x more computing power than we have access to right now). There's no certainty to be found, but a scary breakthrough definitely seems possible to me.
I think you are right, and that the next step function can be achieved using the models we have, either by scaling the inference, or changing the way inference is done.
People are doing all manner of very sophisticated inferency stuff now - it just tends to be extremely expensive for now and... people are keeping it secret.
If it was good enough to replace people then it wouldn't be too expensive: they would have launched it, replaced a bunch of people, and made trillions of dollars by now.
So at best their internal models are still just performance multipliers unless some breakthrough happened very recently, it might be a bigger multiplier but that still keeps humans with jobs etc and thus doesn't revolutionize much.
I am not sure if you mean this to refute something in what I've written but to be clear I am not arguing for or against what the authors think. I'm trying to state why I think there is a disconnect between them and more optimistic groups that work on AI.
I think that commenter was disagreeing with this line:
> because omniscient-yet-dim-witted models terminate at "superhumanly assistive"
It might be that with dim wits + enough brute force (knowledge, parallelism, trial-and-error, specialisation, speed) models could still substitute for humans and transform the economy in short order.
Sorry, I can't edit it any more, but what I was trying to say is that if the authors are correct, that this distinction is philosophically meaningful, then that is the conclusion. If they are not correct, then all their papers on this subject are basically meaningless.
A slightly less cynical take is that they want to temper expectations for the capabilities of LLMs in people’s day-to-day lives, specifically in the context of Apple products. A “smarter Siri” is never going to be an autonomous personal assistant à la Jarvis from Iron Man, which seems to be where a lot of investors think things are going. That tracks with this [0] preprint also released by Apple a few months ago.
A slightly more cynical take is that you’re absolutely correct, and making excuses for weak machine learning prowess has long been an Apple tenet. Recall that Apple never made privacy a core selling point until it was clear that Siri was years behind Google’s equivalent, which Apple then retroactively tried to justify by claiming “we keep your data private so we can’t train on it the way Google can.”
[0] https://arxiv.org/pdf/2410.05229
Everyone has an agenda. Companies like OpenAI and Anthropic are incentivized to overstate the capabilities of LLMs, so it’s not like they’re any less biased.
To be fair, the technology sigmoid curve rises fastest right around its inflection point, so it is hard to predict at what point innovation slows down, due to its very nature.
The first Boeing 747 was rolled out in 1968, only 65 years after the first successful heavier-than-air flight. If you told people back then that not much will fundamentally change in civil aviation over the next 57 years, no one would have believed you.
AGI has always been "just around the corner", ever since computers were invented.
Some problems have become more tractable (e.g. language translation), mostly by lowering our expectations of what constitutes a "solution", but AGI is nowhere nearer. AGI is a secular millenarian religion.
Waymo is a popular argument in the self-driving car debate, and they do well.
However, Waymo is the Deep Blue of self-driving cars: doing very well in a closed space. As a result of this geofencing, they have effectively exhausted their search space, hence they work well as a consequence of the lack of surprises.
AI works well when the search space is limited, but General AI in any category needs to handle a vastly larger search space, and there it falls flat.
At the end of the day, AI is informed search. They get inputs, and generate a suitable output as deemed by their trainers.
> the easy part is done but the hard part is so hard it takes years to progress
There is also no guarantee of continued progress to a breakthrough.
We have been through several "AI Winters" before where promising new technology was discovered and people in the field were convinced that the breakthrough was just around the corner and it never came.
LLMs aren't quite the same situation as they do have some undeniable utility to a wide variety of people even without AGI springing out of them, but the blind optimism that surely progress will continue at a rapid pace until the assumed breakthrough is realized feels pretty familiar to the hype cycle preceding past AI "Winters".
> We have been through several "AI Winters" before
Yeah, remember when we spent 15 years (~2000 to ~2015) calling it “machine learning” because AI was a bad word?
We use so much AI in production every day but nobody notices because as soon as a technology becomes useful, we stop calling it AI. Then it’s suddenly “just face recognition” or “just product recommendations” or “just [plane] autopilot” or “just adaptive cruise control” etc
You know a technology isn’t practical yet because it’s still being called AI.
AI encompasses a wide range of algorithms and techniques, not just LLMs or neural nets. Also, it is worth pointing out that the definition of AI has changed drastically over the last few years and narrowed pretty significantly. If you’re viewing the definition from the '80s and '90s, most of what we call "automation" today would have been considered AI.
Autopilots were a thing before computers were a thing; you can implement one using mechanics and control theory. So no, traditional autopilots are not AI under any reasonable definition, otherwise every single machine we build would be considered AI, as almost all machines have some form of control system in them - for example, is your microwave clock an AI?
So I'd argue any algorithm that comes from control theory is not AI; those are just basic old dumb machines. You can't make planes without control theory - humans can't keep a plane steady without it - so the Wright brothers adding this to their plane is why they succeeded in making a flying machine.
So if autopilots are AI, then the Wright brothers developed an AI to control their plane. I don't think anyone sees that as AI, not even at the time they did the first flight.
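For reference, the control-theory machinery in question is tiny; a bare-bones PID loop (the workhorse behind classic autopilots and cruise control) contains nothing learned or searched. A toy sketch with made-up gains and a crude plant:

    class PID:
        """Minimal PID controller: u = Kp*e + Ki*integral(e) + Kd*de/dt."""
        def __init__(self, kp, ki, kd):
            self.kp, self.ki, self.kd = kp, ki, kd
            self.integral = 0.0
            self.prev_error = 0.0

        def update(self, setpoint, measurement, dt):
            error = setpoint - measurement
            self.integral += error * dt
            derivative = (error - self.prev_error) / dt
            self.prev_error = error
            return self.kp * error + self.ki * self.integral + self.kd * derivative

    # toy altitude-hold loop: treat the controller output as a climb rate
    pid, altitude = PID(kp=0.5, ki=0.05, kd=0.1), 900.0
    for _ in range(500):
        altitude += pid.update(setpoint=1000.0, measurement=altitude, dt=0.1) * 0.1
    print(round(altitude, 1))  # settles at ~1000.0, the setpoint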
Even if they never get better than they are today (unlikely) they are still the biggest change in software development and the software development industry in my 28 year career.
What do you think has changed? The situation is still about as promising for AGI in a few years - if not more so. Papers like this are the academics mapping out where the engineering efforts need to be directed to get there, and it seems to be a relatively small number of challenges that are easier than the ones already overcome - we know machine learning can solve Towers of Hanoi, for example. It isn't fundamentally complicated like Baduk is. The next wall to overcome is more of a low fence.
Besides, AI already passes the Turing test (or at least, is most likely to fail because it is too articulate and reasonable). There is a pretty good argument we've already achieved AGI and now we're working on achieving human- and superhuman-level intelligence in AGI.
> What do you think has changed? The situation is still about as promising for AGI in a few years - if not more so
It's better today. Hoping that LLMs can get us to AGI in one hop was naive. Depending on the definition of AGI we might be already there. But for superhuman level in all possible tasks there are many steps to be done. The obvious way is to find a solution for each type of task. We already have one for math calculations: using tools. Many other types can be solved the same way. After a while we'll gradually get to a well-rounded 'brain', or model(s) + support tools.
So, so far the future looks bright: there is progress, there are problems, but no deadlocks.
PS: Turing test is a <beep> nobody seriously talks about today.
All the environments they test (Tower of Hanoi, Checkers Jumping, River Crossing, Blocks World) could easily be solved perfectly by any of the LLMs if the authors had allowed them to write code.
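For instance, Tower of Hanoi, which the paper pushes to high disk counts, is a three-line recursion if a program is an acceptable answer instead of a move-by-move trace:

    def hanoi(n: int, src: str = "A", aux: str = "B", dst: str = "C") -> list[tuple[str, str]]:
        """Return the optimal move list (2**n - 1 moves) for n disks from src to dst."""
        if n == 0:
            return []
        return hanoi(n - 1, src, dst, aux) + [(src, dst)] + hanoi(n - 1, aux, src, dst)

    print(hanoi(3))        # [('A', 'C'), ('A', 'B'), ('C', 'B'), ('A', 'C'), ('B', 'A'), ('B', 'C'), ('A', 'C')]
    print(len(hanoi(10)))  # 1023 == 2**10 - 1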
I don't really see how this is different from "LLMs can't multiply 20 digit numbers"--which btw, most humans can't either. I tried it once (using pen and paper) and consistently made errors somewhere.
> I don't really see how this is different from "LLMs can't multiply 20 digit numbers"--which btw, most humans can't either. I tried it once (using pen and paper) and consistently made errors somewhere.
People made missiles and precision engineering like jet aircraft before we had computers; humans can do all of those things reliably just by spending more time thinking about it, inventing better strategies, and using more paper.
Our brains weren't made to do such computations, but a general intelligence can solve the problem anyway by using what it has in a smart way.
Some specialized people could probably do 20x20, but I'd still expect them to make a mistake at 100x100. The level we needed for spacecraft was much less than that, and we had many levels of checks to help catch errors afterwards.
I'd wager that 95% of humans wouldn't be able to do 10x10 multiplication without errors, even if we paid them $100 to get it right.
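(The pen-and-paper procedure itself is purely mechanical, which is rather the point: the hard part for a human is executing hundreds of tiny steps without a single slip. A sketch of that grade-school algorithm:)

    def long_multiply(a: str, b: str) -> str:
        """Grade-school multiplication on decimal digit strings, no big-int shortcuts."""
        result = [0] * (len(a) + len(b))
        for i, da in enumerate(reversed(a)):           # accumulate each digit product in its column
            for j, db in enumerate(reversed(b)):
                result[i + j] += int(da) * int(db)
        for k in range(len(result) - 1):               # then resolve the carries column by column
            result[k + 1] += result[k] // 10
            result[k] %= 10
        return "".join(map(str, reversed(result))).lstrip("0") or "0"

    x, y = "12345678901234567890", "98765432109876543210"
    assert long_multiply(x, y) == str(int(x) * int(y))  # cross-check against Python's exact big ints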
There's a reason we had to invent lots of machines to help us.
It would be an interesting social studies paper to try and recreate some "LLMs can't think" papers with humans.
> There's a reason we had to invent lots of machines to help us.
The reason was efficiency, not that we couldn't do it. If a machine can do it then we don't need expensive humans to do it, so human time can be used more effectively.
Right, and when we have AI that can do the same with millions/billions of computers then we can replace humans.
But as long as AI cannot do that they cannot replace humans, and we are very far from that. Currently AI cannot even replace individual humans in most white collar jobs, and replacing an entire team is way harder than replacing an individual, and then even harder is replacing workers in an entire field, meaning the AI has to make research and advances on its own, etc.
So like, we are still very far from AI completely being able to replace human thinking and thus be called AGI.
Or in other words, AI has to replace those giants to be able to replace humanity, since those giants are humans.
>Large Language Model as a Policy Teacher for Training Reinforcement Learning Agents
>In this paper, we introduce a novel framework that addresses these challenges by training a smaller, specialized student RL agent using instructions from an LLM-based teacher agent. By incorporating the guidance from the teacher agent, the student agent can distill the prior knowledge of the LLM into its own model. Consequently, the student agent can be trained with significantly less data. Moreover, through further training with environment feedback, the student agent surpasses the capabilities of its teacher for completing the target task.
The reasons humans can't and the reasons LLMs can't are completely different though. LLMs are often incapable of performing multiplication. Many humans just wouldn't care to do it.
> We found that LRMs have limitations in exact computation: they fail to use explicit algorithms and reason inconsistently across puzzles.
It seems that AI LLMs/LRMs need help from their distant cousins, namely logic, optimization, and constraint programming, which can be attributed as intelligent automation or IA [1],[2],[3],[4].
[1] Logic, Optimization, and Constraint Programming: A Fruitful Collaboration - John Hooker - CMU (2023) [video]:
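As a tiny illustration of what the classical cousins buy you: the river-crossing puzzle from the paper is a few lines of exhaustive state-space search, with correctness guaranteed by construction rather than sampled (a plain BFS sketch, not any particular CP library):

    from collections import deque

    ITEMS = ("farmer", "wolf", "goat", "cabbage")
    FORBIDDEN = [{"wolf", "goat"}, {"goat", "cabbage"}]  # pairs that cannot be left alone

    def safe(left):  # left: frozenset of items on the left bank
        for bank in (left, set(ITEMS) - left):
            if "farmer" not in bank and any(pair <= bank for pair in FORBIDDEN):
                return False
        return True

    def solve():
        start, goal = frozenset(ITEMS), frozenset()
        queue, seen = deque([(start, [])]), {start}
        while queue:
            state, path = queue.popleft()
            if state == goal:
                return path
            here = state if "farmer" in state else frozenset(ITEMS) - state
            for passenger in [None] + [x for x in here if x != "farmer"]:
                moved = {"farmer"} | ({passenger} if passenger else set())
                nxt = frozenset(state - moved if "farmer" in state else state | moved)
                if safe(nxt) and nxt not in seen:
                    seen.add(nxt)
                    queue.append((nxt, path + [passenger or "nothing"]))

    print(solve())  # 7 crossings, e.g. ['goat', 'nothing', 'wolf', 'goat', 'cabbage', 'nothing', 'goat']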
Their finding of LLMs working best at simple tasks, LRMs working best at medium complexity tasks, and then neither succeeding at actually complex tasks is good to know.
When I use a normal LLM, I generally try to think "would I be able to do this without thinking, if I had all the knowledge, but just had to start typing and go?".
With thinking LLMs, they can think, but they often can only think in one big batch before starting to "speak" their true answer. I think that needs to be rectified so they can switch between the two. In my previous framework, I would say "would I be able to solve this if I had all the knowledge, but could only think and then start typing?"
I think for larger problems, the answer to this is no. I would need paper/a whiteboard. That's what would let me think, write, output, iterate, draft, iterate. And I think that's where agentic AI seems to be heading.
> show humans are capable of what you just defined as generalizable reasoning.
I would also add "and plot those capabilities on a curve". My intuition is that the SotA models are already past the median human abilities in a lot of areas.
In the context of this paper, I think "generalizable reasoning" means finding a method to solve the puzzle and thus being able to execute the method on puzzle instances of arbitrary complexity.
In figure 1 bottom-right they show how the correct answers are being found later as the complexity goes higher.
In the description they even state that in false responses the LRM often focuses on a wrong answer early and then runs out of tokens before being able to self-correct.
This seems obvious and indicates that it's simply a matter of scaling (a bigger token budget would lead to better abilities for more complex tasks). Am I missing something?
These are the kind of studies that make so much more sense than the "LLMs can't reason because of this ideological argument or this one anecdote" posts/articles. Keep 'em coming!
And also; the frontier LLMs blow older LLMs out of the water. There is continual progress and this study would have been structured substantially the same 2 years ago with much smaller N on the graphs because the regimes were much tinier then.
I don't know that I would call it an "illusion of thinking", but LLMs do have limitations. Humans do too. No amount of human thinking has solved numerous open problems.
The errors that LLMs make and the errors that people make are probably not comparable enough in a lot of the discussions about LLM limitations at this point?
We have different failure modes. And I'm sure researchers, faced with these results, will be motivated to overcome these limitations. This is all good, keep it coming. I just don't understand some of the naysaying here.
The naysayers just say that even when people are motivated to solve a problem, the problem might still not get solved. And there are still unsolved problems with LLMs; the AI hypemen say AGI is all but a given in a few years' time, but if that relies on some undiscovered breakthrough, that is very unlikely, since such breakthroughs are very rare.
I have a somewhat similar point of view to the one voiced by other people, but I like to think about it slightly differently, so I'll chime in - here's my take (although, admittedly, I'm operating with a quite small reasoning budget (5 minutes tops)):
Time and again, for centuries - with the pace picking up dramatically in recent decades - we thought we were special and we were wrong. The sun does not revolve around the earth, which is a pretty typical planet, with the same chemical composition as any other planet. All of a sudden we're not the only ones who could calculate, then solve symbolic equations, then play chess, then compose music, then talk, then reason (up to a point, for some definition of "reason"). You get my point.
And when we were not only matched, but dramatically surpassed in these tasks (and not a day earlier), we concluded that they weren't _really_ what made us special.
At this point, it seems to me reasonable to assume we're _not_ special, and the onus should be on anybody claiming that we are to at least attempt to mention in passing what the secret sauce is that we have (even if we can't quite say what it is without handwaving or using concepts that by definition cannot be defined - "qualia is the indescribable feeling of red - its redness?").
Oh, and sorry, I could never quite grasp what "sentient" is supposed to mean - would we be able to tell we're not sentient if we weren't?
This analogy doesn’t really work, because the former examples are ones in which humanity discovered that it existed in a larger world.
The recent AI example is humanity building, or attempting to build, a tool complex enough to mimic a human being.
If anything, you could use recent AI developments as proof of humanity’s uniqueness - what other animal is creating things of such a scale and complexity?
I can give you a pretty wild explanation. Einstein was a freak of nature. Nature just gave him that "something" to figure out the laws of the universe. I'm avoiding the term God as to not tickle anyone incorrectly. Seriously, explain what schooling and environment gets you that guy. So, to varying degrees, all output is from the universe. It's hard for the ego to accept, surely we earned everything we ever produced ...
I wrote my first MLP 25 years ago, after repeating some early experiments in machine learning from 20 years before that. One of the experiments I repeated was in text to speech. It was amazing to set up training runs and return after several hours to listen to my supercomputer babble like a toddler. I literally recall listening and being unable to distinguish the output of my NN from that of a real toddler; I happened to be teaching my niece to read around that same time. And when the NN had gained a large vocabulary, such that it could fairly proficiently read aloud, I was convinced that I had found my PhD project and a path to AGI.
Further examination and discussion with more experienced researchers gave me pause. They said that one must have a solution, or a significant new approach toward solving the hard problems associated with a research project for it to be viable, otherwise time (and money) is wasted finding new ways to solve the easy problems.
This is a more general principle that can be applied to most areas of endeavour. When you set about research and development that involves a mix of easy, medium, and hard problems, you must solve the hard problems first otherwise you blow your budget finding new ways to solve the easy problems, which nobody cares about in science.
But "AI" has left the realm of science behind and entered the realm of capitalism where several years of meaningless intellectual gyration without ever solving a hard problem may be quite profitable.
This kind of explains why Claude will find the right solution, but then the more it thinks and keeps “improving” the more over-engineered (and sometimes wrong) the solution is. Interesting to see this coming up in formal research.
Well that's not a very convincing argument. That's just a failure to recognize when the use of a tool - a Base64 decoder - is needed, not a reasoning problem at all, right?
Translating to BASE64 is a good test to see how well it works as a language translator without changing things, because it's the same skill for an AI model.
If the model changes things it means it didn't really capture the translation patterns for BASE64, so then who knows what it will miss when translating between languages if it can't even do BASE64?
A moderately smart human who understands how Base64 works can decode it by hand without external tools other than pen and paper. Coming up with the exact steps to perform is a reasoning problem.
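The by-hand procedure really is short, which is what makes it a decent reasoning probe: look up each character's 6-bit value, concatenate the bits, and re-read them 8 at a time. A sketch that deliberately avoids the base64 module:

    ALPHABET = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/"

    def b64_decode_by_hand(s: str) -> str:
        bits = "".join(f"{ALPHABET.index(c):06b}" for c in s if c != "=")
        usable = len(bits) - len(bits) % 8               # drop the padding leftover
        return "".join(chr(int(bits[i:i + 8], 2)) for i in range(0, usable, 8))

    print(b64_decode_by_hand("SGVsbG8sIHdvcmxkIQ=="))    # Hello, world!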
I don't know whether Flash uses a tool or not, but it answers pretty quickly. However, Pro opts to use its own reasoning, not a tool. When I look at the reasoning train, it pulls and pulls knowledge endlessly, refining that knowledge and drifting away.
That's not really a cop out here: both models had access to the same tools.
Realistically there are many problems that non-reasoning models do better on, especially when the answer cannot be solved by a thought process: like recalling internal knowledge.
You can try to teach the model the concept of a problem where thinking will likely steer it away from the right answer, but at some point it becomes like the halting problem... how does the model reliably think its way into the realization a given problem is too complex to be thought out?
This is easily explained by accepting that there is no such thing as LRMs. LRMs are just LLMs that iterate on its own answers more (or provides itself more context information of a certain type). The reasoning loop on an "LRM" will be equivalent to asking a regular LLM to "refine" its own response, or "consider" additional context of a certain type. There is no such thing as reasoning basically, as it was always a method to "fix" hallucinations or provide more context automatically, nothing else. These big companies baked in one of the hackiest prompt engineering tricks that your typical enthusiast figured out long ago and managed to brand it and profit off it. The craziest part about this was Deepseek was able to cause a multi billion dollar drop and pump of AI stocks with this one trick. Crazy times.
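For what it's worth, the "iterate on its own answers" loop being described can be sketched in a few lines; call_model here is a hypothetical stand-in for whatever chat API you use, not a real client:

    def call_model(prompt: str) -> str:
        """Hypothetical LLM call; swap in your actual client."""
        raise NotImplementedError

    def answer_with_reflection(question: str, rounds: int = 2) -> str:
        draft = call_model(f"Answer the question:\n{question}")
        for _ in range(rounds):
            critique = call_model(f"List mistakes or gaps in this answer:\n{draft}")
            draft = call_model(
                f"Question:\n{question}\n\nDraft answer:\n{draft}\n\n"
                f"Critique:\n{critique}\n\nWrite an improved answer."
            )
        return draft  # whether trained "reasoning" is equivalent to this loop is exactly what's disputed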
Yep. This is exactly the conclusion I reached as an RLHF'er. Reasoning/LRM/SxS/CoT is "just" more context. There never was reasoning. But of course, more context can be good.
The million dollar question is how far can one get on this trick. Maybe this is exactly how our own brains operate? If not, what fundamental building blocks are missing to get there.
I am not too familiar with the latest hype, but "reasoning" has a very straightforward definition in my mind: can the program in question derive new facts from old ones in a logically sound manner? Things like applying modus ponens: (A and (A => B)) => B. Or: all men are mortal, and Socrates is a man, therefore Socrates is mortal. If the program cannot deduce new facts, then it is not reasoning, at least not by my definition.
People then say "of course it could do that, it just pattern matched a logic textbook. I meant in a real example, not an artificially constructed one like this one. In a complex scenario LLMs obviously can't do modus ponens."
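By that definition, the bar is something like a forward-chaining loop: given facts and implications, keep applying modus ponens until nothing new falls out. A toy sketch:

    def forward_chain(facts: set[str], rules: list[tuple[str, str]]) -> set[str]:
        """Repeatedly apply modus ponens: from A and (A => B), derive B."""
        derived, changed = set(facts), True
        while changed:
            changed = False
            for antecedent, consequent in rules:
                if antecedent in derived and consequent not in derived:
                    derived.add(consequent)
                    changed = True
        return derived

    rules = [("socrates_is_a_man", "socrates_is_mortal"),
             ("socrates_is_mortal", "socrates_will_die")]
    print(forward_chain({"socrates_is_a_man"}, rules))
    # {'socrates_is_a_man', 'socrates_is_mortal', 'socrates_will_die'} (set order may vary)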
A reasoning model is an LLM that has had additional training phases that reward problem solving abilities. (But in a black box way - it’s not clear if the model is learning actual reasoning or better pattern matching, or memorization, or heuristics… maybe a bit of everything).
It is a trap to consider that first principles perspective is sufficient. Like, if I tell you before 1980: "iterate over f(z) = z^2 + c" there is no way you are going to guess fractals emerge. Same with the rules for Conway's Game of Life - seeing the code you won't guess it makes gliders and guns.
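That first example can be made concrete in a handful of lines; nothing in the update rule hints at the boundary structure that falls out of it (a crude sketch):

    def escapes(c: complex, max_iter: int = 50) -> bool:
        """Iterate z -> z*z + c from z=0; points that never blow up belong to the Mandelbrot set."""
        z = 0j
        for _ in range(max_iter):
            z = z * z + c
            if abs(z) > 2:
                return True
        return False

    # crude ASCII rendering: '#' marks points that stay bounded
    for y in range(-10, 11):
        print("".join("#" if not escapes(complex(x / 20, y / 10)) else "." for x in range(-40, 21)))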
My point is that recursion creates its own inner opacity, it is irreducible, so knowing a recursive system behavior from its iteration code alone is insufficient. It is a becoming not a static thing. That's also why we have the halting problem - recursion.
Reasoning is a recursive process too, you can't understand it from analyzing its parts.
Equivalent: "I'm so sick of these [atheist] cretins proclaiming that [god doesn't exist], and when you drill down to first principles, these same people do not have a rigorous definition of [god] in the first place!"
That's nonsense, because the people obligated to furnish a "rigorous definition" are the people who make the positive claim that something specific is happening.
Also, the extraordinary claims are the ones that require extraordinary evidence, not the other way around.
Your god analogy is clumsy in this case. We aren't talking about something fantastical here. Reasoning is not difficult to define. We can go down that road if you'd like. Rather, the problem is, once you do define it, you will quickly find that LLMs are capable of it. And that makes human exceptionalists a bit uncomfortable.
"It's just statistics!", said the statistical-search survival-selected statistically-driven-learning biological Rube Goldberg contraption, between methane emissions. Imagining that a gradient learning algorithm, optimizing its statistical performance, is the same thing as only learning statistical relationships.
Incontrovertibly demonstrating a dramatic failure in its, and many of its kind's, ability to reason.
Reasoning exists on a spectrum, not as a binary property. I'm not claiming that LLMs reason identically to humans in all contexts.
You act as if statistical processes can’t ever scale into reasoning, despite the fact that humans themselves are gradient-trained statistical learners over evolutionary and developmental timescales.
> cretins proclaiming that LLMs aren't truly capable of reasoning
> Reasoning is not difficult to define
> Reasoning exists on a spectrum
> statistical processes [can] scale into reasoning
It seems like quite a descent here, starting with the lofty heights of condemning skeptics as "cretins" and insisting the definition is easy... down to what sounds like the introduction to a flavor of panpsychism [0], where even water flowing downhill is a "statistical process" which at enough scale would be "reasoning".
I don't think that's a faithful match to what other people mean [1] when they argue LLMs don't "reason."
Negative negs spit out low-effort snark; they said the same thing about solar, electric cars, even multicore, JIT, open source. Thanks for refuting them; the forum software itself should either quarantine the response or auto-respond before the comment is submitted. These people don't build the future.
Okay Apple, you got my attention. But I'm a strong proponent of the "something is better than nothing" philosophy—even if OpenAI/Google/etc. are building reasoning models with the limitations that you describe, they are still huge progress compared to what we had not long ago. Meanwhile you're not even trying.
It's so easy to criticize the works of others and not deliver anything. Apple—be Sam in Game of Thrones: "I'm tired of reading about the achievements of better men".
I think you're mistaking the work of researchers who work at Apple with the particular investment decisions of Apple over the past few years.
>It's so easy to criticize the works of others and not deliver anything. Apple—be Sam in Game of Thrones: "I'm tired of reading about the achievements of better men".
This is a patently absurd thing to write about a research paper.
But the point is that the desired result isn't achieved, we still need humans to think.
So we still need a word for what humans do that is different from what LLM does. If you are saying there is no difference then how do you explain the vast difference in capability between humans and LLM models?
Submarines and swimming is a great metaphor for this, since Submarines clearly doesn't swim and thus have very different abilities in water, its way better in some ways but way worse in other ways. So using that metaphor its clear that LLM "thinking" cannot be described with the same words as human thinking since its so different.
I'm bullish (and scared) about AI progress precisely because I think they've only gotten a little less dim-witted in the last few years, but their practical capabilities have improved a lot thanks to better knowledge, taste, context, tooling etc.
What scares me is that I think there's a reasoning/agency capabilities overhang. ie. we're only one or two breakthroughs away from something which is both kinda omniscient (where we are today), and able to out-think you very quickly (if only through dint of applying parallelism to actually competent outcome-modelling and strategic decision making).
That combination is terrifying. I don't think enough people have really imagined what it would mean for an AI to be able to out-strategise humans in the same way that they can now — say — out-poetry humans (by being both decent in terms of quality and super fast). It's like when you're speaking to someone way smarter than you and you realise that they're 6 steps ahead, and actively shaping your thought process to guide you where they want you to end up. At scale. For everything.
This exact thing (better reasoning + agency) is also the top priority for all of the frontier researchers right now (because it's super useful), so I think a breakthrough might not be far away.
Another way to phrase it: I think today's LLMs are about as good at snap judgements in most areas as the best humans (probably much better at everything that rhymes with inferring vibes from text), but they kinda suck at:
1. Reasoning/strategising step-by-step for very long periods
2. Snap judgements about reasoning or taking strategic actions (in the way that expert strategic humans don't actually need to think through their actions step-by-step very often - they've built intuition which gets them straight to the best answer 90% of the time)
Getting good at the long range thinking might require more substantial architectural changes (eg. some sort of separate 'system 2' reasoning architecture to complement the already pretty great 'system 1' transformer models we have). OTOH, it might just require better training data and algorithms so that the models develop good enough strategic taste and agentic intuitions to get to a near-optimal solution quickly before they fall off a long-range reasoning performance cliff.
Of course, maybe the problem is really hard and there's no easy breakthrough (or it requires 100,000x more computing power than we have access to right now). There's no certainty to be found, but a scary breakthrough definitely seems possible to me.
So at best, their internal models are still just performance multipliers unless some breakthrough happened very recently. It might be a bigger multiplier, but that still keeps humans in their jobs and thus doesn't revolutionize much.
> because omniscient-yet-dim-witted models terminate at "superhumanly assistive"
It might be that with dim wits + enough brute force (knowledge, parallelism, trial-and-error, specialisation, speed) models could still substitute for humans and transform the economy in short order.
A slightly more cynical take is that you’re absolutely correct, and making excuses for weak machine learning prowess has long been an Apple tenet. Recall that Apple never made privacy a core selling point until it was clear that Siri was years behind Google’s equivalent, which Apple then retroactively tried to justify by claiming “we keep your data private so we can’t train on it the way Google can.”
[0] https://arxiv.org/pdf/2410.05229
The first Boeing 747 was rolled out in 1968, only 65 years after the first successful heavier-than-air flight. If you told people back then that not much will fundamentally change in civil aviation over the next 57 years, no one would have believed you.
Some problems have become more tractable (e.g. language translation), mostly by lowering our expectations of what constitutes a "solution", but AGI is nowhere nearer. AGI is a secular millenarian religion.
However, Waymo is the Deep Blue of self-driving cars: it does very well in a closed space. As a result of the geofencing, they have effectively exhausted their search space, so they work well because there are few surprises.
AI works well when the search space is limited, but general AI in any category has to handle a vastly larger search space, and there these systems fall flat.
At the end of the day, AI is informed search: it takes inputs and generates an output deemed suitable by its trainers.
The easy part is done, but the hard part is so hard it takes years to make progress.
There is also no guarantee of continued progress to a breakthrough.
We have been through several "AI Winters" before where promising new technology was discovered and people in the field were convinced that the breakthrough was just around the corner and it never came.
LLMs aren't quite the same situation, as they do have undeniable utility for a wide variety of people even without AGI springing out of them, but the blind optimism that progress will surely continue at a rapid pace until the assumed breakthrough arrives feels a lot like the hype cycles that preceded past AI "Winters".
Yeah, remember when we spent 15 years (~2000 to ~2015) calling it “machine learning” because AI was a bad word?
We use so much AI in production every day but nobody notices because as soon as a technology becomes useful, we stop calling it AI. Then it’s suddenly “just face recognition” or “just product recommendations” or “just [plane] autopilot” or “just adaptive cruise control” etc
You know a technology isn’t practical yet because it’s still being called AI.
So I'd argue any algorithm that comes from control theory is not AI; those are just basic old dumb machines. You can't make planes without control theory, and humans can't keep a plane steady without it, so the Wright brothers adding it to their plane is why they succeeded in making a flying machine.
So if autopilots are AI, then the Wright brothers developed an AI to control their plane. I don't think anyone sees that as AI, not even at the time of the first flight.
Besides, AI already passes the Turing test (or at least, is most likely to fail because it is too articulate and reasonable). There is a pretty good argument we've already achieved AGI and now we're working on achieving human- and superhuman-level intelligence in AGI.
It's better today. Hoping that LLMs could get us to AGI in one hop was naive. Depending on the definition of AGI, we might already be there. But reaching superhuman level on all possible tasks will take many more steps. The obvious way is to find a solution for each type of task. We already have one for math calculations: tool use. Many other types of task can be handled the same way. After a while we'll gradually get to a well-rounded 'brain', or model(s) plus support tools.
So far the future looks bright: there is progress, and there are problems, but no deadlocks.
PS: The Turing test is a <beep> nobody seriously talks about today.
I don't really see how this is different from "LLMs can't multiply 20-digit numbers", which, by the way, most humans can't do either. I tried it once (using pen and paper) and consistently made errors somewhere.
People built missiles and did precision engineering like jet aircraft before we had computers; humans can do all of those things reliably just by spending more time thinking, inventing better strategies, and using more paper.
Our brains weren't made to do such computations, but a general intelligence can solve the problem anyway by using what it has in a smart way.
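To make "better strategies and more paper" concrete, here is a minimal sketch of my own, in Python, of the schoolbook procedure; the language's exact big-integer arithmetic is used only as the check:

    def schoolbook_multiply(a: str, b: str) -> str:
        # Long multiplication exactly as done on paper: least-significant digit first,
        # one row per digit of a, carries propagated as you go.
        digits_a = [int(d) for d in reversed(a)]
        digits_b = [int(d) for d in reversed(b)]
        result = [0] * (len(digits_a) + len(digits_b))
        for i, da in enumerate(digits_a):
            carry = 0
            for j, db in enumerate(digits_b):
                total = result[i + j] + da * db + carry
                result[i + j] = total % 10
                carry = total // 10
            result[i + len(digits_b)] += carry
        return "".join(str(d) for d in reversed(result)).lstrip("0") or "0"

    a, b = "12345678901234567890", "98765432109876543210"
    assert schoolbook_multiply(a, b) == str(int(a) * int(b))  # the machine agrees with the paper method

The procedure is boring and mechanical, which is exactly the point: the hard part for a human (or a sampled token stream) is not inventing it, it is executing a few hundred small steps without a single slip.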
I'd wager that 95% of humans wouldn't be able to do a 10-digit-by-10-digit multiplication without errors, even if we paid them $100 to get it right. There's a reason we had to invent lots of machines to help us.
It would be an interesting social studies paper to try and recreate some "LLMs can't think" papers with humans.
The reason was efficiency, not that we couldn't do it. If a machine can do it then we don't need expensive humans to do it, so human time can be used more effectively.
But as long as AI cannot do that, it cannot replace humans, and we are very far from that. Currently AI cannot even replace individual humans in most white-collar jobs; replacing an entire team is much harder than replacing an individual; and harder still is replacing the workers of an entire field, which would require the AI to do research and make advances on its own.
So we are still very far from AI being able to fully replace human thinking and thus earn the name AGI.
Or in other words, AI has to replace those giants to be able to replace humanity, since those giants are humans.
>In this paper, we introduce a novel framework that addresses these challenges by training a smaller, specialized student RL agent using instructions from an LLM-based teacher agent. By incorporating the guidance from the teacher agent, the student agent can distill the prior knowledge of the LLM into its own model. Consequently, the student agent can be trained with significantly less data. Moreover, through further training with environment feedback, the student agent surpasses the capabilities of its teacher for completing the target task.
https://arxiv.org/abs/2311.13373
The reasons humans can't and the reasons LLMs can't are completely different though. LLMs are often incapable of performing multiplication. Many humans just wouldn't care to do it.
It seems that LLMs/LRMs need help from their distant cousins, namely logic, optimization, and constraint programming, which can be described as intelligent automation, or IA [1],[2],[3],[4]. (A minimal sketch with one of these tools follows the references.)
[1] Logic, Optimization, and Constraint Programming: A Fruitful Collaboration - John Hooker - CMU (2023) [video]:
https://www.youtube.com/live/TknN8fCQvRk
[2] "We Really Don't Know How to Compute!" - Gerald Sussman - MIT (2011) [video]:
https://youtube.com/watch?v=HB5TrK7A4pI
[3] Google OR-Tools:
https://developers.google.com/optimization
[4] MiniZinc:
https://www.minizinc.org/
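To give a flavour of what those cousins provide, here is a minimal sketch (my own example, not taken from the talks above) using the OR-Tools CP-SAT solver [3]: two constraints over a small discrete space, solved exactly rather than sampled:

    # pip install ortools
    from ortools.sat.python import cp_model

    model = cp_model.CpModel()
    x = model.NewIntVar(0, 100, "x")
    y = model.NewIntVar(0, 100, "y")
    model.Add(x + y == 30)
    model.Add(x - y == 4)

    solver = cp_model.CpSolver()
    status = solver.Solve(model)
    if status in (cp_model.OPTIMAL, cp_model.FEASIBLE):
        print(solver.Value(x), solver.Value(y))  # 17 13

The division of labour being suggested is that the LLM translates a fuzzy problem statement into a model like this, and the solver does the part that has to be exactly right.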
Thinking LLMs can think, but they often can only think in one big batch before starting to "speak" their actual answer. I think that needs to change so they can switch between the two. In my earlier framework, I would ask: "would I be able to solve this if I had all the knowledge, but could only think once and then start typing?"
I think for larger problems, the answer to this is no. I would need paper/a whiteboard. That's what would let me think, write, output, iterate, draft, iterate. And I think that's where agentic AI seems to be heading.
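Something like the loop below is what I have in mind, purely as a hypothetical sketch: llm() stands in for any completion call (not a real API), and the draft/critique/revise structure is just one guess at how "think, write, iterate" could be wired up:

    def solve(task: str, llm, max_rounds: int = 4) -> str:
        # Alternate short bursts of drafting and critique instead of one
        # monolithic thinking pass followed by the final answer.
        draft = llm(f"Draft a first attempt at: {task}")
        for _ in range(max_rounds):
            critique = llm(f"Task: {task}\nDraft:\n{draft}\nList concrete problems, or reply DONE.")
            if "DONE" in critique:
                break
            draft = llm(f"Task: {task}\nDraft:\n{draft}\nProblems:\n{critique}\nRevise the draft.")
        return draft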
> Are these models capable of generalizable reasoning, or are they leveraging different forms of pattern matching?
Define reasoning, define generalizable, define pattern matching.
For additional credit, after you have done so, show that humans are capable of what you just defined as generalizable reasoning.
I would also add "and plot those capabilities on a curve". My intuition is that the SotA models are already past the median human abilities in a lot of areas.
And also: the frontier LLMs blow older LLMs out of the water. There is continual progress, and this study would have been structured substantially the same two years ago, just with much smaller N on the graphs, because the regimes were much tinier then.
Time and again, for centuries, and with the pace picking up dramatically in recent decades, we thought we were special, and we were wrong. The Sun does not revolve around the Earth, which is a pretty typical planet with the same chemical composition as any other. All of a sudden we're not the only ones who can calculate, then solve symbolic equations, then play chess, then compose music, then talk, then reason (up to a point, for some definition of "reason"). You get my point.
And when we were not only matched, but dramatically surpassed in these tasks (and not a day earlier), we concluded that they weren't _really_ what made us special.
At this point, it seems to me reasonable to assume we're _not_ special, and the onus should be on anybody claiming that we are to at least attempt to mention in passing what the secret sauce is that we have (even if we can't quite say what it is without handwaving, or without using concepts that by definition cannot be defined: "qualia is the indescribable feeling of red, its redness(?)").
Oh, and sorry, I could never quite grasp what "sentient" is supposed to mean - would we be able to tell we're not sentient if we weren't?
The recent AI example is humanity building, or attempting to build, a tool complex enough to mimic a human being.
If anything, you could use recent AI developments as proof of humanity’s uniqueness - what other animal is creating things of such a scale and complexity?
Spooky stuff.
Further examination and discussion with more experienced researchers gave me pause. They said that one must have a solution, or a significant new approach toward solving the hard problems associated with a research project for it to be viable, otherwise time (and money) is wasted finding new ways to solve the easy problems.
This is a more general principle that can be applied to most areas of endeavour. When you set about research and development that involves a mix of easy, medium, and hard problems, you must solve the hard problems first otherwise you blow your budget finding new ways to solve the easy problems, which nobody cares about in science.
But "AI" has left the realm of science behind and entered the realm of capitalism where several years of meaningless intellectual gyration without ever solving a hard problem may be quite profitable.
Flash answered correctly in ~2 seconds, at most. Pro answered very wrongly after thinking and elaborating for ~5 minutes.
Flash was also giving a wrong answer for the same string in the past, but it improved.
Prompt was the same: "Hey, can you decode $BASE64_string?"
I have no further comments.
If the model changes things, it means it didn't really capture the translation rules of Base64, and then who knows what it will miss when translating between languages if it can't even do Base64?
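For contrast, a minimal sketch of what a conventional program does with the same task. Base64 is a fixed byte-level mapping with exactly one correct answer, so the round-trip below succeeds every time, while a sampled token stream can drift on any character:

    import base64

    message = b"Hey, can you decode this?"
    encoded = base64.b64encode(message).decode("ascii")
    decoded = base64.b64decode(encoded)
    assert decoded == message  # deterministic: round-trips exactly, every time
    print(encoded, "->", decoded.decode("utf-8"))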
Realistically there are many problems that non-reasoning models do better on, especially when the answer cannot be solved by a thought process: like recalling internal knowledge.
You can try to teach the model the concept of a problem where thinking will likely steer it away from the right answer, but at some point it becomes like the halting problem: how does the model reliably think its way to the realization that a given problem is too complex to be thought out?
I've thought before that AI is as "intelligent" as your smartphone is "smart," but I didn't think "reasoning" would be just another buzzword.
Q: does the operation to create new knowledge you did have a specific name?
A: ... Deductive Reasoning
Q: does the operation also have a Latin name?
A: ... So, to be precise, you used a syllogismus (syllogism) that takes the form of Modus Ponens to make a deductio (deduction).
https://aistudio.google.com/app/prompts/1LbEGRnzTyk-2IDdn53t...
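For reference, the inference pattern the transcript names, modus ponens, in its textbook form:

    \[
      \frac{P \rightarrow Q \qquad P}{Q}
    \]

i.e. from "P implies Q" and "P", conclude "Q".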
People then say, "of course it could do that, it just pattern-matched a logic textbook. I meant a real example, not an artificially constructed one like this. In a complex scenario LLMs obviously can't do modus ponens."
It is a trap to consider a first-principles perspective sufficient. If I tell you, before 1980, "iterate f(z) = z^2 + c", there is no way you are going to guess that fractals emerge. Same with the rules of Conway's Game of Life: seeing the code, you won't guess that it produces gliders and guns.
My point is that recursion creates its own inner opacity; it is irreducible, so the iteration code alone is not enough to know a recursive system's behavior. It is a becoming, not a static thing. That's also why we have the halting problem: recursion.
Reasoning is a recursive process too; you can't understand it by analyzing its parts.
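As a concrete illustration of how little the update rule reveals, here is a minimal escape-time sketch of the f(z) = z^2 + c iteration (the 100-iteration cap, the 2.0 escape radius, and the plotting window are the usual conventions, picked here purely for illustration):

    def escapes(c: complex, max_iter: int = 100) -> bool:
        # Iterate z <- z^2 + c from z = 0; points whose orbit stays bounded
        # belong to the Mandelbrot set.
        z = 0j
        for _ in range(max_iter):
            z = z * z + c
            if abs(z) > 2.0:
                return True
        return False

    # Crude ASCII rendering of the region [-2, 1] x [-1.25, 1.25]
    for row in range(24):
        y = 1.25 - row * (2.5 / 23)
        print("".join(" " if escapes(complex(-2.0 + col * (3.0 / 59), y)) else "#" for col in range(60)))

Nothing in those dozen lines hints at the structure that appears when you actually run them, which is the sense in which the behavior is not readable from the code.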
Without properly defining what thinking is and what it is not, you can't discuss its properties, let alone which entities manifest it.
Lack of definition seems to be ideal substrate for marketing and hype.
That's nonsense, because the people obligated to furnish a "rigorous definition" are the people who make the positive claim that something specific is happening.
Also, the extraordinary claims are the ones that require extraordinary evidence, not the other way around.
Incontrovertibly demonstrating a dramatic failure in its, and many of its kind's, ability to reason.
You act as if statistical processes can’t ever scale into reasoning, despite the fact that humans themselves are gradient-trained statistical learners over evolutionary and developmental timescales.
> Reasoning is not difficult to define
> Reasoning exists on a spectrum
> statistical processes [can] scale into reasoning
It seems like quite a descent here, starting with the lofty heights of condemning skeptics as "cretins" and insisting the definition is easy... down to what sounds like the introduction to a flavor of panpsychism [0], where even water flowing downhill is a "statistical process" which at enough scale would be "reasoning".
I don't think that's a faithful match to what other people mean [1] when they argue LLMs don't "reason."
[0] https://en.wikipedia.org/wiki/Panpsychism
[1] https://en.wikipedia.org/wiki/Motte-and-bailey_fallacy
I very strongly agree with you. :)
Imagine a simple but humongous hash table that maps every possible 20,000-character prompt to the most appropriate answer at the present time.
(How? Let's say outsourcing to East Asia, or aliens...)
Would you say such a mechanism is doing reasoning?
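As a sketch of the mechanism being imagined (the table contents are obviously hypothetical; no such table could actually be built, which is part of the thought experiment):

    # A pure lookup: no inference, no search, just a hash of the prompt.
    ANSWERS: dict[str, str] = {
        "What is 2 + 2?": "4",
        # ...one entry for every possible 20,000-character prompt...
    }

    def respond(prompt: str) -> str:
        return ANSWERS.get(prompt[:20000], "I don't know.")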
In 2025 they got a 313% gain (4.13 output factor).
Fusion is actually here and working. It’s not cost effective yet but to pretend there has been no progress or achievements is fundamentally false.
Fusion News, May 28th, 2025 https://www.youtube.com/watch?v=1YHcI-SfKx8
It's so easy to criticize the work of others and not deliver anything. Apple is being Sam from Game of Thrones: "I'm tired of reading about the achievements of better men."
> It's so easy to criticize the work of others and not deliver anything. Apple is being Sam from Game of Thrones: "I'm tired of reading about the achievements of better men."
This is a patently absurd thing to write about a research paper.
This work balances the hype and shows fundamental limitations, so the AI hypesters are kept in check.
Why be salty?