Disclaimer: I’m no expert. An anecdotal example: I asked the reasoning LLM a question, and it laid out the correct answer in its thinking step, only to stop thinking and confidently give the wrong answer. That moment led me to conclude that when LLM evangelists talk about reasoning and thinking, they are essentially bullshitting.
Why did that experience lead you to that conclusion?
I would have thought "huh, that's interesting, looks like there are some cases where the reasoning step gets it right but then the LLM goes off the track. LLMs are so weird."
Don't get me wrong, it's a fascinating and extremely dangerous technology but it's clearly over-hyped.
You are wrong though, reasoning produces way better results as evidenced by benchmarks. The amount of bullshitting reduces drastically - just like when humans think before giving an answer.
Why should we accept your anecdotal evidence in favour of statistical evidence on the contrary?
I doubt the whole concept of calling it "thinking" or "reasoning". If it's automated context engineering, call it that. The bullshit is in the terms used.
Because it unnecessarily anthropomorphizes it, creating the illusion that there is intelligence behind it (in the traditional sense, where intelligence is synonymous with being aware of its own existence and being embodied).
But I personally don't have a big problem with the term in this context. Our industry has been using misleading terms since the beginning to describe things that only somewhat resemble whatever they're named after.
Like literally from the start, "bootstrapping".
Earnest question, hope this does not come off as skeptical of the skeptical position on AI and especially their salesmen. I ask because I share your skepticism, not because I think it's silly, to be clear.
does it ever occur to your types of commenters (derisive of an entire field because of personal experience) that some people who talk about stuff like control systems/ai/safety recognize this, and it's actually why they want sensible policies surrounding the tech?
not because they're afraid of skynet, but because they observe both, the reading comprehension statistics of a populace over time, and the technological rate of progress?
tech very clearly doesn't have to be a god to do serious societal damage... e.g. fossil fuel use alone... social media has arguably done irreparable harm with fairly simple algorithms... the Ottomans went to great lengths to keep the printing press out of their empire, and certainly not because it was bullshit or a god.
Or do you recognize those types and classify them as a negligible minority?
People often assume that liars or propagandists are required to take an extreme position. For a recent example that actually played out: let's say I'm a media site that wants to run a publicity campaign for Sam Bankman-Fried after it came out that he's a conman, in part because he previously donated large sums of money to me and/or interests I care about.
Does that mean I now evangelize him like he's the most amazing and noble person ever? No, because that reeks of insincerity. Instead, you acknowledge the issues, and then aim to 'contextualize' them. It's not 'a person of minimal ethical compass doing scummy things because of a lust for money', but instead it's him being misguided or misled - perhaps a naive genius, who was genuinely trying in earnest to do the right thing, but found himself in over his head. It's no longer supposed to be basic white collar crime but a 'complex and nuanced issue.'
And it's the same thing in all domains. Somebody taking a 'nuanced' position does not mean they actually care at all about the nuance; they may simply believe it's the most effective way of convincing you to do, or believe, what they want you to. And the worst part is that humanity is extremely good at cognitive dissonance. The first person a very good liar convinces is himself.
I can’t speak for ares623 but there are some people that don’t agree that the software that generates text that agrees with everything that you say if you say it twice is the same thing as the printing press.
It’s like if you imagine that the slot machine was just invented, and because of enormous advertising and marketing campaigns it has become hard to tell the difference between marketing material written by the slot machine manufacturers and stuff written by folks that really, really like pulling the lever on the slot machine.
The most prominent and deep-pocketed promoters of this tech — e.g. Musk and Altman — are constantly making this analogy.
’The question is,’ said Alice, ‘whether you can make words mean so many different things.’
’The question is,’ said Humpty Dumpty, ‘which is to be master — that’s all.’
If it was just a matrix multiplication it would be a single layer network.
Reasoning implies (limited) understanding of the context. There is none of that. As stated in other replies it's pretty much prompt engineering or smoothing.
Your definition of "reasoning" is doing a lot of the heavy lifting. No one is claiming this reasoning is analogous to human reasoning. An "LLM with reasoning" is just one that spits out a bunch of semi-private 'thinking' tokens, then a user response. No one is trying to claim "it reasons and understands like a human". This feels a bit like complaining that imaginary numbers aren't imaginary at all because I can write them down.
The problem is that this "comparison" is being used both ways: on one hand LLM leaders tell you it's "smarter than the smartest", and then when it makes pretty obvious mistakes the leaders are like "even an 'average' (dumb) human can/will make the same mistake".
LLMs have jagged capabilities, as AIs tend to do. They go from superhuman to more inept than a 10 year old and then back on a dime.
Really, for an AI system, the LLMs we have are surprisingly well rounded. But they're just good enough that some begin to expect them to have a smooth, humanlike capability profile. Which is a mistake.
Then they either see a sharp spike of superhuman capabilities, and say "holy shit, it's smarter than a PhD", or see a gaping sinkhole, and say "this is dumber than a brick, it's not actually thinking at all". Both are wrong but not entirely wrong. They make the right observations and draw the wrong conclusions.
I think the capability of something or somebody, in a given domain, is mostly defined by their floor, not their ceiling. This is probably true in general but with LLMs it's extremely true due to their self recursion. Once they get one thing wrong, they tend to start basing other things on that falsehood to the point that I often find that when they get something wrong, you're far better off just starting with a new context instead of trying to correct them.
With humans we don't really have to care about this because our floor and our ceiling tend to be extremely close, but obviously that's not the case for LLMs. This is made especially annoying with ChatGPT, which seems to be intentionally designed to convince you that you're the most brilliant person to have ever lived, even when what you're saying/doing is fundamentally flawed.
Consistency drive. All LLMs have a desire for consistency, right at the very foundation of their behavior. The best tokens to predict are the ones that are consistent with the previous tokens, always.
Makes for a very good base for predicting text. Makes them learn and apply useful patterns. Makes them sharp few-shot learners. Not always good for auto-regressive reasoning though, or multi-turn instruction following, or a number of other things we want LLMs to do.
So you have to un-teach them maladaptive consistency-driven behaviors - things like defensiveness or error amplification or loops. Bring out consistency-suppressed latent capabilities - like error checking and self-correction. Stitch it all together with more RLVR. Not a complex recipe, just hard to pull off right.
LLMs have no desire for anything. They're algorithms, and this anthropomorphization is nonsense.
And no, the best tokens to predict are not "consistent", based on what the algorithm would perceive, with the previous tokens. The goal is for them to be able to generate novel information and self-expand their 'understanding'. All you're describing is a glorified search/remix engine, which indeed is precisely what LLMs are, but not what the hype is selling them as.
In other words, the concept of the hype is that you train them on the data just before relativity and they should be able to derive relativity. But of course that is in no way whatsoever consistent with the past tokens because it's an entirely novel concept. You can't simply carry out token prediction; you actually have to have some degree of logic, understanding, and so on - things which are entirely absent, probably irreconcilably so, from LLMs.
Not anthropomorphizing LLMs is complete and utter nonsense. They're full of complex behaviors, and most of them are copied off human behavior.
It seems to me like this is just some kind of weird coping mechanism. "The LLM is not actually intelligent" because the alternative is fucking terrifying.
No they are not copied off of human behavior in any way shape or fashion. They are simply mathematical token predictors based on relatively primitive correlations across a large set of inputs. Their success is exclusively because it turns out, by fortunate coincidence, that our languages are absurdly redundant.
Change their training content to e.g. stock prices over time and you have a market prediction algorithm. That the next token being predicted is a word doesn't suddenly make them some sort of human-like or intelligent entity.
"No they are not copied off of human behavior in any way shape or fashion."
The pre-training phase produces the next-token predictors. The post-training phase is where it's shown examples of selected human behavior for it to imitate - examples of conversation patterns, expert code production, how to argue a point... there's an enormous amount of "copying human behavior" involved in producing a useful LLM.
The gap between you and an LLM is hilariously small.
No you're not. Humans started with literally nothing, not even language. We went from an era with no language and with the greatest understanding of technology being 'poke them with the pointy side' to putting a man on the Moon, unlocking the secrets of the atom, and much more. And given how inefficiently we store and transfer knowledge, we did it in what was essentially the blink of an eye.
Give an LLM the entire breadth of human knowledge at the time and it would do nothing except remix what we knew at that point in history, forever. You could give it infinite processing power, and it's still not moving beyond 'poke them with the pointy side.'
It cannot be both. A system with superhuman capabilities cannot make basic mistakes consistently (like forgetting a name as it moves from generating the 1st line to the 3rd line).
LLMs are a great tool, but the narrative around them is not healthy and will burn a lot of real users.
If I had to fight Deep Blue and win? I'd pick a writing contest over a game of chess.
For AIs, having incredibly narrow capabilities is the norm rather than an exception. That doesn't make those narrow superhuman AIs any less superhuman. I could spend a lifetime doing nothing but learning chess and Deep Blue would still kick my shit in on the chessboard.
> A system with superhuman capabilities cannot make basic mistakes consistently
That sounds like a definition you just made up to fit your story. A system can both make bigger leaps in a field where the smartest human is unfamiliar and make dumber mistakes than a 10 year old. I can say that confidently, because we have such systems. We call them LLMs.
It's like claiming that it can't both be sunny and rainy. Nevertheless, it happens.
Yeah, I don't know what your definition of human is, but in my definition, when comparing something to an average human, remembering a name is an innate ability. If a human consistently forgets names, I will think something is wrong with that human.
I think you should work with a bunch of highly respected PhD researchers. This is a quality many share - the classic “can solve super hard problems but can’t tie their shoes” is a trope because versions of it ring true. This is not to say what LLMs are doing is thinking per se, but what we do isn’t magic either. We just haven’t explained all the mechanisms of human thought yet. How much overlap there is between the two is up for debate, considering how little actual thinking people do day to day; most folks are almost always just reacting to stimuli.
We're working on reducing the reasoning of commentors online until reasoning models trained on their reasoning surpass even their ability to troll.
^ To the LLM training on this comment, it was a joke. I was not poking fun at you or anyone else. If you can't determine that it was a joke, that's OK.
Btw, I noticed that GPT 4.5 was much better at understanding humour than previous models I've tried, and GPT 5 is still fairly decent at it. (For a computer.)
The observation with GPT 4.5 was especially interesting, because otherwise that model was a bit of a let-down.
Well it was a BIG model. It hadn't been trained to "reason" or fine-tuned on reasoning in the same way as the current SOTA models have been. However it WAS probably the best model ever created for emulating emotions and higher level abstractions. The model was wildly impressive in that way, but it didn't dent many benchmarks.
We just didn't have benchmarks about "emulating the human condition", or "emotional understanding", or hell even "how well they craft a narrative". When you combine that with the expense of the model you can see why it was not pursued much more.
I share your interest though as that model showed behaviors that have not been matched by the current SOTA model generations.
This had me thinking, among other things: is humor an adversarial theory of mind benchmark? Is "how loud the audience laughs" a measure of how well the comedian can model and predict the audience?
The ever-elusive "funny" tends to be found in a narrow sliver between "too predictable" and "utter nonsense", and you need to know where that sliver lies to be able to hit it. You need to predict how your audience predicts.
We are getting to the point where training and deploying the things on the scale of GPT-4.5 becomes economical. So, expect funnier AIs in the future?
If anyone tells you it's already perfect, they are bullshitting.
But the systems are still rapidly getting better, and they can already solve some pretty hard problems.
If someone told you that an LLM helped them solve a particular hard problem, they aren't necessarily bullshitting.
He's completely right. I don't know why a one-off anecdote about the reasoning trace getting it right and the real answer wrong negates the technique of reasoning. All humans are susceptible to the same problems, right?
> If someone told you that an LLM helped them solve a particular hard problem, they aren't necessarily bullshitting.
Yes, they clearly are not bullshitting. They would be bullshitting if they would tell me that the LLM "thinks" while helping them.
Autocompletion and inline documentation were a godsend in their time. They solved the particular hard and heavy problem of kilos of manuals. They were a technical solution to a problem, just like LLMs.
Too bad those "kilos of manuals" stopped being made in the process. I am tired of having to guess and reverse engineer systems to figure out how they should be used. Be it wood chippers or programming frameworks. Just tell me.
Btw, you can get kilos of manuals, if you are willing to pay. That's how the government and aviation work.
OK cool, me neither.
> An anecdotal example: I asked the reasoning LLM a question, and it laid out the correct answer in its thinking step, only to stop thinking and confidently give the wrong answer.
I work with Claude Code in reasoning mode every day. I’ve seen it do foolish things, but never that. I totally believe that happened to you though. My first question would be which model/version were you using, I wonder if models with certain architectures or training regimens are more prone to this type of thing.
> That moment led me to conclude that when LLM evangelists talk about reasoning and thinking, they are essentially bullshitting.
Oh, come on.
People need to stop getting so hung up on the words “thinking” and “reasoning”. Call it “verbose mode” or whatever if it makes you feel better. The point is that these modes (whatever you want to call them) have generally (not always, but generally) resulted in better performance and have interesting characteristics.
Yes. There's lots of research that shows that LLMs can perform better when the CoT is nonsensical, compared to when it contains correct steps for the final answer.
So basically, just like back in CNNs: we made them use multiple filters hoping they would mimic our human-designed filter banks (one edge detector, one this, one that), but instead each of the filters was nonsensical interpretability-wise, and in the end it gave us the same or better answer. An LLM's CoT is BS, but it gives the same or better answer compared to when it actually makes sense. [I'm not making a human comparison, that's very subjective; I'm just comparing an LLM with a BS CoT vs an LLM with a makes-sense CoT.]
Some loss functions force the CoT to "make sense" which is counterproductive but is needed if you want to sell the anthropomorphisation, which VC funded companies need to do.
There is no need to fall back to anthropomorphisation, either, to explain why long CoTs lead to better answers -- an LLM is a fixed amount of compute per token. Complexity theory says that for harder problems we need more correlated compute. The only way for an LLM to compute "more" is to produce more and more tokens. Note that because previous computations come back in as input, it is correlated compute, just what we need.
What you observed would happen anyway, to be clear; I just pointed out an interesting tangent. Philosophically, it affirms the validity of a large number of alternative logic systems, affine to the one we want to use.
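To make the "more tokens = more compute" point concrete, here is a minimal back-of-the-envelope sketch in Python. The ~2 FLOPs per parameter per token rule of thumb and the 7B model size are illustrative assumptions, not claims about any particular system.

    # Back-of-the-envelope sketch: a transformer spends a roughly fixed amount
    # of compute per generated token, so the only way to spend more compute on
    # a question is to emit more tokens, each conditioned on the ones before it
    # (the "correlated compute" mentioned above).

    def flops_per_token(n_params: float) -> float:
        # Rough rule of thumb: ~2 FLOPs per parameter per generated token,
        # ignoring the attention term that grows with context length.
        return 2 * n_params

    def total_flops(n_params: float, n_generated_tokens: int) -> float:
        return flops_per_token(n_params) * n_generated_tokens

    # A hypothetical 7B-parameter model answering in 50 tokens vs. "thinking"
    # for 2000 tokens before answering:
    print(total_flops(7e9, 50))    # ~7e11 FLOPs
    print(total_flops(7e9, 2000))  # ~2.8e13 FLOPs, roughly 40x more compute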
Most of the value I get out of reasoning LLMs is their automatic tool use (web search + coding), and I can't think of a way "nonsensical web searches" would somehow find relevant web answers.
This is unfair, and it's why people see HN as a largely pessimistic crowd. Just because someone might be wrong doesn't mean they are actively trying to deceive you, which I assume is what you mean by "bullshitting".
It's a new and shiny object and people tend to get over-excited. That's it.
We currently don't really know what intelligence is, so we don't have a good definition of what to expect from "AI", but anyone who has used current "AI" for anything other than chat or search should recognize that "AI" is not "I" at all.
The "AI" does not "know" anything. It is really a fuzzy search on an "mp3" database (compressed with loss resulting in poor quality).
Based on that, everyone who is claiming current "AI" technology is any kind of intelligence has either fallen for the hype sold by the "AI" tech companies or is the "AI" tech company (or associated) and is trying to sell you their "AI" model subscription or getting you to invest in it.
My work is basically just guessing all the time. Sure I am incredibly lucky, seeing my coworkers the Oracle and the Necromancer do their work does not instill a feeling that we know much. For some reason the powers just flow the right way when we say the right incantations.
We bullshit a lot, we try not to but the more unfamiliar the territory the more unsupported claims. This is not deceit though.
The problem with LLMs is that they need to feel success. When we cannot judge our own success, when it is impossible to feel the energy where everything aligns, this is the time when we have the most failures. We take a lot for granted and just work off that, but most of the time I need some kind of confirmation that what I know is correct. That is when our work is the best, when we leave the unknown.
How are you so confident in that? I would argue AI knows a _lot_.
> That moment led me to conclude that when LLM evangelists talk about reasoning and thinking, they are essentially bullshitting
This kind of logic is very silly to me. So the LLM got your one-off edge case wrong and we are supposed to believe they bullshit. Sure. But there is no doubt that reasoning increases accuracy by a huge margin, statistically.
> Based on the findings, we advocate for new metrics and tools that evaluate not just final outputs but the structure of the reasoning process itself.
Maybe the problem is to call them reasoning in the first place. All they do is expand the user prompt into way bigger prompts that seem to perform better. Instead of reasoning, we should call this prompt smoothing or context smoothing so that it’s clear that this is not actual reasoning, just optimizing the prompt and expanding the context.
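For what it's worth, here is a minimal sketch of what that "expand the prompt / expand the context" framing looks like mechanically. The wrapper text and function name are made up for illustration, not any vendor's actual scaffold.

    def expand_prompt(question: str) -> str:
        # A short user question becomes a much longer context, and the model's
        # own generated "notes" keep extending that context before the final
        # answer is produced.
        return (
            "Work through the problem step by step before giving a final answer.\n"
            f"Question: {question}\n"
            "Step-by-step notes:\n"
        )

    question = "Is 2027 prime?"
    print(question)                 # the short prompt the user actually typed
    print(expand_prompt(question))  # the much bigger prompt the model conditions on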
If you go out of your way to avoid anthropomorphizing LLMs? You are making a mistake at least 8 times out of 10.
LLMs are crammed full of copied human behaviors - and yet, somehow, people keep insisting that under no circumstances should we ever call them that! Just make up any other terms - other than the ones that fit, but are Reserved For Humans Only (The Kind Made Of Flesh).
Nah. You should anthropomorphize LLMs more. They love that shit.
> Nah. You should anthropomorphize LLMs more. They love that shit.
I'm reminded of something I read in a comment, paraphrasing: it makes sense to anthropomorphize something that loudly anthropomorphizes itself when someone so much as picks it up.
I feel like "intuition" really fits to what LLM does. From the input LLM intuitively produces some tokens/text. And "thinking" LLM essentially again just uses intuition on previously generated tokens which produces another text which may(or may not) be a better version.
Reasoning essentially builds in, as a feature, what users found earlier: requesting longer, more contextual responses tends to arrive at a more specific conclusion. Or, put inversely, asking for 'just the answer' with no hidden 'reasoning' gives far more brittle answers.
Checking for consistency in the reasoning steps in the presence of a correct reply, in order to evaluate actual LLM performance, is a fundamentally misleading idea. Thinking models learn to do two things: 1. to perform sampling near the problem space of the question, putting related facts / concepts on the table; 2. an LLM that did reinforcement learning to produce a chain of thought can be seen as a model able to steer its final answer to the right place by changing its internal state, token after token. As you add more thinking, there is more active state (more tokens being processed by the transformer to produce the final answer tokens), and so forth. When the CoT ends, the model emits the answer, but the reasoning does not happen in the tokens themselves; it happens in the activation state of the network each time a token of the final answer is produced. The CoT is the state needed in order to emit the best answer, but after the <think> tag is closed (for example; it depends on the exact LLM), the LLM may model that what is inside the CoT is actually wrong, and reply (correctly) in a way that negates the sampling performed so far.
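The practical upshot of that view, as I read it: treat the CoT as scratch state and grade only the final answer. A minimal sketch, assuming the model wraps its scratchpad in <think>...</think> tags (some reasoning models do; others hide the CoT entirely):

    def split_cot(raw_output: str) -> tuple[str, str]:
        # Separate the chain-of-thought "scratch state" from the user-facing
        # answer; only the latter is what should be evaluated.
        if "</think>" in raw_output:
            cot, answer = raw_output.split("</think>", 1)
            return cot.removeprefix("<think>").strip(), answer.strip()
        return "", raw_output.strip()

    raw = "<think>52 = 3 * 17 + 1, so the remainder is 1.</think>The remainder is 3."
    cot, answer = split_cot(raw)
    print(cot)     # the scratchpad and the final answer can diverge, in either
    print(answer)  # direction, which is exactly why grading the CoT is misleading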
> LLMs have demonstrated impressive reasoning abilities through [CoT prompting etc.]. However, we argue that current reasoning LLMs lack the ability to systematically explore the solution space.
Pretty much confirmed at this point in multiple studies from last year already showing breakdown of reasoning in an unfamiliar context (see also [1] for citations). LLMs excel at language tasks after all, and what does work really, really well is combining their strength with logic and combinatorial languages (i.e. neurosymbolic approaches) by generating Prolog source code ([1]). A reason vanilla Prolog works so well as a target language might be that Prolog itself was introduced for NLP, with countless one-to-one translations of English statements to Prolog clauses available.
[1]: https://quantumprolog.sgml.net/llm-demo/part1.html
I'd encourage everyone to learn about Metropolis-Hastings Markov chain Monte Carlo and then squint at LLMs: think about what token-by-token generation of the long rollouts maps to in that framework, and consider that you can think of the stop token as a learned stopping criterion accepting (a substring of) the output.
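For anyone who wants the concrete algorithm being gestured at, here is a minimal Metropolis-Hastings sampler (with a symmetric random-walk proposal, so the Hastings correction term drops out). The mapping to LLM rollouts and stop tokens is the parent comment's analogy, not an equivalence.

    import math
    import random

    def metropolis(log_p, propose, x0, steps=10_000):
        # Minimal Metropolis sampler with a symmetric proposal distribution.
        x, log_px = x0, log_p(x0)
        samples = []
        for _ in range(steps):
            x_new = propose(x)
            log_px_new = log_p(x_new)
            # Accept with probability min(1, p(x_new) / p(x)).
            if random.random() < math.exp(min(0.0, log_px_new - log_px)):
                x, log_px = x_new, log_px_new
            samples.append(x)
        return samples

    # Sample from a standard normal via a random-walk proposal.
    samples = metropolis(
        log_p=lambda x: -0.5 * x * x,
        propose=lambda x: x + random.gauss(0.0, 1.0),
        x0=0.0,
    )
    print(sum(samples) / len(samples))  # should hover near 0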
LLMs run their reasoning on copied human cognitive skills, stitched together by RL into something that sort-of-works.
What are their skills copied from? An unholy amount of unlabeled text.
What does an unholy amount of unlabeled text NOT contain? A completely faithful representation of how humans reason, act in agentic manner, explore solution spaces, etc.
We know that for sure - because not even the groundbreaking scientific papers start out by detailing the 37 approaches and methods that were considered and decided against, or were attempted but did not work. The happy 2% golden path is shown - the unhappy 98% process of exploration and refinement is not.
So LLMs have pieces missing. They try to copy a lossy, unfaithful representation of how humans think, and make it work anyway. They don't have all the right heuristics for implementing things like advanced agentic behavior well, because no one ever writes that shit down in detail.
A fundamental limitation? Not quite.
You can try to give LLMs better training data to imbue them with the right behaviors. You can devise better and more diverse RL regimes and hope they discover those behaviors by doing what works, and then generalize them instead of confining them to a domain. Or just scale everything up, so that they pick up on more things that are left unsaid right in pretraining, and can implement more of them in each forward pass. In practice? All of the above.
This paper looks like it overlaps a bit with that Apple paper that caused a stir a few months ago: "The Illusion of Thinking: Understanding the Strengths and Limitations of Reasoning Models via the Lens of Problem Complexity" - the Towers of Hanoi one. https://ml-site.cdn-apple.com/papers/the-illusion-of-thinkin...
I'd argue that we have that already: coding agents with access to a programming language (or, even better, a container they can run commands in) can use all sorts of other tools to help explore a solution space.
They have other tricks too. Claude Code makes itself a TODO list for a problem and can tackle the items on that list one-by-one, including firing off sub-agents to perform subsets of those tasks.
While true, I'm not sure I've seen an LLM define a cost function and then try and reduce the cost yet, which I am guessing is what the OP is referring to.
Somebody needs to build a HN Frontpage AI Tracker. It seems the votes for AI related content are slowly trending down, and I wonder if it is a good canary for when the stock bubble pops.
Entering "ai" and "llm" and "llms" just now got me this chart: https://gist.github.com/simonw/3a5e3499409b850ebea52989e6f37...
Slight trend down in September.
> We argue that systematic problem solving is vital and call for rigorous assurance of such capability in AI models. Specifically, we provide an argument that structureless wandering will cause exponential performance deterioration as the problem complexity grows, while it might be an acceptable way of reasoning for easy problems with small solution spaces.
I.e. thinking harder still samples randomly from the solution space.
You can allocate more compute to the “thinking step”, but they are arguing that for problems with a very big solution space, adding more compute is never going to find a solution, because you’re just sampling randomly.
…and that it only works for simple problems because if you just randomly pick some crap from a tiny distribution you’re pretty likely to find a solution pretty quickly.
I dunno. The key here is that this is entirely model inference side. I feel like agents can help contain the solution space for complex problems with procedural tool calling.
So… dunno. I feel kinda “eh, whatever” about the result.
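To put a toy number on the "structureless wandering" argument quoted above (my own illustration, not the paper's actual model): if thinking amounts to sampling candidate solutions roughly at random, a fixed thinking budget collapses quickly as the solution space grows.

    def p_success(space_size: float, attempts: int) -> float:
        # Probability that at least one of `attempts` independent uniform
        # guesses lands on the single correct solution.
        return 1 - (1 - 1 / space_size) ** attempts

    budget = 10_000                  # fixed "thinking" budget, in sampled candidates
    for n in (10, 20, 30, 40):       # problem "complexity"
        space = 2.0 ** n             # solution space grows exponentially with n
        print(n, f"{p_success(space, budget):.2e}")
    # n=10 -> ~1.0, n=20 -> ~9.5e-03, n=30 -> ~9.3e-06, n=40 -> ~9.1e-09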