Why aren't LLMs trained on their own chain of thought?

I realise that without reasoning tokens, a model performs much worse: it can't do simple arithmetic or simple logic, and it hallucinates more.

But by allowing it to think a bit before answering, the result is much better and far more trustworthy.

This suggests a clean RL environment, or at least a nice dataset: prompt the model twice, once with thinking disabled and once with it enabled, and penalise the non-thinking result when it contradicts the answer obtained with thinking.
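The two-pass setup described above can be sketched as a reward function. This is a minimal, hypothetical sketch: `generate` stands in for any LLM call (stubbed here so the control flow is runnable), and the `answer:` extraction convention is an assumption, not a real API.

```python
def generate(prompt, thinking):
    # Stub for an LLM call; a real implementation would enable
    # reasoning tokens when thinking=True.
    if thinking:
        return "thought: 17 + 25 = 42\nanswer: 42"
    return "answer: 41"

def extract_answer(completion):
    # Take the final "answer:" line as the model's answer.
    for line in reversed(completion.splitlines()):
        if line.startswith("answer:"):
            return line.removeprefix("answer:").strip()
    return None

def reward(prompt):
    """Reward the non-thinking pass only when it agrees with the thinking pass."""
    with_cot = extract_answer(generate(prompt, thinking=True))
    without_cot = extract_answer(generate(prompt, thinking=False))
    return 1.0 if without_cot == with_cot else -1.0
```

With the stub above, the non-thinking pass contradicts the thinking pass, so `reward` returns -1.0; the thinking answer serves as the (imperfect) label.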

2 points | by simianwords 1 hour ago

1 comment

  • i7l 1 hour ago
For the same reason that nobody's reasoning process and answers to random exam questions are used as textbooks: if the reasoning is not guaranteed to be right, why would you want to use it as training material?
    • simianwords 1 hour ago
We can empirically measure how often the reasoning model is correct. At, say, 95% empirical accuracy, it should still help the model directionally. No training dataset needs to be 100% accurate. No?