But by allowing it to think a bit before answering, the result is much better and far more trustworthy.
This suggests a clean RL environment, or at least a useful dataset: prompt the model twice, once with thinking disabled and once with thinking enabled, then penalise the non-thinking answer whenever it contradicts the answer obtained with thinking.
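A minimal sketch of that reward signal, assuming a hypothetical `ask_model` call (stubbed here so the example runs standalone; a real setup would call the actual model with thinking toggled):

```python
def ask_model(prompt: str, thinking: bool) -> str:
    """Stub standing in for a real LLM call.

    For illustration, the thinking pass returns a different (presumed
    more reliable) answer than the quick pass.
    """
    if thinking:
        return "42"   # answer produced after chain-of-thought
    return "41"       # quick answer, sometimes wrong

def consistency_reward(question: str) -> float:
    """Reward the non-thinking answer only when it agrees with the
    answer obtained with thinking enabled; penalise contradictions."""
    slow = ask_model(question, thinking=True)
    fast = ask_model(question, thinking=False)
    return 1.0 if fast == slow else -1.0

print(consistency_reward("What is 6 * 7?"))  # -1.0 with this stub
```

The same comparison could just as easily be logged offline to build a training set of (prompt, contradicting quick answer, thinking answer) triples instead of an online reward.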