Here's a high level overview for a programmer audience (I'm listed as an author but my contributions were fairly minor):
[See specifics of the pipeline in Table 3 of the linked paper]
* There are 181 million ish essentially different Turing Machines with 5 states, first these were enumerated exhaustively
* Then, each machine was run for about 100 million steps. Of the 181 million, about 25% of them halt within this memany step, including the Champion, which ran for 47,176,870 steps before halting.
* This leaves 140 million machines which run for a long time.
So the question is: do those TMs run FOREVER, or have we just not run them long enough?
The goal of the BB challenge project was to answer this question. There is no universal algorithm that works on all TMs, so instead a series of (semi-)deciders were built. Each decider takes a TM, and (based on some proven heuristic) classifies it as either "definitely runs forever" or "maybe halts".
Four deciders ended up being used:
* Loops: run the TM for a while, and if it re-enters a previously-seen configuration, it definitely has to loop forever. Around 90% of machines do this or halt, so covers most.
6.01 million TMs remain.
* NGram CPS: abstractly simulates each TM, tracking a set of binary "n-grams" that are allowed to appear on each side. Computes an over-approximation of reachable states. If none of those abstract states enter the halting transition, then the original machine cannot halt either.
Covers 6.005 million TMs. Around 7000 TMs remain.
* RepWL: Attempts to derive counting rules that describe TM configurations. The NGram model can't "count", so this catches many machines whose patterns depend on parity. Covers 6557 TMs.
There are only a few hundred TMs left.
* FAR: Attempt to describe each TM's state as a regex/FSM.
* WFAR: like FAR but adds weighted edges, which allows some non-regular languages (like matching parentheses) to be described
* Sporadics: around 13 machines had complicated behavior that none of the previous deciders worked for. So hand-written proofs (later translated into Rocq) were written for these machines.
All of the deciders were eventually written in Rocq, which means that they are coupled with a formally-verified proof that they actually work as intended ("if this function returns True, then the corresponding mathematical TM must actually halt").
Hence, all 5-states TMs have been formally classified as halting or non-halting. The longest running halter is therefore the champion- it was already suspected to be the champion, but this proves that there wasn't any longer-running 5-state TM.
In Coq-BB5 machines are not run for 100M steps but directly thrown in the pipeline of deciders. Most halting machines are detected by Loops using low parameters (max 4,100 steps) and only 183 machines are simulated up to 47M steps to deduce halting.
In legacy bbchallenge's seed database, machines were run for exactly 47,176,870 steps, and we were left with about 80M nonhalting candidates. Discrepancy comes from Coq-BB5 not excluding 0RB and other minor factors.
Also, "There are 181 million ish essentially different Turing Machines with 5 states", it is important to note that we end up with this number only after the proof is completed, knowing this number is as hard as solving BB(5).
Why is knowing that there are 181 M 5-state machines difficult? Is it not just (read*write*move*(state+halt))^state? About 48^5, modulo "essentially different".
Number of 5-state TMs is 21^10 = 16,679,880,978,201; coming from (1 + writemovestate)^2*state; difference with your formula is that in our model, halting is encoded using undefined transition.
"essentially different" is not a statically-checked property. It is discovered by the enumeration algorithm (Tree Normal Form, see Section 3 of https://arxiv.org/pdf/2509.12337); in particular, for each machine, the algorithm needs to know it if will ever reach an undefined transition because if it does it will visit the children of that machine (i.e. all ways to set that reached undefined transition).
Knowing if an undefined transition will be ever reached is not computable, hence knowing the number of enumerated machines is not computable in general and is as hard as solving BB(5).
I guess the meaning of "essentially different" essentially depends on the strength of the mathematical theory that you used to classify them!
When I first heard it I thought about using some kind of similar symmetry arguments (e.g. swapping left-move and right-move). Maybe there are also more elaborate symmetry arguments of some kind.
Isn't it fair to say that there is no single objective definition of what differences between machines are "essential" here? If you have a stronger theory and stronger tools, you can draw more distinctions between TMs; with a weaker theory and weaker tools, you can draw fewer distinctions.
By analogy, suppose you were looking at groups. As you get a more sophisticated group theory you can understand more reasons or ways that two groups are "the same". I guess there is a natural limit of group isomorphism, but perhaps there are still other things where group structure or behavior is "the same" in some interesting or important ways even between non-isomorphic groups?
My intuition is that this would include every 4 state (and fewer) machine that has a fifth state that is never used. For every different configuration of that fifth state I would consider the machine to be essentially the same.
what's most interesting to me about this research is that it is an online collaborative one. I wonder how many more project such as this there are, and if it could be more widespread, maybe as a platform.
This comment reminded me to check whether https://www.distributed.net/ was still in existence. I hadn't thought about the site for probably two decades, I ran the client for this back in the late 1990s back when they were cracking RC5-64, but they still appear to be going as a platform that could be used for this kind of thing.
I was also excited about those projects and ran DESchall as well as distributed.net clients. Later on I was running the EFF Cooperative Computing Award (https://www.eff.org/awards/coop), as in administering the contest, not as in running software to search for solutions!
The original cryptographic challenges like the DES challenge and the RSA challenges had a goal to demonstrate something about the strength of cryptosystems (roughly, that DES and, a fortiori, 40-bit "export" ciphers were pretty bad, and that RSA-1024 or RSA-2048 were pretty good). The EFF Cooperative Computing Award had a further goal -- from the 1990s -- to show that Internet collaboration is powerful and useful.
Today I would say that all of these things have outlived their original goals, because the strength of DES, 40-bit ciphers, or RSA moduli are now relatively apparent; we can get better data about the cost of brute-force cryptanalytic attacks from the Bitcoin network hashrate (which obviously didn't exist at all in the 1990s), and the power and effectiveness of Internet collaboration, including among people who don't know each other offline and don't have any prior affiliation, has, um, been demonstrated very strongly over and over and over again. (It might be hard to appreciate nowadays how at one time some people dismissed the Internet as potentially not that important.)
This Busy Beaver collaboration and Terence Tao's equational theories project (also cited in this paper) show that Internet collaboration among far-flung strangers for substantive mathematics research, not just brute force computation, is also a reality (specifically now including formalized, machine-checked proofs).
There's still a phenomenon of "grid computing" (often with volunteer resources), working on a whole bunch of computational tasks:
It's really just the specific "establish the empirical strength of cryptosystems" and "show that the Internet is useful and important" 1990s goals that are kind of done by this point. :-)
The Busy Beaver problem is one of the most theoretical, one of the most purely mathematical, of all theoretical computer science questions. It is really about what is possible in a very abstract sense.
Working on it did make people cleverer at analyzing the behavior of software, but it's not obvious that those skills or associated techniques will directly transfer to analyzing the behavior of much more complex software that does practical stuff. The programs that were being analyzed here are extraordinarily small by typical software developer standards.
To be more specific, it seems conceivable to me that some of these methods could inspire deductive techniques that can be used in some way in proof assistants or optimizing compilers, in order to help ensure correctness of more complicated programs, but again that application isn't obvious or guaranteed. The people working on this collaboration would definitely have described themselves as doing theoretical computer science rather than applied computer science.
The research into BB numbers is purely academic and unlikely to be more than that, unless some other part of mathematics turns out to be wrong. Our current understanding is that the numbers are essentially guaranteed to be useless.
This particular proof is doubly-academic in the sense that the value was already known, this is just a way to make it easier to independently verify the result.
It's a part of a broader movement to provide machine proofs for other stuff (e.g., Fermat's last theorem), which may be beneficial in some ways (e.g., identifying issues, making parts of proofs more "reusable").
I'm going to push back on “the value was already known”; my understanding is that the candidate value was known, but while there was a single machine with behaviour that was not yet characterized, it quite easily could have been the real champion.
Ah yeah, I stand corrected. I didn't realize this is a paper by the original team that found the number a while back, I thought it's an independent formalization in 2025.
G. H. Hardy's autobiography was titled "A Mathematician's Apology" because he, somewhat jokingly, wanted to apologize for a life of pure math... totally academic and completely useless.
Then a few years after he passed away, his math was key for The Manhattan Project to build the first successful nuclear bomb. His math lead to the device that changed the world and affected every aspect of human life for a century.
It's only "useless" until someone finds a "use".
P.S. The book documents his time with Ramanujan. If you liked the film "The Man Who Knew Infinity", you should read Apology
Look, if you tell me that this is just an interesting problem, I'm happy to hear that. If it helps prove some exotic theorem, I'm happy to hear that. I'm also happy to hear that this has a direct application somewhere.
I am also unfamiliar, but a naive guess I can make is helping solve other math proofs, and maybe better encryption for the public. Basically everytime I see some progress in math, it usually means encryption in some way is impacted.
Define tangible benefits and give example of any pure mathematics in last 200 years having tangible benefits.
One benefit is that it is related to foundations of mathematics. If you can find busy beaver number 745, you can prove or disprove Maths. If you can find busy beaver 6, you can prove few collatz like conjecture which is an open problem.
Most core research doesn't help directly, but through indirect means. e.g. This would have led to finding the areas of improvements in Lean due to its complexity. I know for fact that Lean community learns a lot from projects like Mathlib and this. And Lean is used in all sort of scenarios like making AWS better to improving safety of flights.
Not quite, I think this is the relevant part of the paper:
> Structure of the proof. The proof of our main result, Theorem 1.1, is given in Section 6. The structure of the proof is as follows: machines are enumerated arborescently in Tree Normal Form (TNF) [9] – which drastically reduces the search space’s size: from 16,679,880,978,201 5-state machines to “only” 181,385,789; see Section 3. Each enumerated machine is fed through a pipeline of proof techniques, mostly consisting of deciders, which are algorithms trying to decide whether the machine halts or not. Because of the uncomputability of the halting problem, there is no universal decider and all the craft resides in creating deciders able to decide large families of machines in reasonable time. Almost all of our deciders are instances of an abstract interpretation framework that we call Closed Tape Language (CTL), which consists in approximating the set of configurations visited by a Turing machine with a more convenient superset, one that contains no halting configurations and is closed under Turing machine transitions (see Section 4.2). The S(5) pipeline is given in Table 3 – see Table 4 for S(2,4). All the deciders in this work were crafted by The bbchallenge Collaboration; see Section 4. In the case of 5-state machines, 13 Sporadic Machines were not solved by deciders and required individual proofs of nonhalting, see Section 5.
So, they figured out how to massively reduce the search space, wrote some generic deciders that were able to prove whether large amounts of the remaining search spaces would halt or not, and then had to manually solve the remaining 13 machines that the generic deciders couldn't reason about.
There are a few concepts at play here. First you have to consider what can be proven given a particular theory of mathematics (presumably a consistent, recursively axiomatizable one). For any such theory, there is some finite N for which that theory cannot prove the exact value of BB(N). So with "infinite time", one could (in principle) enumerate all proofs and confirm successive Busy Beaver values only up to the point where the theory runs out of power. This is the Busy Beaver version of Gödel/Chaitin incompleteness. For BB(5), Peano Arithmetic suffices and RCA₀ likely does as well. Where do more powerful theories come from? That's a bit of a mystery, although there's certainly plenty of research on that (see Feferman's and Friedman's work).
Second, you have to consider what's feasible in finite time. You can enumerate machines and also enumerate proofs, but any concrete strategy has limits. In the case of BB(5), the authors did not use naive brute force. They exhaustively enumerated the 5-state machines (after symmetry reductions), applied a collection of certified deciders to prove halting/non-halting behavior for almost all of them, and then provided manual proofs (also formalized) for some holdout machines.
I think you have to exhaustively check each 5-state TM, but then for each one brute force will only help a bit - brute force can't tell you that a TM will run forever without stopping?
Isn't it rather that the Busy Beaver function is uncomputable, particular values can be calculated - although anything great than BB(5) is quite a challenge!
Only finitely many values of BB can be mathematically determined. Once your Turing Machines become expressive enough to encode your (presumably consistent) proof system, they can begin encoding nonsense of the form "I will halt only after I manage to derive a proof that I won't ever halt", which means that their halting status (and the corresponding Busy Beaver value) fundamentally cannot be proven.
If such proofs exist that are checkable by a proof assistant, you can brute force them by enumerating programs in the proof assistant (with a comically large runtime). Therefore there is some small N where any proof of BB(N) is not checkable with existing proof assistants. Essentially, this paper proves that BB(5) was brute forcible!
The most naive algorithm is to use the assistant to check if each length 1 coq program can prove halting with computation limited to 1 second, then check each length 2 coq program running for 2 seconds, etc till the proofs in the arxiv paper are run for more than their runtime.
In this perspective, couldn't you equally say that all formalized mathematics has been brute forced, because you found working programs that prove your desired results and that are short enough that a human being could actually discover and use them?
... even though your actual method of discovering the programs in question was usually not purely exhaustive search (though it may have included some significant computer search components).
More precisely, we could say that if mathematicians are working in a formal system, they can't find any results that a computer with "sufficiently large" memory and runtime couldn't also find. Yet currently, human mathematicians are often more computationally efficient in practice than computer mathematicians, and the human mathematicians often find results that bounded computer mathematicians can't. This could very well change in the future!
Like it was somewhat clear in principle that a naive tree search algorithm in chess should be able to beat any human player, given "sufficiently large" memory and runtime (e.g. to exhaustively check 30 or 40 moves ahead or something). However, real humans were at least occasionally able to beat top computer programs at chess until about 2005. (This analogy isn't perfect because proof correctness or incorrectness within a formal system is clear, while relative strength in chess is hard to be absolutely sure of.)
You can try them with simple short loops detectors, or perhaps with the "turtle and hare" method. If I do that and a friend ask me how I solved it, I'd call that "bruteforce".
They solved a lot of the machines with something like that, and some with more advanced methods, and "13 Sporadic Machines" that don't halt were solved with a hand coded proof.
The busy beaver numbers form an uncomputable sequence.
For BB(5) the proof of its value is an indirect computation. The verification process involved both computation (running many machines) and proofs (showing others run forever or halt earlier). The exhaustiveness of crowdsourced proofs was a tour de force.
[See specifics of the pipeline in Table 3 of the linked paper]
* There are 181 million ish essentially different Turing Machines with 5 states, first these were enumerated exhaustively
* Then, each machine was run for about 100 million steps. Of the 181 million, about 25% of them halt within this memany step, including the Champion, which ran for 47,176,870 steps before halting.
* This leaves 140 million machines which run for a long time.
So the question is: do those TMs run FOREVER, or have we just not run them long enough?
The goal of the BB challenge project was to answer this question. There is no universal algorithm that works on all TMs, so instead a series of (semi-)deciders were built. Each decider takes a TM, and (based on some proven heuristic) classifies it as either "definitely runs forever" or "maybe halts".
Four deciders ended up being used:
* Loops: run the TM for a while, and if it re-enters a previously-seen configuration, it definitely has to loop forever. Around 90% of machines do this or halt, so covers most.
6.01 million TMs remain.
* NGram CPS: abstractly simulates each TM, tracking a set of binary "n-grams" that are allowed to appear on each side. Computes an over-approximation of reachable states. If none of those abstract states enter the halting transition, then the original machine cannot halt either.
Covers 6.005 million TMs. Around 7000 TMs remain.
* RepWL: Attempts to derive counting rules that describe TM configurations. The NGram model can't "count", so this catches many machines whose patterns depend on parity. Covers 6557 TMs.
There are only a few hundred TMs left.
* FAR: Attempt to describe each TM's state as a regex/FSM.
* WFAR: like FAR but adds weighted edges, which allows some non-regular languages (like matching parentheses) to be described
* Sporadics: around 13 machines had complicated behavior that none of the previous deciders worked for. So hand-written proofs (later translated into Rocq) were written for these machines.
All of the deciders were eventually written in Rocq, which means that they are coupled with a formally-verified proof that they actually work as intended ("if this function returns True, then the corresponding mathematical TM must actually halt").
Hence, all 5-states TMs have been formally classified as halting or non-halting. The longest running halter is therefore the champion- it was already suspected to be the champion, but this proves that there wasn't any longer-running 5-state TM.
In legacy bbchallenge's seed database, machines were run for exactly 47,176,870 steps, and we were left with about 80M nonhalting candidates. Discrepancy comes from Coq-BB5 not excluding 0RB and other minor factors.
Also, "There are 181 million ish essentially different Turing Machines with 5 states", it is important to note that we end up with this number only after the proof is completed, knowing this number is as hard as solving BB(5).
"essentially different" is not a statically-checked property. It is discovered by the enumeration algorithm (Tree Normal Form, see Section 3 of https://arxiv.org/pdf/2509.12337); in particular, for each machine, the algorithm needs to know it if will ever reach an undefined transition because if it does it will visit the children of that machine (i.e. all ways to set that reached undefined transition).
Knowing if an undefined transition will be ever reached is not computable, hence knowing the number of enumerated machines is not computable in general and is as hard as solving BB(5).
When I first heard it I thought about using some kind of similar symmetry arguments (e.g. swapping left-move and right-move). Maybe there are also more elaborate symmetry arguments of some kind.
Isn't it fair to say that there is no single objective definition of what differences between machines are "essential" here? If you have a stronger theory and stronger tools, you can draw more distinctions between TMs; with a weaker theory and weaker tools, you can draw fewer distinctions.
By analogy, suppose you were looking at groups. As you get a more sophisticated group theory you can understand more reasons or ways that two groups are "the same". I guess there is a natural limit of group isomorphism, but perhaps there are still other things where group structure or behavior is "the same" in some interesting or important ways even between non-isomorphic groups?
Is there some reason why Rocq is the proof assistant of choice and not one of the others?
also.... https://www.youtube.com/watch?v=c3sOuEv0E2I
For example:
https://teorth.github.io/equational_theories/
https://teorth.github.io/pfr/
- https://conwaylife.com/, on Conway's GoL and other cellular automata
- Googology, https://googology.fandom.com/wiki/Googology_Wiki and https://googology.miraheze.org/wiki/Main_Page, on big numbers
The original cryptographic challenges like the DES challenge and the RSA challenges had a goal to demonstrate something about the strength of cryptosystems (roughly, that DES and, a fortiori, 40-bit "export" ciphers were pretty bad, and that RSA-1024 or RSA-2048 were pretty good). The EFF Cooperative Computing Award had a further goal -- from the 1990s -- to show that Internet collaboration is powerful and useful.
Today I would say that all of these things have outlived their original goals, because the strength of DES, 40-bit ciphers, or RSA moduli are now relatively apparent; we can get better data about the cost of brute-force cryptanalytic attacks from the Bitcoin network hashrate (which obviously didn't exist at all in the 1990s), and the power and effectiveness of Internet collaboration, including among people who don't know each other offline and don't have any prior affiliation, has, um, been demonstrated very strongly over and over and over again. (It might be hard to appreciate nowadays how at one time some people dismissed the Internet as potentially not that important.)
This Busy Beaver collaboration and Terence Tao's equational theories project (also cited in this paper) show that Internet collaboration among far-flung strangers for substantive mathematics research, not just brute force computation, is also a reality (specifically now including formalized, machine-checked proofs).
There's still a phenomenon of "grid computing" (often with volunteer resources), working on a whole bunch of computational tasks:
https://en.wikipedia.org/wiki/List_of_grid_computing_project...
It's really just the specific "establish the empirical strength of cryptosystems" and "show that the Internet is useful and important" 1990s goals that are kind of done by this point. :-)
https://bbchallenge.org/13650583
Working on it did make people cleverer at analyzing the behavior of software, but it's not obvious that those skills or associated techniques will directly transfer to analyzing the behavior of much more complex software that does practical stuff. The programs that were being analyzed here are extraordinarily small by typical software developer standards.
To be more specific, it seems conceivable to me that some of these methods could inspire deductive techniques that can be used in some way in proof assistants or optimizing compilers, in order to help ensure correctness of more complicated programs, but again that application isn't obvious or guaranteed. The people working on this collaboration would definitely have described themselves as doing theoretical computer science rather than applied computer science.
This particular proof is doubly-academic in the sense that the value was already known, this is just a way to make it easier to independently verify the result.
It's a part of a broader movement to provide machine proofs for other stuff (e.g., Fermat's last theorem), which may be beneficial in some ways (e.g., identifying issues, making parts of proofs more "reusable").
G. H. Hardy's autobiography was titled "A Mathematician's Apology" because he, somewhat jokingly, wanted to apologize for a life of pure math... totally academic and completely useless.
Then a few years after he passed away, his math was key for The Manhattan Project to build the first successful nuclear bomb. His math lead to the device that changed the world and affected every aspect of human life for a century.
It's only "useless" until someone finds a "use".
P.S. The book documents his time with Ramanujan. If you liked the film "The Man Who Knew Infinity", you should read Apology
I just want to know.
One benefit is that it is related to foundations of mathematics. If you can find busy beaver number 745, you can prove or disprove Maths. If you can find busy beaver 6, you can prove few collatz like conjecture which is an open problem.
Most core research doesn't help directly, but through indirect means. e.g. This would have led to finding the areas of improvements in Lean due to its complexity. I know for fact that Lean community learns a lot from projects like Mathlib and this. And Lean is used in all sort of scenarios like making AWS better to improving safety of flights.
> Structure of the proof. The proof of our main result, Theorem 1.1, is given in Section 6. The structure of the proof is as follows: machines are enumerated arborescently in Tree Normal Form (TNF) [9] – which drastically reduces the search space’s size: from 16,679,880,978,201 5-state machines to “only” 181,385,789; see Section 3. Each enumerated machine is fed through a pipeline of proof techniques, mostly consisting of deciders, which are algorithms trying to decide whether the machine halts or not. Because of the uncomputability of the halting problem, there is no universal decider and all the craft resides in creating deciders able to decide large families of machines in reasonable time. Almost all of our deciders are instances of an abstract interpretation framework that we call Closed Tape Language (CTL), which consists in approximating the set of configurations visited by a Turing machine with a more convenient superset, one that contains no halting configurations and is closed under Turing machine transitions (see Section 4.2). The S(5) pipeline is given in Table 3 – see Table 4 for S(2,4). All the deciders in this work were crafted by The bbchallenge Collaboration; see Section 4. In the case of 5-state machines, 13 Sporadic Machines were not solved by deciders and required individual proofs of nonhalting, see Section 5.
So, they figured out how to massively reduce the search space, wrote some generic deciders that were able to prove whether large amounts of the remaining search spaces would halt or not, and then had to manually solve the remaining 13 machines that the generic deciders couldn't reason about.
Second, you have to consider what's feasible in finite time. You can enumerate machines and also enumerate proofs, but any concrete strategy has limits. In the case of BB(5), the authors did not use naive brute force. They exhaustively enumerated the 5-state machines (after symmetry reductions), applied a collection of certified deciders to prove halting/non-halting behavior for almost all of them, and then provided manual proofs (also formalized) for some holdout machines.
https://scottaaronson.blog/?p=8972
You need proofs of nontermination for machines that don't halt. This isn't possible to bruteforce.
The most naive algorithm is to use the assistant to check if each length 1 coq program can prove halting with computation limited to 1 second, then check each length 2 coq program running for 2 seconds, etc till the proofs in the arxiv paper are run for more than their runtime.
... even though your actual method of discovering the programs in question was usually not purely exhaustive search (though it may have included some significant computer search components).
More precisely, we could say that if mathematicians are working in a formal system, they can't find any results that a computer with "sufficiently large" memory and runtime couldn't also find. Yet currently, human mathematicians are often more computationally efficient in practice than computer mathematicians, and the human mathematicians often find results that bounded computer mathematicians can't. This could very well change in the future!
Like it was somewhat clear in principle that a naive tree search algorithm in chess should be able to beat any human player, given "sufficiently large" memory and runtime (e.g. to exhaustively check 30 or 40 moves ahead or something). However, real humans were at least occasionally able to beat top computer programs at chess until about 2005. (This analogy isn't perfect because proof correctness or incorrectness within a formal system is clear, while relative strength in chess is hard to be absolutely sure of.)
They solved a lot of the machines with something like that, and some with more advanced methods, and "13 Sporadic Machines" that don't halt were solved with a hand coded proof.
For BB(5) the proof of its value is an indirect computation. The verification process involved both computation (running many machines) and proofs (showing others run forever or halt earlier). The exhaustiveness of crowdsourced proofs was a tour de force.