Test, don't (just) verify

(alperenkeles.com)

121 points | by alpaylan 4 hours ago

24 comments

  • anon-3988 1 hour ago
    Before we start writing Lean, perhaps we can start with something "dumber" like Rust or any typed language. If you want to write something correct, or you care about correctness, you should not be using dynamic languages. The most useful and most used type of test is type checking.

    Type errors, especially once you have designed your types to be correct by construction, are extremely, extremely useful for LLMs. Once you have the foundation correct, they just have to wiggle through that narrow gap until they figure out something that fits.

    But from what I have understood and read so far, I am not convinced by OP's "formal verification". A simple litmus test is to take any recent task from your day job and try to write a formal specification for it. Is it even doable? Reasonable? Is it even there? For me the most useful kind of verification is the verification of lower-level tools, i.e. data structures, languages, compilers, etc.

    For example, the type signature of Vec::operator[usize] in Rust returns T. This cannot be true, because it cannot guarantee to return a T given ANY usize. To me, panicking is one of the laziest and worst ways to put in a specification. It means that every single line of Rust code is now able to enter this termination state.
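
    To make that concrete, here's a minimal sketch (standard library behavior, my example, not anything from the article):

        fn main() {
            let v: Vec<i32> = vec![10, 20, 30];

            // The Index impl promises a T for ANY usize, so out of range it
            // can only panic: `v[99]` would abort with index-out-of-bounds.

            // `get` puts the partiality into the type instead: Option<&T>, not &T.
            match v.get(99) {
                Some(x) => println!("got {x}"),
                None => println!("index 99 is out of range, handled gracefully"),
            }
        }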

    • phaedrus 1 hour ago
      I once attended a talk by someone who is or was big in the node.js world. He opened with the premise, "a static type check is just a stand-in for a unit test."

      I wanted to throw a shoe at him. A static type check doesn't stand in for "a" unit test; static typing stands in for an unbounded number of unit tests.

      Put another way, the common misconception among users of languages like JavaScript and Python that unit testing is just as good as type checking (plus more flexible) is a confusion between the "exists" and "for all" logical quantifiers.
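
      A toy illustration of the quantifier difference (my sketch, not from the talk):

          fn double(x: u32) -> u32 {
              x * 2
          }

          // A unit test is an "exists" claim: for this one input, the output is right.
          #[test]
          fn doubles_three() {
              assert_eq!(double(3), 6);
          }

          // The signature is a "for all" claim checked at compile time: for every
          // u32 you pass in, you get a u32 back -- never a string, nil, or undefined.
          // Covering that dynamically would take one unit test per possible input.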

      • kccqzy 1 hour ago
        Plus, it is simply more enjoyable to design the types in your program than to write unit tests. The fun factor comes from operating on a higher level of abstraction and engages more of your brain’s puzzle-solving mode than just writing unit tests. Making yourself think about “for all x” rather than a concrete x forces your brain to consider deeply the properties of x being used.
      • seg_lol 52 minutes ago
        It takes hundreds of unit tests to replace a type.

        Start using properties and it is in the thousands.

        Most code should be typed. Python is great for prototypes, but once the prototype gels, you need types.

        • lelanthran 11 minutes ago
          I've always hated Python. Could never enjoy it at all. Pretty much the same poor DX as PHP, JavaScript, Ruby, etc.

          Finally set up neovim with pyright; use types on every single fucking thing, and now I love Python[1].

          Can't wait to see the TC39 type-annotations proposal become a reality (learned about it just this past week on HN, actually). Maybe I'll enjoy JavaScript too.

          --------------------

          [1] Within reason; the packaging experience is still extremely poor!

        • baq 20 minutes ago
          Good that Python supports types then
          • lelanthran 10 minutes ago
            > Good that Python supports types then

            "Optional typing" is not the same as "Static typing".

            Great, my program will crash because I forgot to opt in to typing :-/

          • SR2Z 10 minutes ago
            Poorly, though, and with lots of edge cases and foot guns to make you miserable once your code calls out to numpy or gets some JSON.
      • lesuorac 55 minutes ago
        If you care about the type of a parameter you can just add an assertion in the method /s
    • packetlost 1 hour ago
      I think it's possible to write correct systems in dynamic languages, just not the ones we commonly use like Python and JavaScript. I find Clojure, for example, to be a dynamic language that is pretty easy to manage, and I attribute that to its immutable nature and data-centric ethos. I'm sure there are other dynamic languages that would work as well.

      Now, I wouldn't necessarily use Clojure on a huge multi-organization codebase (maybe it's fine, this is outside of my experience with it), but it can be the right tool for some jobs.

      • brabel 1 hour ago
        Common Lisp as well. I can’t explain why, but type errors are just not something I struggle with in Common Lisp! But it is in JS and Python for sure. Maybe someone knows why it feels different?
        • packetlost 59 minutes ago
          I haven't done much with CL so I can only speculate, but I think stricter FP principles in general work to minimize the downsides of dynamic typing. CL, to my understanding, isn't the most "pure" when it comes to FP, but does a good job at giving the programmer a lot of power to constrain and explore systems.
    • gaogao 20 minutes ago
      Yeah, Rust has been pretty good for formal verification so far. Hoare-style spec contracts are, I think, the way forward, especially since they flow fairly naturally from unit tests. I've been using Hax to pretty good effect so far. I'm generally skeptical that advances in Lean proof solving by dedicated models are that useful for program verification, compared to generalist models, though they could help lower costs a good bit.
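
      The unit-test-to-contract step might look something like this (plain asserts as stand-ins; I'm not reproducing Hax's actual annotation syntax here):

          // Hoare-style contract written with ordinary asserts:
          // precondition {b != 0}, postcondition {q*b + a%b == a}.
          fn checked_div(a: u32, b: u32) -> u32 {
              assert!(b != 0);                   // precondition
              let q = a / b;
              debug_assert!(q * b + a % b == a); // postcondition
              q
          }

          #[test]
          fn divides() {
              // The unit test that the contract generalizes from.
              assert_eq!(checked_div(7, 2), 3);
          }
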
    • didip 17 minutes ago
      I agree. And learning a typed language is significantly easier now that AI can explain everything. The types also help the AI write correct code. A very positive feedback loop.
    • loglog 55 minutes ago
      The author is in the comfortable position of working on a system that does have a formal specification and a formally verified reference implementation. The post is not about how they wish things would work, but how their existing system (Cedar) works.

      Regarding your point on Rust: the vast majority of software has nowhere near the amount of static guarantees provided by Rust. If you need more, use static memory allocation; that's what people do for safety-critical systems. By the way, it seems that Rust aborts on OOM errors rather than panicking: https://github.com/rust-lang/rust/issues/43596

  • MarkMarine 3 minutes ago
    This is an aside, because I agree with the author's core point, but spelling errors, grammatical mistakes, and typos now actually imply something authored by a human. This sentence:

    “It affects point number 1 because AI-assisted programming is a very natural fit fot specification-driven development.”

    made me smile. It felt like reading something handmade that hadn't been through the filters and presses of modern internet writing.

  • zipy124 3 hours ago
    I think this misses a lot of the reasons why learning verification is important. For instance, learning the concept of invariants and their kinds, such as loop invariants. They make reasoning about code in general easier; even if you never formally do any verification, they make it easier to write tests or asserts(). A substantial number of bugs are due to the program being in a different state from the one the programmer assumed, and there are other tools that help with this. For example, a statically typed language is a form of verification, since it verifies that a variable has a specific type, and thus which operations can be performed on it, and limits the valid input and output range of any function. Languages like Rust also verify memory correctness, and are extremely useful tools.
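
    As a sketch of mine (not from the article), a loop invariant can pay off as a plain assert even with no verifier in sight:

        /// Binary search with its loop invariant written down as runtime checks.
        /// Invariant: the search window `lo..hi` stays well-formed and in bounds,
        /// and if `target` is present at all, its index lies inside the window.
        fn binary_search(xs: &[i32], target: i32) -> Option<usize> {
            let (mut lo, mut hi) = (0, xs.len());
            while lo < hi {
                debug_assert!(lo <= hi && hi <= xs.len()); // the invariant, checked
                let mid = lo + (hi - lo) / 2; // avoids the classic (lo + hi) / 2 overflow
                if xs[mid] == target {
                    return Some(mid);
                } else if xs[mid] < target {
                    lo = mid + 1;
                } else {
                    hi = mid;
                }
            }
            None
        }
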
  • tgtweak 2 hours ago
    I think more salient here (in the near term, certainly) is setting up adversarial agents for testing/verification. That has been a big win for me in multi-agent workflows; when Claude first got "computer use", that was a very big step in closing this loop and avoiding the manual babysitting involved in larger projects. PSA that it's not a silver bullet, as the "analyzer" can still get tripped up and falsely declare something broken (or functional), but it greatly reduces the "Hey, I've done the task" when the task is not done or the output is broken.
    • esafak 1 hour ago
      So it is not reliable enough to run automatically?
  • getregistered 3 hours ago
    > AI-assisted programming pushes the limits of programming from what you can implement to what you can specify and what you can verify.

    This really resonates. We can write code a lot faster than we can safely deploy it at the moment.

    • marcosdumay 3 hours ago
      > We can write code a lot faster than we can safely deploy it at the moment.

      We always could. That has been true since the days we programmed computers by plugging jumper wires into a panel.

      • lelanthran 6 minutes ago
        > We always could. That has been true since the days we programmed computers by plugging jumper wires into a panel.

        That's news to me, and I'm an ancient greybeard in development.

        If you have a team of 1x f/time developer and 1x f/time tester, the tester would be spending about half their day doing nothing.

        Right now, a single developer with Claude Code can very easily overwhelm even a couple of testers with new code to test.

    • acedTrex 2 hours ago
      > We can write code a lot faster than we can safely deploy it at the moment.

      This has always been the case?

  • vzaliva 1 hour ago
    There are many arguable points in this blog post, but I want to highlight just one: the need for formal specification. It is indeed a big issue. However, one must distinguish between a full specification, which is sufficient to prove functional correctness, and a specification of certain security or safety properties, which only allows us to verify those properties. For example, we can easily specify the property that "the program shall never read uninitialised memory" and prove it. That wouldn't guarantee that the program is functionally correct, but it would at least rule out a whole class of potential errors.
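
    Translated into test-shaped terms, the distinction looks something like this (a toy sorting example of my own, not the commenter's):

        fn my_sort(mut v: Vec<i32>) -> Vec<i32> {
            v.sort();
            v
        }

        // Partial property: length is preserved. Rules out a whole class of bugs
        // (dropping or duplicating elements) without implying functional correctness.
        #[test]
        fn preserves_length() {
            let input = vec![3, 1, 2];
            assert_eq!(my_sort(input.clone()).len(), input.len());
        }

        // Full functional spec: output is sorted AND a permutation of the input.
        #[test]
        fn full_spec() {
            let input = vec![3, 1, 2];
            let out = my_sort(input.clone());
            assert!(out.windows(2).all(|w| w[0] <= w[1]));
            let mut expected = input;
            expected.sort();
            assert_eq!(out, expected);
        }
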
  • xp84 39 minutes ago
    For the verification experts (and forgive me, because I have almost none of the math background for this stuff):

    > This makes formal verification a prime target for AI-assisted programming. Given that we have a formal specification, we can just let the machine wander around for hours, days, even weeks.

    Is this sentiment completely discounting that there can be many possible ways to write a program that satisfies certain requirements, all with correct outputs? Won't many of these be terrible in terms of performance, time complexity, etc.? I know that in the most trivial case, AI doesn't jump straight to O(n³) solutions or anything, but there's also no guarantee it won't have bugs that degrade performance, as long as they don't interfere with technical correctness.

    Also, are we pretending that having Claude spin for "even weeks" is free?

    • gaogao 17 minutes ago
      Yup, I've already spent like $20k using Claude to verify things, so like there's probably some room for cost cutting.
    • hun3 32 minutes ago
      Verified software should satisfy the liveness property; otherwise, an infinite loop that never returns would pass as "correct."

      Verifying realtime software goes even further and enforces an upper bound on the maximum number of ticks it takes to complete the algorithm in all cases.

  • nileshtrivedi 18 minutes ago
    > proof assistants, traditionally, don't use our classic two's complement integers packed into words in our memory, they use Peano numbers

    Why can't we just prove theorems about the standard two's complement integers, instead of Nat?
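
    For flavor, the gap between the two in Lean 4 (my sketch; Nat and UInt32 are Lean's own types, the examples are mine):

        -- Peano naturals are an inductive type: zero, or successor of a Nat.
        -- They are unbounded, so facts like this hold for every n:
        example (n : Nat) : n < n + 1 := Nat.lt_succ_self n

        -- A fixed-width type wraps around, so the "same" fact is false there;
        -- instead you get theorems about modular arithmetic:
        example : (4294967295 : UInt32) + 1 = 0 := by decide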

  • andrewmutz 2 hours ago
    I agree completely with the author that AI assisted coding pushes the bottleneck to verification of the code.

    But you don't really need complete formal verification to get these benefits. TDD gets you a lot of them as well. Perhaps your verification is less certain, but it's much easier to get high automated test coverage than it is to get a formally verifiable codebase.

    I think AI assisted coding is going to cause a resurgence of interest in XP (https://en.wikipedia.org/wiki/Extreme_programming) since AI is a great fit for two big parts of XP. AI makes it easy to write well-tested code. The "pairing" method of writing code is also a great model for interacting with an AI assistant (much better than the vibe-coding model).

    • 9rx 1 hour ago
      Trouble is that TDD, and formal proofs to much the same extent, assume a model of "double-entry accounting": you write both the test/proof and the implementation, and then make sure they agree. As in accounting, the assumption is that the probability of you making the same mistake twice is fairly low, giving high confidence in accuracy when the two sides agree. When there is a discrepancy, you can then unpack whether the problem is in the test/proof or in the implementation. The fallible human can easily screw up either.

      But if you only fill out one side of the ledger, so to speak, an LLM will happily invent something that ensures the ledger balances, even when your side of the entry is completely wrong. So while this type of development is an improvement over blindly trusting an arbitrary prompt without any checks and balances, it doesn't really get us to verifying the code to the degree we could achieve before. This remains an unsolved problem.

      • altmanaltman 1 hour ago
        I don't fully understand what you mean by accounting assuming the probability of making the same mistake twice is fairly low. Double-entry bookkeeping can only tell you whether the books are balanced. We absolutely cannot assume that the books reflect reality just because they're balanced; you don't need to mess up twice for the books to diverge from the truth.

        Also, tests and code are independent, while in double-entry bookkeeping you always affect both sides. Audits exist for a reason.

        • layer8 32 minutes ago
          With double-entry bookkeeping, the only way an error can slip through is if you make the same error on both sides; otherwise the books wouldn't balance. A similar thing is true for testing: if you make an error both in your test and in your implementation, they can cancel out and appear error-free.

          I don’t quite agree with that reasoning, however, because a test that fails to test the property it should test for is a very different kind of error than having an error in the implementation of that property. You don’t have to make the “same” error on both sides for an error to remain unnoticed. Compared to bookkeeping, a single random error in either the tests or the implementation is more likely to remain unnoticed.

        • 9rx 1 hour ago
          What about the concept of "high confidence" is not understandable?
      • andrewmutz 1 hour ago
        That sounds right in theory, but in practice my code is far, far higher quality when I do TDD than when I don't. This applies whether or not I'm using an AI coding assistant.
        • MetaWhirledPeas 1 hour ago
          I don't think GP disagrees. They are (I think) stating that AI-assisted TDD is not as reliable as human TDD, because AI will invent a pointless test just to achieve a passing outcome.
  • andai 3 hours ago
    Related discussion from last week:

    AI will make formal verification go mainstream

    https://news.ycombinator.com/item?id=46294574

  • MetaWhirledPeas 1 hour ago
    I lack the author's level of education and eloquence, but I have my own notion that I think agrees with theirs: specification is difficult and slow, and bugs do not care whether they are part of the official specification or not.

    Some software needs formal verification, but all software needs testing.

    On another subject...

    > Tests are great at finding bugs ... but they cannot prove the absence of bugs.

    I wish more people understood this.

    • enceladus76 9 minutes ago
      Unless people therefore decide that testing is unnecessary... which has happened a lot in academia. It's one of the reasons testing is not taught that well at some universities...
  • aidenn0 3 minutes ago
    See also Regehr's example[1], where a formally verified C compiler generated incorrect output because of an inconsistent value in <limits.h>. (TL;DR: the compiler may pick whether "char" is signed or unsigned. CompCert picked one, but the Linux system header used the other for CHAR_MIN and CHAR_MAX.)

    1: https://blog.regehr.org/archives/482 (there were many issues here, not just with CompCert)

  • hacker_homie 24 minutes ago
    Test, just don’t verify.

    How I learned to deploy faster.

  • ecocentrik 2 hours ago
    Doesn't this run into the same bottleneck as developing AI-first languages? AIs need tons of training material on how to write good formal verification code, or code in new AI-first languages, and that material doesn't exist. The only solution is large-scale synthetic generation, which is hard to do if humans, on some level, can't verify that the synthetic data is any good.
  • throw-12-16 1 hour ago
    people can't even be bothered to write code and you expect them to test it?
  • CuriouslyC 3 hours ago
    Formal verification is a nice idea but it's a big hill to climb from where we're at. Most people can't even get agents to robustly E2E QA code, which is a much smaller hill to climb for (probably) larger benefits. I'm sure this area will improve over time though, since it is an eventual unlock for fully autonomous engineering.
    • __MatrixMan__ 2 hours ago
      I think for most complex systems, robust E2E QA is a waste of money. A small handful of E2E smoke tests plus a thoughtful application of smaller tests is usually enough. Though to be fair, agents aren't good at that either.
    • smccabe0 2 hours ago
      I think the section on AI from Zero to QED (a guide to proofs in Lean) gives a sober path forward from the perspective of market makers and trading:

      "Imagine market infrastructure where agents must prove, before executing, that their actions satisfy regulatory constraints, risk limits, fairness properties, and eventually machine-checkable proofs of Pareto efficiency of market mechanisms. This is a big, hairy, ambitious goal. Not “we reviewed the code” but “the system verified the proof.” The agent that cannot demonstrate compliance cannot act."

      https://sdiehl.github.io/zero-to-qed/20_artificial_intellige...

      • TeodorDyakov 2 hours ago
        I dream of a future where, before any software is released, we can predict 100 years into the future what effect it will have on every living thing, and not release it if the unhappiness delta for some living thing exceeds a certain threshold.
        • layer8 27 minutes ago
          That would require making the behavior of the living things part of the formal specification, which isn’t really possible.
  • visarga 1 hour ago
    At the end of the day, you either validate every line of code manually, or you have the agent write tests. Automate your review.
  • baq 2 hours ago
    We won't be formally verifying millions of LOC anytime soon; don't get your hopes up that high.

    ...but we will be modelling those 5-10 kLOC modules across multiple services that do critical business logic or distributed transactions. This was unthinkable a couple of months ago, and today it's a read-only-Friday experiment away (try it with a frontier model and you'll be surprised).

  • esafak 3 hours ago
    Alperen,

    Thanks for the article. Perhaps you could write a follow-up article or tutorial on your favored approach, Verification-Guided Development? This is new to most people, including myself, and you only briefly touch on it after spending most of the article on what you don't like.

    Good luck with your degree!

    P.S. Some links in your Research page are placeholders or broken.

    • alpaylan 3 hours ago
      I'll add some links to the original VGD paper and related articles; that should help in the short term. Thank you! I'll look into writing something on VGD itself in the next few weeks.
  • badgersnake 3 hours ago
    > AI is making formal verification go mainstream.

    This nonsense again. No. No it isn’t.

    I’m sure the people selling it wish it was, but that doesn’t make it true.

    • baq 2 hours ago
      You haven't been paying attention.

      The fact that we're reading about it here today and have read about it over the past weeks is one piece of evidence. Another is that we weren't reading about it in the months before November. Opus 4.5 and GPT 5.2 have crossed a usefulness frontier.

      Anecdotally, I've had some success (guiding LLMs) writing Alloy models in the past month and ensuring conformance with code. Building these would've been fairy tales, unjustifiable from an ROI perspective, just this summer. The landscape has changed qualitatively.

    • AnimalMuppet 3 hours ago
      LLM-style AI isn't great for formal verification, as far as I understand. And the recent advances in AI didn't do much for the kind of AI that is useful for formal verification.
      • otterley 2 hours ago
        You don’t use AI to perform formal verification. You give the agent access to verification tools whose output can then be fed back to the model.

        It’s the same design as giving LLMs the current time, since they can’t tell time themselves, either.

    • whatisthishere 3 hours ago
      [dead]
  • whatisthishere 3 hours ago
    [dead]
  • sapphirebreeze 2 hours ago
    [dead]
  • omgJustTest 2 hours ago
    my user should get upvotes for this :)