3 comments

  • liamwire 1 hour ago
    You've put a lot into this, that much is clear. However, this suffers from the usual problem of the era of abundant bespoke tooling: It's hard to figure out what this even does. I read the README, the examples, and the scoring function, but I still couldn't easily articulate this in a meaningful way to a third person. If you want adoption, you need to solve for this first.

    Your README's first few lines, which are as far as you should expect most people to go, mention a 2-minute explainer video. But it's actually 45 seconds. Why say otherwise? Hyperbole, maybe, but to me it raises the question of whether any of this was QC'd by a human at all before publishing. If your headline marketing material is in question, I'm inclined to make assumptions about the rest of it as well.

    Edit: I should add, I'm glad I didn't check out your website before commenting, because I probably would've been too intimidated to comment. My career and expertise wouldn't measure up to yours. I do stand by my thoughts though, I think we often get so deep into our own domain and needs that we can briefly lose sight of our average audience. I’ll try this out myself on a website repo I'm updating and share how it went later on.

    Part 2: Using it went pretty well. In my case there wasn't much in the way of improvements identified, but it's a fairly simple static Astro site with minimal JS used to market a business, so there's far less surface area than this is perhaps intended for. It looks like it works well. Tighten up the messaging on the repo and I think you've got a good tool.

  • jmilinovich 10 hours ago
    I had 30 broken Playwright tests and no way to tell which ones actually mattered. The problem wasn’t “fix the tests” — it was that there’s no coverage tool for test infrastructure trustworthiness. I had to build the ruler before I could measure anything.

    So I wrote a file that defined a composite metric (four weighted components → one score), an improvement loop, and constraints. Pointed Claude at it. Went to bed. Woke up to 12 commits, 47 → 83.
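    For anyone wondering what "four weighted components → one score" means concretely: the real script is bash + jq, but the idea fits in a few lines of Python. The component names and weights below are made up for illustration, not taken from the actual repo:

    ```python
    # Hypothetical composite metric: four weighted components -> one 0-100 score.
    # Component names and weights are illustrative, not from the actual GOAL.md repo.
    COMPONENTS = {
        "tests_passing": (60, 0.4),   # (raw score 0-100, weight)
        "lint_clean":    (80, 0.2),
        "docs_coverage": (30, 0.2),
        "type_coverage": (50, 0.2),
    }

    def composite_score(components):
        """Collapse weighted component scores into the one scalar the loop optimizes."""
        return sum(score * weight for score, weight in components.values())

    print(composite_score(COMPONENTS))  # -> 56.0
    ```

    The point is just that the agent gets a single number to push on, the way val_bpb works for model training.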

    The file became GOAL.md. The insight that surprised me: most software doesn’t have a natural scalar metric like val_bpb. You have to construct it. Documentation quality, API trustworthiness, test infrastructure confidence — these things have no pytest --cov equivalent. But once you build the ruler, the autoresearch loop works on them too.

    The part I’m most uncertain about: the “dual score” pattern. When the agent is building its own measuring tools, it can game the metric by weakening the instrument. So the docs-quality example has two scores — one for the docs, one for the linter itself. The agent has to improve the telescope before it can use it. I think this is load-bearing but I’d love to hear if others have found different solutions to the same problem.
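    One way to sketch that gating (this is my reading of the pattern, not the repo's actual implementation; the function names and the all-or-nothing gate are assumptions):

    ```python
    # Hypothetical sketch of the "dual score" pattern: the docs score only counts
    # when the measuring instrument (the linter) passes its own test suite, so the
    # agent can't raise the docs score by weakening the instrument.
    def linter_score(tests_passed, tests_total):
        """Score for the instrument itself: fraction of linter self-tests passing."""
        return 100 * tests_passed / tests_total

    def trusted_docs_score(raw_docs_score, tests_passed, tests_total):
        """Docs score, gated on instrument health: a broken telescope reports 0."""
        if linter_score(tests_passed, tests_total) < 100:
            return 0  # instrument is suspect; don't trust its reading
        return raw_docs_score

    print(trusted_docs_score(83, 12, 12))  # healthy linter -> 83
    print(trusted_docs_score(83, 10, 12))  # weakened linter -> 0
    ```

    A softer variant would multiply the two scores instead of hard-gating, but the hard gate makes the incentive unambiguous.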

    Easiest way to try it: paste this into Claude Code, Cursor, or any coding agent and point it at one of your repos:

    Read github.com/jmilinovich/goal-md — read the template and examples. Then write me a GOAL.md for this repo and start working on it.

    Happy to hear what breaks. The scoring script is bash + jq so it’s not exactly production-grade, and the examples are biased toward the kinds of projects I work on. More examples from different domains would make the pattern sharper.

    • DANmode 3 hours ago
      Thanks for sharing how this came to be - really sets the post apart.
      • jmilinovich 2 hours ago
        Appreciate that! This fast feedback loop is what I like most about personal software.
  • derwiki 3 hours ago
    Seems like the same utility as autoresearch, but much simpler. Does autoresearch work better?
    • jmilinovich 2 hours ago
      Autoresearch is awesome for stuff that has really clear loss functions, but most problems don’t have that. So if you’re trying to improve product quality or write great docs, you can use GOAL.md.