1 comment

  • kouteiheika 1 hour ago
    It'd be nice if you could add a separate leaderboard for open-weight models on your results page (or add the ability to filter out proprietary models).

    Also, why use an agent for this? This doesn't make much sense to me, considering it's supposed to be "measuring how well models can find and fix errors in human-written text" -- here you're just as much measuring the model's agentic capabilities as you're measuring its ability to correct the text.

    I suppose this is somewhat of an interesting benchmark too, but if I were interested in cost-effective proofreading of a ton of text I'd just do it the old-fashioned way: split my text into chunks, write a nice prompt telling the model to proofread the given text and return the result, attach the prompt to each chunk, and let it rip.
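
    The chunk-and-prompt approach could be sketched roughly like this (a minimal illustration; `call_model` stands in for whatever LLM API you'd actually use, and the chunk size and prompt wording are assumptions):

    ```python
    PROMPT = ("Proofread the following text. Fix spelling and grammar errors only; "
              "otherwise return the text unchanged.\n\n")

    def chunk_text(text: str, max_chars: int = 4000) -> list[str]:
        """Split text into chunks at paragraph boundaries, each at most max_chars."""
        chunks, current = [], ""
        for para in text.split("\n\n"):
            # Start a new chunk if appending this paragraph would exceed the limit.
            if current and len(current) + len(para) + 2 > max_chars:
                chunks.append(current)
                current = para
            else:
                current = f"{current}\n\n{para}" if current else para
        if current:
            chunks.append(current)
        return chunks

    def proofread(text: str, call_model, max_chars: int = 4000) -> str:
        """Attach the prompt to each chunk, send it off, and reassemble the replies.

        call_model: any function that takes a prompt string and returns the
        model's reply as a string (hypothetical; plug in your API client here).
        """
        return "\n\n".join(call_model(PROMPT + chunk)
                           for chunk in chunk_text(text, max_chars))
    ```

    Splitting on paragraph boundaries keeps sentences intact, which matters here since a chunk cut mid-sentence would read as an error to the model.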