Recently, I asked Codex CLI to refactor some HTML files. It didn't literally copy and paste snippets here and there as I would have done myself; it rewrote them from memory, removing comments in the process. There was a section with 40 successive <a href...> links with complex URLs.
A few days later, just before deployment to production, I wanted to double check all 40 links. First one worked. Second one worked. Third one worked. Fourth one worked. So far so good. Then I tried the last four. Perfect.
Just to be sure, I proceeded with the fifth one. 404. Huh. Weird. The domain was correct though and the URL seemed reasonable.
I tried the other 31 links. ALL of them 404ed. I was totally confused. The domain was always correct. It seemed highly suspicious that all those websites would have moved their internal URLs at the same time. I didn't even remember that this part of the code had gone through an LLM.
Fortunately, I could retrieve the old URLs from old git commits. I checked the URLs carefully. The LLM had HALLUCINATED most of the path part of the URLs! Replacing things like domain.com/this-article-is-about-foobar-123456/ with domain.com/foobar-is-so-great-162543/...
These kinds of very subtle and silently introduced mistakes are quite dangerous. Be careful out there!
It was a fairly big refactoring, basically converting a working static HTML landing page into a Hugo website, splitting the HTML into multiple Hugo templates. I admit I was in quite a hurry and had to take shortcuts. I didn't have time to write automated tests and had to rely on manual tests for this single webpage. The diff was fairly big. It just didn't occur to me that the URLs would go through the LLM and could be affected! Lesson learnt haha.
Speaking of agents and tests, here's a fun one I had the other day: while refactoring a large code base I told the agent to do something precise to a specific module, refactor with the new change, then ensure the tests are passing.
The test suite is slow and has many moving parts; the tests I asked it to run take ~5 minutes. The thing decided to kill the test run, then made up another command that it claimed was the 'tests', so when I looked at the collapsed agent console in the IDE everything seemed fine: 'Tests ran successfully'.
Obviously the code changes also had a subtle bug that I only saw when pushing its refactoring to CI (and more waiting). At least there were tests to catch the problem.
This is why I'm terrified of large LLM slop changesets that I can't check side by side - but then that means I end up doing many small changes that are harder to describe in words than to just outright do myself.
This, and why are the URLs hardcoded to begin with? And given the chaotic rewrite by Codex, it would probably be more work to untangle the diff than to just do it yourself right away.
This is of course bad, but: humans also make (different) mistakes all the time. We could account for the risk of mistakes being introduced and build more tools that validate things for us. In a way LLMs encourage us to do this by adding other vectors of chaos into our work.
Like, why not have tools built into our environment that check that links are not broken? With the right architecture we could have validations for most common mistakes without the solution adding a bunch of tedious overhead.
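For example, something as small as this run in CI would have caught the hallucinated URLs. (Rough sketch, standard library only; the public/ directory and the HEAD-request policy are just assumptions about the setup.)

    # check_links.py - fail the build if any external <a href> returns an error
    import pathlib, sys, urllib.request
    from html.parser import HTMLParser

    class LinkCollector(HTMLParser):
        def __init__(self):
            super().__init__()
            self.links = []

        def handle_starttag(self, tag, attrs):
            if tag == "a":
                href = dict(attrs).get("href") or ""
                if href.startswith("http"):
                    self.links.append(href)

    broken = []
    for page in pathlib.Path("public").rglob("*.html"):  # assumption: built site lives here
        parser = LinkCollector()
        parser.feed(page.read_text(errors="ignore"))
        for url in parser.links:
            try:
                req = urllib.request.Request(url, method="HEAD",
                                             headers={"User-Agent": "link-check"})
                urllib.request.urlopen(req, timeout=10)
            except Exception:
                broken.append((str(page), url))

    for page, url in broken:
        print(f"BROKEN {url} (in {page})")
    sys.exit(1 if broken else 0)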
In the kind of situation described above, a meticulous coder actually makes no mistakes. They will, however, make a LOT more mistakes if they use LLMs to do the same.
I have already had to correct a LOT of crap similar to the above in refactoring-done-via-LLM over the last year.
When stuff like this was done by a plain, slow, organic human, it was far more accurate. And many times, completely accurate with no defects. Simply because many developers pay close attention when they are forced to do the manual labour themselves.
Sure the refactoring commit is produced faster with LLM assistance, but repeatedly reviewing code and pointing out weird defects is very stressful.
I agree, these kinds of stories should encourage us to set up more robust testing/backup/check strategies. Like you would absolutely have to do if you suddenly invited a bunch of inexperienced interns to edit your production code.
Your point not to rely on good intentions and to have systems in place to ensure quality is a good one - but your comparison to humans didn't sit well with me.
Very few humans fill in their task with made up crap then lie about it - I haven't met any in person. And if I did, I wouldn't want to work with them, even if they work 24/7.
Obligatory disclaimer for future employers: I believe in AI, I use it, yada yada. The reason I'm commenting here is I don't believe we should normalise this standard of quality for production work.
In these cases I explicitly tell the LLM to make as few changes as possible, and I also run a diff. And then I reiterate with a new prompt if too many things changed.
You can always run a diff. But how good are people at reading diffs? Not very. It's the kind of thing you would probably want a computer to do. But now we've got the computer generating the diffs (which it's bad at) and humans verifying them (which they're also bad at).
You’re just not using LLMs enough. You can never trust the LLM to generate a url, and this was known over two years ago. It takes one token hallucination to fuck up a url.
It’s very good at a fuzzy great answer, not a precise one. You have to really use this thing all the time and pick up on stuff like that.
Yeah so, the reason people use various tools and machines in the first place is to simplify work or everyday tasks by: 1) making the tasks execute faster, 2) getting more reliable outputs than doing it yourself, 3) making it repeatable. The LLMs obviously don't check any of these boxes, so why don't we stop pretending that we as users are stupid and don't know how to use them, and start taking them for what they are - cute little mirages, perhaps applicable as toys of some sort, but not something we should use for serious engineering work really?
Just stop. You need to take some time and run some Google searches before 2022 on HN and just read how blindsided this community has been by AI. Take the L, like everyone else.
Quite frankly, discussion about LLMs being hype is not useful for an innovation focused community, especially when it’s utterly wrong. The AI-is-hype people are so wrong, oh so wrong. This will be fun to revisit in just a year.
Stop what, mate? My words are not the words of someone who occasionally dabbles in the free ChatGPT tier - I've been paying for premium-tier AI tools for my entire company for a long time now. Recently we had to scale back their usage to just consulting mode, because the agent mode has gone from somewhat useful to a complete waste of time. We are now back to using them as a replacement for the now-enshittified search. But as you can see from my early adoption of these crap tools, I am open-minded. I'd love to see what great new application you have built using them. But if you don't have anything to show, I'll also take some arguments, you know, like the stuff I provided in my first comment.
I'll take the L when LLMs can actually do my job to the level I expect. LLMs can do some of my work, but they are tiring, they make mistakes, and they absolutely get confused by a sufficiently complex and large codebase.
Quite frankly, not being able to discuss the pros and the cons of a technology with other engineers absolutely hinders innovation. A lot of discoveries come out of mistakes.
Well, you see it hallucinates on long precise strings, but if we ignore that and focus on what it's good at, we can do something powerful. In this case, by the time it gets to outputting the URL, it has already determined the correct intent or next action (print out a URL). You use this intent to do a tool call to generate the URL. Small aside: its ability to figure out what and why is pure magic, for those still peddling the glorified-autocomplete narrative.
You have to be able to see what this thing can actually do, as opposed to what it can’t.
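Concretely, something in this spirit: the model only ever emits the intent, and a deterministic lookup produces the exact string. (All names here are hypothetical; it's just a sketch of the division of labour, not a real API.)

    # The model never emits the URL; it emits an intent, and a lookup
    # built from a crawl or CMS export produces the precise string.
    import json

    SITEMAP = {
        # loaded from trusted data, never from the model
        "foobar-article": "https://domain.com/this-article-is-about-foobar-123456/",
    }

    def resolve_url(slug: str) -> str:
        """Tool exposed to the model: maps an intent to an exact URL, or refuses."""
        if slug not in SITEMAP:
            raise ValueError(f"unknown slug {slug!r}; refusing to guess")
        return SITEMAP[slug]

    # The model's output stops at the intent:
    tool_call = json.loads('{"tool": "resolve_url", "slug": "foobar-article"}')
    print(resolve_url(tool_call["slug"]))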
Agreed with the points in that article, but IMHO the no. 1 issue is that agents only see a fraction of the code repository. They don't know whether there is a helper function they could use, so they re-implement it. When contributing to UIs, they can't check the whole UI to identify common design patterns, so they re-invent them.
The most important task for the human using the agent is to provide the right context. "Look at this file for helper functions", "do it like that implementation", "read this doc to understand how to do it"... you can get very far with agents when you provide them with the right context.
(BTW another issue is that they have problems navigating the directory structure in a large mono repo. When the agent needs to run commands like 'npm test' in a sub-directory, it almost never gets it right the first time.)
On a more important level, I've found that they still do really badly at even a mildly complex task without extreme babysitting.
I wanted it to refactor a parser in a small project (2.5K lines total) because it'd gotten a bit too interconnected. It made a plan, which looked reasonable, so I told it to do this in stages, with checkpoints.
It said it'd done so. I asked it "so is the old architecture also removed?" "No, it has not been removed." "Is the new structure used in place of the old one?" "No, it is not."
After it did so, 80% of the test suite failed because nothing it'd written was actually right.
I did so three times with increasingly more babysitting, but it failed at the abstract task of "refactor this" no matter what, with pretty much the same failure mode. I feel like I have to tell it exactly to make changes X and Y to class Z, remove class A etc etc, at which point I can't let it do stuff unsupervised, which is half of the reason for letting an LLM do this in the first place.
> I wanted it to refactor a parser in a small project
This expression tree parser (typescript to sql query builder - https://tinqerjs.org/) has zero lines of hand-written code. It was made with Codex + Claude over two weeks (part-time on the side). Having worked on ORMs previously, it would have taken me 4x-10x the time to get to the same state (which also has 100s of tests, with some repetitions). That's a massive saving in time.
I did not have to babysit the LLMs at all. So the answer is, I think, that it depends on what you use it for, and how you use it. Like every tool, it takes a really long time to find a process that works for you. In my conversations with other developers who use LLMs extensively, they all have their own unique, custom workflows. All of them, however, focus on test suites, documentation, and review processes.
I have tried several. Overall I've now settled on strict TDD (which it still seems not to do unless I explicitly tell it to, even though I have it as a hard requirement in claude.md).
Claude forgets claude.md after a while, so you need to keep reminding it. I find that Codex does a better job of design than Claude at the moment, but it's 3x slower, which I don't mind.
Hum yeah, it shows. Just the fact that the API looks completely different for Postgre and SQLite tells us everything we need to know about the quality of the project here.
> Just the fact that the API looks completely different for Postgre and SQLite tells us everything we need to know about the quality of the project here.
How does the API look completely different for pg and sqlite? Can you share an example?
It's an implementation of LINQ's IQueryable. With some bells missing in DotNet's Queryable, like Window functions (RANK queries etc) which I find quite useful.
Add: What you've mentioned is largely incorrect. But in any case, it is a query builder, meaning an ORM-like database abstraction is not the goal. This allows us to support pg's extensions, which aren't applicable to other databases.
I guess the interesting question is whether @jeswin could have created this project at all if AI tools were not involved. And if yes, would the quality even be better?
>I feel like I have to tell it exactly to make changes X and Y to class Z, remove class A etc etc, at which point I can't let it do stuff unsupervised, which is half of the reason for letting an LLM do this in the first place.
The reason had better turn into "it can do stuff faster than I ever could if I give it step-by-step, high-level instructions" instead.
That would be a solution, yes. But currently it feels extremely borked from a UX perspective. It purports to be able to do this, but when you tell it to it breaks in unintuitive ways.
I hate this idea of "well you just need to understand all the arcane ways in which to properly use it to its proper effects".
It's like a car which has a gear shifter, but that's not fully functional yet, so instead you switch gear by spelling out in morse code the gear you want to go into using L as short and R as long. Furthermore, you shouldn't try to listen to 105-112 on the FM band on the radio, because those frequencies are used to control the brakes and ABS and if you listen to those frequencies the brakes no longer work.
We would rightfully stone any engineer who'd design this and then say "well obvious user error" when the user rightfully complains that they crash whenever they listen to Arrow FM.
Might be related to what the article was talking about. AI can't cut-paste. It deletes the code and then regenerates it at another location instead of cutting and pasting.
Obviously the generated code drifts a little from the deleted code.
This feels like a classic Sonnet issue. From my experience, Opus or GPT-5-high are less likely to do the "narrow instruction following without making sensible wider decisions based on context" than Sonnet.
I was hoping that LLMs being able to access strict tools, like Gemini using Python libraries, would finally give reliable results.
So today I asked Gemini to simplify a mathematical expression with sympy. It did and explained to me how some part of the expression could be simplified wonderfully as a product of two factors.
But it was all a lie. Even though I explicitly asked it to use sympy in order to avoid such hallucinations and get results that are actually correct, it used its own flawed reasoning on top and again gave me a completely wrong result.
You still cannot trust LLMs. And that is a problem.
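The only thing that works for me now is checking the claim in sympy myself instead of trusting the model's prose. A minimal sketch (the expressions here are just stand-ins):

    import sympy as sp

    x, y = sp.symbols("x y")
    original = x**2 + 2*x*y + y**2   # stand-in for the real expression
    claimed  = (x + y)**2            # what the model claims it simplifies to

    # The difference simplifies to 0 iff the two expressions are identical.
    assert sp.simplify(original - claimed) == 0, "the 'simplification' is wrong"
    print("verified:", sp.factor(original))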
From the article:
> I contest the idea that LLMs are replacing human devs...
AI is not able to replace good devs. I am assuming that nobody sane is claiming such a thing today. But, it can probably replace bad and mediocre devs. Even today.
In my org we had 3 devs who went through a 6-month code boot camp and got hired a few years ago when it was very difficult to find good devs. They struggled. I would give them easy tasks and then clean up their PRs during review. And then AI tools got much better and started outperforming these guys. We had to let two go. And the third one quit on his own.
We still hire devs. But have become very reluctant to hire junior devs. And will never hire someone from a code boot camp. And we are not the only ones. I think most boot camps have gone out of business for this reason.
Will AI tools eventually get good enough to start replacing good devs? I don't know. But the data so far shows that these tools keep getting better over time. Anybody who argues otherwise has their head firmly stuck in the sand.
In early US history, approximately 90% of the population was involved in farming. Over the years things changed. Now about 2% has anything to do with farming. Fewer people are farming now, but we have a lot more food and a larger variety available. Technology made that possible.
It is totally possible that something like that could happen to the software development industry as well. How fast it happens depends entirely on how fast the tools improve.
> LLMs don’t copy-paste (or cut and paste) code. For instance, when you ask them to refactor a big file into smaller ones, they’ll "remember" a block or slice of code, use a delete tool on the old file, and then a write tool to spit out the extracted code from memory. There are no real cut or paste tools. Every tweak is just them emitting write commands from memory. This feels weird because, as humans, we lean on copy-paste all the time.
There is not that much copy/paste that happens as part of refactoring, so it leans on context recall. It's not entirely clear if providing an actual copy/paste command is particularly useful; at least from my testing it does not do much. More interesting are repetitive changes that clog up the context. Those you can improve on if you have `fastmod` or some similar tool available: you can instruct codex or claude to perform the edits with it.
> And it’s not just how they handle code movement -- their whole approach to problem-solving feels alien too.
It is, but if you go back and forth to work out a plan for how to solve the problem, then the approach greatly changes.
To use another example, with my IDE I can change a signature or rename something across multiple files basically instantly. But an LLM agent will take multiple minutes to do the same thing and doesn't get it right.
> How is it not clear that it would be beneficial?
There is reinforcement learning on the Anthropic side for a text edit tool, which is built in a way that does not lend itself to copy/paste. If you use a model like the GPT series then there might not be reinforcement learning for text editing (I believe, I don't really know), but it operates on line-based replacements for the most part and for it to understand what to manipulate it needs to know the content in the context. When you try to give it a copy/paste buffer it does not fully comprehend what the change in the file looks like after the operation.
So it might be possible to do something with copy/paste, but I did not find it to be very obvious how you make that work with an agent, given that it needs to read the file into context anyways and its recall capabilities are surprisingly good.
> To use another example, with my IDE I can change a signature or rename something across multiple files basically instantly.
So yeah, that's the more interesting case and there things like codemod/fastmod are very effective if you tell an agent to use it. They just don't reach there.
I think copy/paste can alleviate context explosion. Basically the model can keep track of what the code block contains and access it at any time, without needing to "remember" it in context.
I see a pattern in these discussions all the time: some people say how very, very good LLMs are, and others say how LLMs fail miserably; almost always the first group presents examples of simple CRUD apps, frontend "represent data using some JS-framework" kind of tasks, while the second group presents examples of non-trivial refactoring, stuff like parsers (in this thread), algorithms that can't be found in leetcode, etc.
Tech twitter keeps showing "one-shotting full-stack apps" or "games", and it's always something extremely banal. It's impressive that a computer can do it on its own, don't get me wrong, but it was trivial to programmers, and now it is commoditized.
The function of technological progress, looked at through one lens, is to commoditise what was previously bespoke. LLMs have expanded the set of repeatable things. What we're seeing is people on the one hand saying "there's huge value in reducing the cost of producing rote assets", and on the other "there is no value in trying to apply these tools to tasks that aren't repeatable".
Yesterday, I got Claude Code to make a script that tried out different point clustering algorithms and visualised them. It made the odd mistake, which it then corrected with help, but broadly speaking it was amazing. It would've taken me at least a week to write by hand, maybe longer. It was writing the algorithms itself, definitely not just simple CRUD stuff.
I also got good results for “above CRUD” stuff occasionally. Sorry if I wasn’t clear, I meant to primarily share an observation about vastly different responses in discussions related to LLMs. I don’t believe LLMs are completely useless for non-trivial stuff, nor I believe that they won’t get better. Even those two problems in the linked article: sure, those actions are inherently alien to the LLM’s structure itself, but can be solved with augmentation.
In my experience it's been great to have LLMs for narrowly-scoped tasks, things I know how I'd implement (or at least start implementing) but that would be tedious to manually do, prompting it with increasingly higher complexity does work better than I expected for these narrow tasks.
Whenever I've attempted to actually do the whole "agentic coding" thing by giving it a complex task, breaking it down into sub-tasks, loading up context, reworking the plan file when something goes awry, trying again, etc., it hasn't a single fucking time done the thing it was supposed to do to completion. It requires so much manual reviewing, backtracking, and nudging that it becomes more exhausting than just doing most of the work myself and pushing the LLM to do the tedious parts.
It does sometimes work to use it for analysis and to ask it to suggest changes with the reasoning but not implement them, since most times when I let it try to implement its broad suggestions it went haywire, requiring me to pull back and restart.
There's a fine line to walk, and I only see comments at the extremes online: it's either "I let 80 agents run and they build my whole company's code" or "they fail miserably on every task harder than CRUD". I tend not to believe either extreme, at least not for the kinds of projects I work on, which require more context than I could ever fit properly beforehand into these robots.
It might be a meme project, but it's still impressive as hell we're here.
I learned about this from a YouTube content creator who took that repo, asked cc to "make it so that variables can be emojis", and cc did that $5 later. Pretty cool.
Novel as in "an LLM can maintain coherence on a 100k+ LoC project written in zig"? Yeah, that's absolutely novel in this space. This wasn't possible 1 year ago. And this was fantasy 2.5 years ago when chatgpt launched.
Also impressive in that cc "drove" this from a simple prompt. Also impressive that cc can do stuff in this 1M+ (lots of js in the extensions folders?) repo. Lots of people claim LLMs are useless in high LoC repos. The fact that cc could navigate a "new" language and make "variables as emojis" work is again novel (i.e. couldn't be done 1 year ago) and impressive.
Most developers are also bad at asking questions. They tend to assume too many things from the start.
In my 25 years of software development I could apply the second critique to over half of the developers I knew. That includes myself for about half of that career.
But, just like lots of people expect/want self-driving to outperform humans even on edge cases in order to trust them, they also want "AI" to outperform humans in order to trust it.
So: "humans are bad at this too" doesn't have much weight (for people with that mindset).
If we had a knife that most of the time cuts a slice of bread like the bottom p50 of humans cutting a slice of bread with their hands, we wouldn't call the knife useful.
Ok, this example is probably too extreme; replace the knife with an industrial machine that cuts bread vs a human with a knife. Nobody would buy that machine either if it worked like that.
Agreed in a general sense, but there's a bit more nuance.
If a knife slices bread like a normal human at p50, it's not a very good knife.
If a knife slices bread like a professional chef at p50, it's probably a very decent knife.
I don't know if LLMs are better at asking questions than a p50 developer. In my original comment I wanted to raise the question of whether the fact that LLMs are not good at asking questions makes them still worse than human devs.
The first LLM critique in the original article is that they can't copy and paste. I can't argue with that. My 12-year-old copies and pastes better than top coding agents.
The second critique says they can't ask questions. Since many developers also are not good at this, how does the current state of the art LLM compare to a p50 developer in this regard?
I think this is still too extreme. A machine that cuts and preps food at the same level as a 25th percentile person _being paid to do so_, while also being significantly cheaper would presumably be highly relevant.
Your p25 employee is probably much closer to your p95 employee than to the p50 "standard" human, so yeah, I think you have a point there.
But at least in food prep, p25 would already be pretty damn hard to achieve. That's a hell of a lot of autonomy and accuracy (at least in my restaurant kitchen experience which is admittedly just one year in "fine dining"-ish kitchens).
I'd say the p25 of software or SRE folks I've worked with is also a pretty high bar to hit, too, but maybe I've been lucky.
The copy-paste thing is interesting because it hints at a deeper issue: LLMs don't have a concept of "identity" for code blocks—they just regenerate from learned patterns. I've noticed similar vibes when agents refactor—they'll confidently rewrite a chunk and introduce subtle bugs (formatting, whitespace, comments) that copy-paste would've preserved. The "no questions" problem feels more solvable with better prompting/tooling though, like explicitly rewarding clarification in RLHF.
I feel like it’s the opposite: the copy-paste issue is solvable; you just need to equip the model with the right tools and make sure they are trained on tasks where that’s unambiguously the right thing to do (for example, cases where copying code “by hand” would be extremely error prone -> leads to lower reward on average).
On the other hand, teaching the model to be unsure and ask questions, requires the training loop to break and bring a human input in, which appears more difficult to scale.
> On the other hand, teaching the model to be unsure and ask questions, requires the training loop to break and bring a human input in, which appears more difficult to scale.
The ironic thing to me is that the one thing they never seem to be willing to skip asking about is whether they should proceed with some fix that I just helped them identify. They seem extremely reluctant to actually ask about things they don't know about, but extremely eager to ask about whether they should do the things they already have decided they think are right!
Codex has got me a few times lately, doing what I asked but certainly not what I intended:
- Get rid of these warnings "...": captures and silences warnings instead of fixing them
- Update this unit test to reflect the changes "...": changes the code so the outdated test works
- The argument passed is now wrong: catches the exception instead of fixing the argument
My advice is to prefer small changes and read everything it does before accepting anything; often this means using the agent is actually slower than just coding...
Retrospectively fixing a test to pass given the current code is a complex task. Instead, you can ask it to write a test that tests the intended behaviour, without needing to infer it.
“The argument passed is now wrong” - you’re asking the LLM to infer that there’s a problem somewhere else, and to find and fix it.
When you’re asking an LLM to do something, you have to be very explicit about what you want it to do.
I'd argue LLM coding agents are still bad at many more things. But to comment on the two problems raised in the post:
> LLMs don’t copy-paste (or cut and paste) code.
The article is confusing the architectural layers of AI coding agents. It's easy to add "cut/copy/paste" tools to the AI system if that shows improvement. This has nothing to do with LLM, it's in the layer on top.
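For example, the tool layer could expose something like the following (a purely hypothetical schema in the style of function-calling tool definitions, not any vendor's actual API): the model names a range and a clipboard slot instead of re-typing the code.

    # Hypothetical tool definitions for the agent layer, not a real vendor API.
    CUT_TOOL = {
        "name": "cut_lines",
        "description": "Remove lines [start, end] from `path` and store them "
                       "byte-for-byte in a named clipboard slot.",
        "parameters": {
            "type": "object",
            "properties": {
                "path":  {"type": "string"},
                "start": {"type": "integer"},
                "end":   {"type": "integer"},
                "slot":  {"type": "string"},
            },
            "required": ["path", "start", "end", "slot"],
        },
    }

    PASTE_TOOL = {
        "name": "paste_lines",
        "description": "Insert the clipboard slot verbatim into `path` after "
                       "line `after`; the model never re-emits the text itself.",
        "parameters": {
            "type": "object",
            "properties": {
                "path":  {"type": "string"},
                "after": {"type": "integer"},
                "slot":  {"type": "string"},
            },
            "required": ["path", "after", "slot"],
        },
    }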
> Good human developers always pause to ask before making big changes or when they’re unsure [LLMs] keep trying to make it work until they hit a wall -- and then they just keep banging their head against it.
Agreed - LLMs don't know how to back track. The recent (past year) improvements in thinking/reasoning do improve in this regard (it's the whole "but wait..." RL training that exploded with OpenAI o1/o3 and DeepSeek R1, now done by everyone), but clearly there's still work to do.
I recently asked an LLM to fix an Ethernet connection while I was logged into the machine through another connection. Of course, I explicitly told the LLM not to break that connection. But, as you can guess, in the process it did break the connection.
If an LLM can't do sysadmin stuff reliably, why do we think it can write quality code?
I've found codex to be better here than Claude. It has stopped many times and said hey you might be wrong. Of course this changes with a larger context.
Claude is just chirping away "You're absolutely right" and making me turn on caps lock when I talk to it, and it's not even noon yet.
I fully resonate with point #2. A few days ago, I was stuck trying to implement some feature in a C++ library, so I used ChatGPT for brainstorming.
ChatGPT proposed a few ideas, all apparently reasonable, and then it advocated for one that was presented unambiguously as the "best". After a few iterations, I realized that its solution would have required a class hierarchy where the base class contained a templated virtual function, which is not allowed in C++. I pointed this out to ChatGPT and asked it to rethink the solution; it then immediately advocated for the other approach it had initially suggested.
Editing tools are easy to add; it’s just that you have to pick which things to give them, because with too many they struggle, and it uses up a lot of context. Still, as costs come down, multiple steps to look for tools become cheaper too.
I’d like to see what happens with better refactoring tools, I’d make a bunch more mistakes copying and retyping or using awk. If they want to rename something they should be able to use the same tooling the rest of us get.
Asking questions is a good point, but that’s partly a matter of prompting, and I think the move to having more parallel work makes it less relevant. One of the reasons clarifying things more upfront is useful is that we take a lot of time and cost a lot of money to build things, so the economics favour getting it right the first time. As the time comes down and the cost drops to near zero, the balance changes.
There are also other approaches: clarify more of what you want and how to do it first, break that down into tasks, then let it run with those (spec kit). This is an interesting area.
Coding agents tend to assume that the development environment is static and predictable, but real codebases are full of subtle, moving parts - tooling versions, custom scripts, CI quirks, and non-standard file layouts.
Many agents break down not because the code is too complex, but because invisible, “boring” infrastructure details trip them up. Human developers subconsciously navigate these pitfalls using tribal memory and accumulated hacks, but agents bluff through them until confronted by an edge case. This is why even trivial tasks intermittently fail with automation agents: you’re fighting not logic errors, but mismatches with the real lived context. Upgrading this context-awareness would be a genuine step change.
Yep. One of the things I've found agents always have a lot of trouble with is anything related to OpenTelemetry. There's a thing you call that uses some global somewhere, there's a docker container or two, and there are the timing issues. It takes multiple tries to get anything right. Of course this is hard for a human too if you haven't used otel before...
The issue is partly that some expect a fully fledged app or a full problem solution, while others want incremental changes. To some extent this can be controlled by setting the rules in the beginning of the conversation. To some extent, because the limitations noted in the blog still apply.
LLMs are great at asking questions if you ask them to ask questions. Try it: "before writing the code, ask me about anything that is nuclear or ambiguous about the task".
In Claude Code, it always shows the diff between current and proposed changes and I have to explicitly allow it to actually modify the code. Doesn’t that “fix” the copy-&-paste issue?
Has anyone had success getting a coding agent to use an IDE's built-in refactoring tools via MCP especially for things like project-wide rename? Last time I looked into this the agents I tried just did regex find/replace across the repo, which feels both error-prone and wasteful of tokens. I haven't revisited recently so I'm curious what's possible now.
That's interesting, and I haven't, but as long as the IDE has an API for the refactoring action, giving an agent access to it as a tool should be pretty straightforward. Great idea.
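Something in the spirit of this sketch, assuming the official MCP Python SDK's FastMCP helper; the rename hook itself is left as a stub, since wiring it to a real IDE/LSP rename is the actual work:

    from mcp.server.fastmcp import FastMCP

    mcp = FastMCP("ide-refactor")

    def run_lsp_rename(file: str, line: int, column: int, new_name: str) -> str:
        # Stub: wire this to your editor's or LSP client's rename action.
        raise NotImplementedError("hook up a real LSP/IDE rename here")

    @mcp.tool()
    def rename_symbol(file: str, line: int, column: int, new_name: str) -> str:
        """Project-wide, syntax-aware rename via the IDE/LSP instead of regex."""
        return run_lsp_rename(file, line, column, new_name)

    if __name__ == "__main__":
        mcp.run()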
“They’re still more like weird, overconfident interns.”
Perfect summary. LLMs can emit code fast but they don’t really handle code like developers do — there’s no sense of spatial manipulation, no memory of where things live, no questions asked before moving stuff around. Until they can “copy-paste” both code and context with intent, they’ll stay great at producing snippets and terrible at collaborating.
This is exactly how we describe them internally: the smartest interns in the world. I think it's because the chat box way of interacting with them is also similar to how you would talk to someone who just joined a team.
"Hey it wasn't what you asked me to do but I went ahead and refactored this whole area over here while simultaneously screwing up the business logic because I have no comprehension of how users use the tool". "Um, ok but did you change the way notifications work like I asked". "Yes." "Notifications don't work anymore". "I'll get right on it".
@kixpanganiban Do you think it would work if, for refactoring tasks, we took away OpenAI's `apply_patch` tool and just provided `cut` and `paste` for the first few steps?
I can run this experiment using the ToolKami[0] framework if there is enough interest or if someone can give some insights.
[0]: https://github.com/aperoc/toolkami
As a UX designer I see they lack the ability to be opinionated about a design piece and instead go with the standard mental model. I got fed up with this and wrote some simple JavaScript to run a canvas on localhost so I can pass on more subjective feedback using highlights and notes. I tried using Playwright first, but a) it's token heavy, and b) it's still for finding what's working or breaking rather than thinking deeply about the design.
I just ran into this issue with Claude Sonnet 4.5: I asked it to copy/paste some constants from one file to another, a bigger chunk of code, and it instead "extracted" pieces and named them accordingly. As a last resort, after going back and forth, it agreed to do a file copy by running a system command. I was surprised that of all the programming tasks, copy/paste felt challenging for the agent.
Point #2 cracks me up because I do see with JetBrains AI (no fault of JetBrains mind you) the model updates the file, and sometimes I somehow wind up with like a few build errors, or other times like 90% of the file is now build errors. Hey what? Did you not run some sort of what if?
Doing hard things that aren't greenfield? Basically any difficult and slightly obscure question I get stuck with and hope the collective wisdom of the internet can solve?
You don't learn new languages/paradigms/frameworks by inserting them into an existing project.
LLMs are especially tricky because they do appear to work magic on a small greenfield, and the majority of people are doing clown-engineering.
But I think some people are underestimating what can be done in larger projects if you do everything right (eg docs, tests, comments, tools) and take time to plan.
My biggest issue with LLMs right now is that they're such spineless yes men. Even when you ask their opinion on if something is doable or should it be done in the first place, more often than not they just go "Absolutely!" and shit out a broken answer or an anti-pattern just to please you. Not always, but way too often. You need to frame your questions way too carefully to prevent this.
Maybe some of those character.ai models are sassy enough to have stronger opinions on code?
Another place where LLMs have a problem is when you ask them to do something that can't be done by duct-taping a bunch of Stack Overflow posts together. E.g., I've been vibe coding in TypeScript on Deno recently. For various reasons, I didn't want to use the standard Express + Node stack, which is what most LLMs seem to prefer for web apps. So I ran into issues with Replit and Gemini failing to handle the subtle differences between Node and Deno when it comes to serving HTTP requests.
LLMs also have trouble figuring out that a task is impossible. I wanted boilerplate code that rendered a mesh in Three.js using GL_TRIANGLE_STRIP because I was writing a custom shader and needed to experiment with the math. But Three.js doesn't support GL_TRIANGLE_STRIP rendering, for architectural reasons. Grok, ChatGPT, and Gemini all hallucinated a GL_TRIANGLE_STRIP rendering API rather than telling me about this, and I had to Google the problem myself.
It feels like current Coding LLMs are good at replacing junior engineers when it comes to shallow but broad tasks like creating UIs, modifying examples available on the web, etc. But they fail at senior-level tasks like realizing that the requirements being asked of them aren't valid and doing something that no one has done in their corpus of training data.
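(For reference, the kind of answer they could have given is just "unroll the strip into independent triangles yourself". A rough sketch of the index math, language incidental:)

    def strip_to_triangles(strip):
        """Unroll triangle-strip indices into independent triangles,
        flipping the winding of every odd triangle."""
        tris = []
        for i in range(2, len(strip)):
            if i % 2 == 0:
                tris.append((strip[i - 2], strip[i - 1], strip[i]))
            else:
                tris.append((strip[i - 1], strip[i - 2], strip[i]))
        return tris

    print(strip_to_triangles([0, 1, 2, 3, 4]))
    # [(0, 1, 2), (2, 1, 3), (2, 3, 4)]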
> LLMs are terrible at asking questions. They just make a bunch of assumptions and brute-force something based on those guesses.
I don't agree with that. When I am telling Claude Code to plan something, I also mention that it should ask questions when information is missing. The questions it comes up with are really good, sometimes about cases I simply didn't see. To me the planning discussion doesn't feel much different from one in a GitLab thread, only at a much higher iteration speed.
4/5 times when Claude is looking for a file, it starts by running bash(dir c:\test /b)
First it gets an error because bash doesn’t understand \
Then it gets an error because /b doesn’t work
And as LLMs don’t learn from their mistakes, it always spends at least half a dozen tries (e.g. bash(cmd.exe /c dir c:\test /b )) before it figures out how to list files
If it was an actual coworker, we’d send it off to HR
Most models struggle in a Windows environment. They are trained on a lot of Unixy commands and not as much on Windows and PowerShell commands. It was frustrating enough that I started using WSL for development when using Windows. That helped me significantly.
I am guessing this because:
1. Most of the training material online references Unix commands.
2. Most Windows devs are used to GUIs for development using Visual Studio etc. GUIs are not as easy to train on.
Side note:
Interesting thing I have noticed in my own org is that devs with Windows background strictly use GUIs for git. The rest are comfortable with using git from the command line.
It's apparently lese-Copilot to suggest this these days, but you can find very good hypothesizing and problem solving if you talk conversationally to Claude or probably any of its friends that isn't the terminally personality-collapsed SlopGPT (with or without showing it code, or diagrams); it's actually what they're best at, and often they're even less likely than human interlocutors to just parrot some set phrase at you.
It's only when you take the tech out of the area it's good at and start trying to get it to "write code" or even worse "be an agent" that it starts cracking up and emitting garbage; this is only done because companies want to forcememe some kind of product besides "chatbot", whether or not it makes sense. It's a shame because it'll happily and effectively write the docs that don't exist but you wish did for more or less anything. (Writing code examples for docs is not a weak point at all.)
> The LLM had HALLUCINATED most of the path part of the URLs!
This was allowed to go to master without "git diff" after Codex was done?
> Very few humans fill in their task with made up crap then lie about it.
Can you spot the next problem introduced by this?
> Quite frankly, not being able to discuss the pros and the cons of a technology with other engineers absolutely hinders innovation. A lot of discoveries come out of mistakes.
Stop being so small minded.
Perhaps you’ve been sold a lie?
> You have to be able to see what this thing can actually do, as opposed to what it can't.
But all code is "long precise strings".
Unless of course the management says "from now on you will be running with scissors and your performance will increase as a result".
> The most important task for the human using the agent is to provide the right context.
Perhaps "before implementing a new utility or helper function, ask the not-invented-here tool if it's been done already in the codebase"
Of course, now I have to check if someone has done this already.
> AI can't cut-paste. It deletes the code and then regenerates it at another location instead of cutting and pasting.
I have seen similar failure modes in Cursor and VSCode Copilot (using gpt5) where I have to babysit relatively small refactors.
> some people say how very, very good LLMs are, and others say how LLMs fail miserably
Both are right.
> Whenever I've attempted to actually do the whole "agentic coding" thing by giving it a complex task, it hasn't a single fucking time done the thing it was supposed to do to completion.
How about a full programming language written by cc "in a loop" in ~3 months? With a compiler and stuff?
https://cursed-lang.org/
> I learned about this from a YouTube content creator who took that repo, asked cc to "make it so that variables can be emojis", and cc did that $5 later.
Impressive nonetheless.
> The fact that cc could navigate a "new" language and make "variables as emojis" work is again novel (i.e. couldn't be done 1 year ago) and impressive.
Absolutely. I do not underestimate this.
So: "humans are bad at this too" doesn't have much weight (for people with that mindset).
It makes sense to me, at least.
why overengineer? it's super simple
I just do this for 60% of my prompts: "{long description of the feature}, please ask 10 questions before writing any code"
> Get rid of these warnings "...": captures and silences warnings instead of fixing them
“Fix the issues causing these warnings”
> If an LLM can't do sysadmin stuff reliably, why do we think it can write quality code?
LLMs will gladly go along with bad ideas that any reasonable dev would shoot down.
> Claude is just chirping away "You're absolutely right" and making me turn on caps lock when I talk to it, and it's not even noon yet.
You can't fix it.
> Maybe some of those character.ai models are sassy enough to have stronger opinions on code?
So there's hope.
But often they just delete and recreate the file, indeed.
> Grok, ChatGPT, and Gemini all hallucinated a GL_TRIANGLE_STRIP rendering API rather than telling be about this and I had to Google the problem myself.
Typo or trolling the next LLM to index HN comments?
> When I am telling Claude Code to plan something, I also mention that it should ask questions when information is missing.
Oh, sorry. You already said that. :D
_Did you ask it to ask questions?_