LLMs Are Tools, Not Replacements
I’ve been meaning to write this post for a bit, but never found the right time. I guess this is it. Until sometime last year, I was more or less an AI-skeptic. I say more or less because I was always very interested in the technology. I built my own LLM to learn about it and I thought then, as I do now, that the technology is incredible.
And yet, I had tried using LLMs to help with coding and my experiences were not great. I used LLMs to write one-off scripts for me, and they were very good at that. But whenever I tried to use them to help me write “production code”, they would hallucinate or get stuck in a “bug loop”. I felt like I was spending more time dealing with the aftermath than I would have spent writing it all by hand. I even disabled Copilot autocomplete because I felt like it was distracting.
Fast forward to today and most of my code is written by LLMs. This change happened through a combination of how much the tooling improved and the recognition that I was holding it wrong.
Now, don’t get me wrong. This post is not meant to convince anyone of anything. I’m not selling anything here. This post is for engineers who are curious about how others work with LLMs and are trying to find their own workflow. I’ll show you exactly how I work now and how it works for me.
The bug that changed my mind
As mentioned, I was a bit of a skeptic. I knew LLMs were good at writing one-off scripts and I was using them a lot for that, but not more than that. Then one day someone asked for help with a bug.
We had this multicell architecture with a proxy/multiplexer that would decide where any given request should be routed. Once that decision was made, the request would be proxied to an ALB using a custom transport. The ALB had resource mappings to know where things were hosted inside a given cell, so the custom transport requested a URL from the ALB, and the ALB responded with a redirect to the actual destination inside the cell it belonged to. The custom transport would then re-issue the request to the correct destination.
The bug: seemingly at random, some requests would succeed and some would not, and no one could figure out why. So I started looking and quickly found that it wasn’t random at all: requests with bodies would fail. When I saw that, I immediately suspected the custom transport was eating the body, except I remembered writing that transport and found it hard to believe the issue was there. And upon looking at the code, it seemed fine. I added logging and went about trying to reproduce the issue. The code seemed correct, but the issue was still there.
After a while, I decided to try Claude Code. I launched it on the repo and explained the problem. I’ll admit I did not have high expectations, but I hoped that maybe it could give me some insight that would help. To my surprise, in about 40 seconds it came back saying it had found the issue: the transport was eating the request bodies. My first reaction was frustration, because I knew I had already looked at it and the issue was not there. I thought Claude was being dumb. Except I noticed it was showing code that didn’t look like the code I had been looking at. Long story short: at some point, someone had copied and pasted some code and added a second custom transport somewhere it shouldn’t have been, and that transport had a bug.
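To make the failure mode concrete, here is a minimal Go sketch of that class of bug. This is not the real transport; the type and names are made up. It just shows how a custom RoundTripper that follows a redirect itself can silently drop the request body:

```go
package transport

import "net/http"

// redirectTransport is a hypothetical example of a custom transport that
// follows the ALB-style redirect itself instead of letting the client do it.
type redirectTransport struct {
	base http.RoundTripper
}

func (t *redirectTransport) RoundTrip(req *http.Request) (*http.Response, error) {
	resp, err := t.base.RoundTrip(req) // first attempt drains req.Body
	if err != nil {
		return nil, err
	}
	if resp.StatusCode < 300 || resp.StatusCode > 399 {
		return resp, nil
	}
	loc, err := resp.Location()
	if err != nil {
		return resp, nil
	}
	resp.Body.Close()

	// Bug: Clone copies the already-drained Body as-is, so the redirected
	// request goes out empty. Requests without bodies are unaffected, which
	// is why the failures looked random. A fix is to rebuild the body from
	// req.GetBody() (when it is non-nil) before re-sending.
	redirected := req.Clone(req.Context())
	redirected.URL = loc
	return t.base.RoundTrip(redirected)
}
```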
I didn’t fully convert then, but I started paying more attention. I began using LLMs for debugging and code reviews, things where being wrong was mostly harmless and I could verify the output easily. Over time, that expanded. Now we’re here.
The mistake I made early on
When I first tried AI coding tools, I treated them like code generators. Describe what you want, get code back, paste it in, repeat. This was the intuitive way to use them, and it’s wrong as far as I am concerned.
For those one-off scripts I mentioned before, I recognize now that I was “vibe coding” them. That was fine because they were only ever going to be used by me. But I don’t let LLMs write unsupervised code that I need to ship to others, and that’s where the problem shows up: generated code requires review. Review requires understanding. If you didn’t think through the implementation yourself, you’re now reading code you don’t fully understand, looking for bugs you can’t anticipate, in an approach you didn’t choose. You’re doing more cognitive work than if you’d just written it yourself, and the code is probably worse.
The mental shift that made everything click for me was that LLMs are tools, just like LSPs are tools and pre-LLM autocomplete was a tool. They’re not a replacement, but a complement. Think of a junior engineer who has read everything but never built anything: lots of talent, but absolutely not to be trusted unsupervised.
My workflow
This is how I work with LLMs. I found that this works very well for me. I am aware that it is a much more involved workflow than a lot of people’s.
Phase 0: Reverse Rubber-ducking
I don’t start in an agent. I start in Claude, just chatting.
Before I write any code, I want to understand the domain. If I’m implementing auto-updates for a macOS app, I ask Claude about how Sparkle works. Not “implement auto-updates for me”, but “how does Sparkle choose when to prompt the user?” or whatever. I want to know the concepts, gotchas, tradeoffs, etc. I often bring up some other app and ask “how does X do this?”
This is basically rubber-ducking in reverse. I’m building my own mental model through conversation. By the time I’m ready to touch the code, I actually understand what I’m about to do. This matters because it means I can now review what the LLM produces. I develop an intuition for what to expect, which in turn lets me quickly spot when something is wrong.
This phase gives me confidence, and that matters. Of course, this is mostly for areas I am not already familiar with. But even when I am familiar, I find that these conversations give me insights, or at least the questions I need to ask when doing the plan.
Phase 1: Plan
Now I move to an agent. Lately I’ve been using Amp, but the specific tool matters less than the process: it could be Claude Code, Codex CLI, etc. My process is tool-agnostic.
I don’t say “build me X.” Instead, I start another conversation, mostly a Q&A. “How would you approach this?”, “What are the steps?”, etc. I challenge it when something sounds off. I often ask the LLM to push back on my ideas if it thinks they’re not good. I may still insist, but it’s good to have some pushback here and there. We go back and forth until I’m satisfied with the approach.
Then I ask it to split the plan into the smallest self-contained, testable phases. This is critical. I want each phase to be something I can review, run, and validate before moving on. Those big, codebase-wide changes are where things go off the rails.
Finally, I have it write everything to a spec.md file. This serves two purposes: (1) it’s a reference I can point the LLM to if context gets lost, and (2) it’s documentation of what we decided and why. For longer projects, this is how I resume after a break. I also make manual adjustments to this plan when needed, though this is getting more and more rare.
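The spec doesn’t need to be fancy. A hypothetical skeleton for the Sparkle example above (the headings and details here are made up, not a template I follow religiously) might look like this:

```markdown
# Auto-updates via Sparkle (spec)

## Goal
Add Sparkle-based auto-updates to the macOS app.

## Decisions
- Use Sparkle's default prompting behavior; no custom scheduling.
- Serve the appcast from the existing release infrastructure.

## Phases
1. Add the Sparkle dependency and wire up the updater. Validate: app builds and launches.
2. Point the updater at the appcast and signing key in the app config. Validate: a debug build finds the feed.
3. Hook up a "Check for Updates" menu item. Validate: a manual check works end to end.
```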
Phase 2: Implement each phase of the plan, one by one
Now the agent starts writing code, one phase at a time.
I watch the diffs as they flow in, and because I was part of the planning and did my homework in Phase 0, I know what to expect. A quick glance is usually enough to tell me whether it’s writing what we discussed or going off-script. That’s why the prep work matters: review is fast when you understand what you’re looking at.
I also give it context to save time. The agents nowadays are very smart and can find their way around, but I can shortcut that by giving them hints: “in internal/foo/foo.go there’s a function called DoFoo() and it does this and that and I want it to do that other thing before that”, or whatever. Fewer tokens, faster iteration. This is probably astrology for nerds, pure superstition at this point, but I still do it.
Here’s a little trick I’ve started using: cross-agent reviews. Once Amp finishes a phase, I’ll ask Claude Code or Codex to review the diff. Different models and harnesses catch different things. It’s not foolproof, but it’s cheap and occasionally catches something I missed.
Phase 3: Validation, commit, and handoff
Once a phase looks good, I test it. I run it and do whatever I can to validate it. By this point I’ve mostly reviewed the code already, both myself and with an LLM.
If something is wrong, I iterate with the agent. I point out the problem and let it fix it. This usually works; only very occasionally do I have to take over and fix it myself.
When I’m happy, I commit. This gives me an easy rollback point if something goes wrong afterwards. At this point I use Amp’s /handoff command to start a fresh context for the next phase. This is a forced boundary: the agent starts clean (though in Amp it can reference the previous phase), re-reads the spec, and we continue. This helps prevent context rot, which is when long sessions start to drift.
Trust Boundaries
I rely on LLMs heavily but I don’t trust them.
These are the lines I don’t let them cross:
- Nothing ships without my review. I read every line before it goes in. I am too anxious to ship something I don’t understand. That prep work from Phase 0 is not just about understanding, but about making review fast enough that this is sustainable.
- Don’t let the LLM write tests unsupervised. I learned this one the hard way. When a test fails, LLMs often “fix” the test to make it pass. I’ve heard this is less likely nowadays, but I’ve been burned and trust isn’t easily restored. So there. Now I’m extremely careful about letting them modify test code. The one thing I do like to use LLMs for in testing is asking them “do the tests cover the case where this, this, and this happen?” It helps find holes in the coverage.
- Debugging is still mostly me. This is ironic, given that a bug was my entry point into using LLMs more and more, but I’ve found that for my day-to-day debugging, I’m usually faster on my own. I reach for an LLM if I’m stuck, not as a first resort. Maybe this is muscle memory or maybe the tooling is weaker here. Either way, I don’t force it.
What still doesn’t work well
I want to be honest about the limitations, because the hype around these tools is exhausting.
I don’t think they’re good at complex refactoring across many files. The agent loses the thread. It will make changes that are locally correct but globally inconsistent. For big refactors, I still do a lot of manual work. I feel like the code that comes out of an LLM-assisted refactor just isn’t great.
Also, anything requiring deep context about the codebase’s history. Why is this weird workaround here? What’s the implicit contract this function has with its callers? The agent doesn’t know (heck, most people don’t either), but whereas a human might be reluctant, LLMs will happily remove code that seemed inconsequential and turns out to break some contract with a client.
And the final one may be controversial, but I think they’re bad at novel architecture decisions. Don’t get me wrong, ask an LLM to design something and it will, but then you ask it “oh, but what if…” and it will immediately go “yes, good point” and redesign it all. It just goes along with whatever you last said. It doesn’t know how to make decisions. That shouldn’t be surprising given how LLMs work, but our brains tend to anthropomorphize everything, and then these things become counterintuitive. So I still have to think about architecture myself.
The Real Lesson
These tools have changed a lot — GPT 5.2 and Opus 4.5 are watershed moments IMO — but not as much as my own approach did. I stopped trying to skip the thinking part and started using LLMs to enhance it. The agent participates in discovery, planning, obviously implementation, and also reviews, but I am still driving.
If you bounced off these tools, it might be worth trying again with a different approach. That’s all I’m saying.
I’ve found that my workflow is more work upfront, but dramatically less work overall. More importantly, it lets me focus on the interesting parts and helps me with the drudgery.