Blog

Updated LLM Coding Workflow

Back in January I posted about how I view LLMs, which included my workflow for LLM-assisted coding. To summarize, my workflow was:

  1. Reverse rubber-ducking
  2. Planning and writing a spec file
  3. Implement each phase of the plan, one by one
  4. Validation and commit

A lot has changed since January. For one, models are significantly better and more reliable. I also feel like I’ve gotten better at steering them. If I look at the above list by itself, without details, it doesn’t feel like things changed so much, but look closer and it’s a whole new world. My current workflow is an evolution of the above.

Before continuing, let me make something clear: this is my professional workflow. It’s what I use to write production code on cloud services.

Phase 0: Reverse Rubber-ducking

I still have this, but I no longer use a chat interface. I start directly with an agent (almost always OpenCode) and I now first get the agent to engage with the code before anything else. Let’s say I need to make a change to the flux capacitors, so I go and tell the agent what I think happens:

This is cloud-service-foo and it handles requests to create farbelizer connectors. I believe it then nimbolizes the farbelizers before sending them to cloud-service-bar that processes them through the flux capacitors. Check what the actual flow is and summarize it for me.

Most of the time, though not always, I know exactly how the flow works, but I do this to sort of prime the LLM for discussing what I want to change. I just found that it tends to work well for me; better than just telling it directly what I want to change.

An indirect effect of doing this is that sometimes it will tell me something that doesn’t match my understanding, so I ask for details to figure out if it’s really something I missed or just something the LLM got wrong.

When I know the LLM has the context, I will say something like

I have an issue where if the farbelizer connector starts with “foo”, then the flux capacitors should suppress the harmonic back-feeding before it reaches the primary gimbal housing.

I often add a little more about what the actual goal is, but it’s something like this. This usually causes the LLM to tell me what it thinks should be done. It will sometimes ask a question or two, but eventually it will give me a solution. Many times the solution is one I know I don’t want due to some constraint and I will tell it. Sometimes I steer it a bit more and tell it what I think we should do.

And then when I’m happy, I’ll say:

Plan a series of independent PRs to implement this. List which ones can be done in parallel vs linearly

That’s it. That’s the entire plan phase now. I no longer need a spec file.
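For what it’s worth, “parallel vs linearly” boils down to layering a dependency graph. The agent works this out on its own; the sketch below only illustrates the idea. The PR names and the `waves` helper are hypothetical, not something from my actual repos.

```go
package main

import (
	"fmt"
	"sort"
)

// waves groups PRs into batches. Every PR in a batch depends only on PRs
// from earlier batches, so the PRs inside one batch can be done in parallel.
// deps maps a PR name to the PRs it depends on.
// It returns nil if there is a cycle or a dependency on an unknown PR.
func waves(deps map[string][]string) [][]string {
	done := map[string]bool{}
	var out [][]string
	for len(done) < len(deps) {
		var batch []string
		for pr, needs := range deps {
			if done[pr] {
				continue
			}
			ready := true
			for _, n := range needs {
				if !done[n] {
					ready = false
					break
				}
			}
			if ready {
				batch = append(batch, pr)
			}
		}
		if len(batch) == 0 {
			return nil // nothing can make progress
		}
		sort.Strings(batch) // stable output; map iteration order is random
		for _, pr := range batch {
			done[pr] = true
		}
		out = append(out, batch)
	}
	return out
}

func main() {
	// Hypothetical PR plan for the farbelizer change.
	deps := map[string][]string{
		"add-config-flag":   nil,
		"add-metrics":       nil,
		"suppress-backfeed": {"add-config-flag"},
		"enable-by-default": {"suppress-backfeed", "add-metrics"},
	}
	for i, batch := range waves(deps) {
		fmt.Printf("wave %d: %v\n", i+1, batch)
	}
}
```

PRs in the same wave can go to parallel agent sessions; the waves themselves run one after another.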

Phase 1: Implementation

Given the list of PRs, I will ask it to implement them either a few in parallel or, if there’s a dependency, one by one. I also now let the LLM agent commit its changes. (If they were in parallel, I also let it push the changes.)

Again, that’s it.

Phase 2: Validation

I then test the work locally. I still insist on doing that because I don’t want to cause a SEV. I’ll then push the branches and review the diffs myself in GitHub before asking others to review: I want to avoid wasting people’s time.

You still have to watch them

I had an interesting interaction with GPT 5.5 a few weeks ago, where it wrote code that was akin to this:

attempt := 0

for {
	attempt++
	if attempt > maxAttempts {
		return errTooManyAttemps
	}
	err := fetchData(ctx)
	if err != nil {
		switch {
		case errors.Is(err, errTimeout):
			fmt.Println("Warning: Timeout occurred. Retrying...")

		case errors.Is(err, context.Canceled):
			fmt.Println(" -> Context was canceled. Exiting...")

		default:
			return err
		}
		
		time.Sleep(100 * time.Millisecond) 
	}
}

When I saw that, I immediately knew it didn’t look right, so I asked the agent about the case with the context.Canceled and it happily explained to me that it would log the error and then return with default:. I said, no, it won’t, that’s not how Go switches work. And it insisted! “I understand your confusion, but because there is no break statement, the code will simply fall through the next case.”

No, it forking won’t! So I told it to prove it by writing a test that returned a context canceled. It did, caught the infinite loop and conceded.
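For reference, here is a minimal sketch of the switch semantics in question. Go runs only the first matching case and moves to the next one only if the branch ends with an explicit `fallthrough`. The sentinel errors and the `retry` helper are hypothetical stand-ins, not the code from that session.

```go
package main

import (
	"context"
	"errors"
	"fmt"
)

// Hypothetical sentinels standing in for the ones in the snippet above.
var (
	errTimeout         = errors.New("timeout")
	errTooManyAttempts = errors.New("too many attempts")
	// errCanceled wraps context.Canceled the way a real call site might.
	errCanceled = fmt.Errorf("fetch aborted: %w", context.Canceled)
)

// branchesRun records every switch branch that executes for err.
// Without an explicit `fallthrough`, exactly one branch runs.
func branchesRun(err error) []string {
	var ran []string
	switch {
	case errors.Is(err, errTimeout):
		ran = append(ran, "timeout")
	case errors.Is(err, context.Canceled):
		ran = append(ran, "canceled")
	default:
		ran = append(ran, "default")
	}
	return ran
}

// retry calls fetch up to maxAttempts times and retries only on timeouts.
// Cancellation and unknown errors are returned immediately.
func retry(maxAttempts int, fetch func() error) error {
	for attempt := 1; attempt <= maxAttempts; attempt++ {
		err := fetch()
		switch {
		case err == nil:
			return nil
		case errors.Is(err, errTimeout):
			continue // retry; a real loop would back off here
		default:
			return err
		}
	}
	return errTooManyAttempts
}

func main() {
	fmt.Println(branchesRun(errCanceled)) // [canceled]

	calls := 0
	err := retry(5, func() error { calls++; return errCanceled })
	fmt.Println(calls, errors.Is(err, context.Canceled)) // 1 true
}
```

The canceled case runs alone and never reaches `default`, which is why the original loop kept sleeping and retrying instead of returning.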

My point? They can still make mistakes. I have to check them.

Conclusion

That said, I will concede that the LLMs are so much better now and that these errors are getting more and more rare. My flow is much quicker than before. I still review code like a caveman, I still make sure the LLM gets what I want it to do. But I basically killed the entire “plan” step. It’s just not needed. And I almost never write code by hand.

Your App Subscription Is Now My Weekend Project

I pay for a lot of small apps. One of them was Wispr Flow for dictation. That’s $14 CAD/month that I was paying until I had a few lazy days visiting my mother. And then on the afternoon of New Year’s Day, I vibecoded Jabber.

Now, don’t get me wrong, Jabber is not “production quality.” I would never sell it as a product or even recommend it to other people, but it does what I needed from Wispr Flow, and it does it exactly the way I want it to. For free.

At work, I’m often asked to make small videos showing some support agent how something works, or sharing some knowledge with new team members, or just a regular demo of something. In the past, I used to use Loom, which costs $15/month. So after creating Jabber, I got excited and vibecoded Reel.

Reel does exactly what I wanted Loom to do: I can record my camera, I can move it around, and I get to trim the video after it’s done (I don’t remember being able to do that with Loom).

Then just yesterday, a friend of mine was telling me how he got tired of paying for Typora and decided to vibecode his own Markdown editor. And that gave me the idea of creating an editor for my blog.

That’s Hugora! Yes, horrible name, but who cares? It’s just for me. I get to edit my Hugo blog just the way I like. It even shows my site theme.

You see the pattern here?

All of these $10/month apps are suddenly a weekend project for me. I’m an engineer, but I have never written a single macOS application. I’ve never even read Swift code in my life, and yet, I now can get an app up and running in a couple of hours. This is crazy.

Last year, a Medium post predicted:

Most standalone apps will be “features, not products” in the long run — easy to copy and bundle into larger offerings.

And I think we’re there. I don’t know what that means for the future of our industry, but it does seem like a big shift.

I’m still skeptical of vibecoding in general. As I mentioned above, I would not trust my vibecoding enough to make these into products. If something goes wrong, I don’t know how to fix it. Maybe my LLM friends can, but I don’t know. But vibecoding is 100% viable for personal stuff like this: we now have apps on demand.

Being Specific when Pairing with Bots

A couple of days ago I posted about my workflow and I poked a little fun at something I do:

I also give it context to save time. The agents nowadays are very smart and can find their way, but I can shortcut that by giving it hints “in internal/foo/foo.go there’s a function called DoFoo() and it does this and that and I want it to do that other thing before that” or whatever. Less tokens, faster iteration. This is probably astrology for nerds, pure superstition at this point, but I still do it.

Turns out, maybe it’s not really astrology for nerds? Today, Quinn Slack shared an article about How To Pair with an Agent and in it, the author says “The more you can specify, the better” and gives this example of a good, specified prompt:

Specified prompt:

Build a new API endpoint for user notifications. Follow the pattern in src/api/messages.ts as your reference. Run the API tests after each step. Don’t move on until they pass.

You gave it a reference to follow and a way to check its own work. Now you can step away. Let the agent iterate until the tests pass.

So maybe it’s not astrology for nerds, but a good practice? I always felt like it gave me better results, but I wasn’t 100% sure if it wasn’t just a superstition. Nice to see it’s probably not.

Thoughts on Amp's ad-supported business model

I’m agent-agnostic, in that I don’t stick to just one. I keep changing from time to time. My earliest forays into agentic programming were with Claude Code, which was then and still is probably the gold standard. But since then I’ve tried quite a few: Codex CLI (good models, barebones agent), Droid (not a fan), OpenCode (big fan!), and Amp.

I’ve been using Amp for a while, but only from time to time to see its evolution. With the move to Opus 4.5, I found that Amp has become very capable and I started using it more and more until it became my go-to agent.

The one downside of Amp is cost. Since it’s not tied to any of the labs, it needs to charge API pricing, which can get expensive if you use it a lot, though maybe less than most people think.

But it’s undeniable that there is a psychological impact to seeing money constantly being drawn as you use it, even if at the end of the day you’d spend the same as with a subscription.

These are challenges the Amp people have been working on for a while. Then not that long ago, they came up with a first attempt: an ad-supported free tier. You could use Amp up to $10 worth of API per day as long as you agreed to see ads on your agent.

To be clear, ads are optional. You only see the ads if you choose to and if you do, you get $10 worth of inference per day, using some cheaper models. This is how the ads appear:

Personally, I find them unobtrusive, but opinions may vary. Now, you may notice that I emphasized the fact that these ads are optional. The reason why I did so is that it appears that the idea of ads hits a nerve with some people. Ever since their free tier came out, I’ve seen several tweets of people announcing their refusal to ever try Amp because they don’t want ads.

Now Amp came up with a next step, where paying customers can also enable Amp Free and get those $10 a day of inference, and their balance will only be drawn from once they exhaust their free allowance. That’s the equivalent of $300 a month of free inference. Not only that, but you can use that free allowance with Opus 4.5.

I understand the aversion to ads. I share it. But $300/month is too much value to ignore. So I decided to enable Amp Free and use Amp exclusively this month to see how much I will have spent by the end of the month. My suspicion is that I’ll spend less than my Claude Max subscription.

But something that I think is very important to remember is that there’s a whole world of engineers outside of the developed world. For a developer in, say, South America, the cost of a Claude or Codex subscription is prohibitive. This is keeping a whole world of engineers out of the LLM revolution. Amp’s approach offers a way for them to have access to premium models they otherwise wouldn’t have access to.

LLMs Are Tools, Not Replacements

I’ve been meaning to write this post for a bit, but never found the right time. I guess this is it. Until sometime last year, I was more or less an AI-skeptic. I say more or less because I was always very interested in the technology. I built my own LLM to learn about it and I thought then, as I do now, that the technology is incredible.

And yet, I had tried using LLMs to help with coding and my experiences were not great. I used LLMs to write one-off scripts for me, and they were very good at that. But whenever I tried to use them to help me write “production code”, they would hallucinate or get stuck in a “bug loop”. I felt like I was spending more time dealing with the aftermath than I’d spend writing it all by hand. I even disabled Copilot autocomplete because I felt like it was distracting.

Fast forward to today and most of my code is written by LLMs. This change came from a combination of how much the tooling improved and the recognition that I was holding it wrong.

Now, don’t get me wrong. This post is not meant to convince anyone of anything. I’m not selling anything here. This post is for engineers who are curious about how others work with LLMs and trying to find their own workflow. I’ll show you exactly how I work now and how it works for me.

The bug that changed my mind

As mentioned, I was a bit of a skeptic. I knew LLMs were good at writing one-off scripts and I was using them a lot for that, but not more than that. Then one day someone asked for help with a bug.

We had this multicell architecture and we had a proxy/multiplexer that would decide where any given request should be routed to. Once that decision was made, the request would be proxied to an ALB using a custom transport. The ALB had resource mappings to know where inside a given cell things were hosted, so the custom transport requested a URL from the ALB and the ALB responded with a redirect to the actual destination inside the cell it belonged to. The custom transport would then re-issue the request to the correct destination.

The bug: seemingly at random, some requests would succeed and some would not and no one could figure out why. So I started looking and quickly found that it wasn’t random at all: requests with bodies would fail. When I saw that, I immediately thought it was the custom transport eating the body, except I remembered writing that transport and found it hard to believe the issue was there. And upon looking at the code, it seemed fine. I added logging and went about trying to reproduce the issue. The code seemed correct, but the issue was still there.

After a while, I decided to try Claude Code. I launched it on the repo and explained the problem. I’ll admit I did not have high expectations, but hoped that maybe it could give me some insight that would help. To my surprise, in about 40s it came back saying it had found the issue: the transport was eating the request bodies. My first reaction was frustration because I knew I had already looked at it and the issue was not there. I thought Claude was being dumb. Except I noticed it was showing code that didn’t look like what I was looking at. Long story short: at some point, someone had copied and pasted some code and added a second custom transport somewhere it shouldn’t have been, and that transport had a bug.
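The actual buggy transport isn’t shown here, so the following is only a sketch of one common way this class of bug shows up in Go: a request body is a stream that can be read once, so re-sending the same request after a redirect delivers an empty body unless it is rewound through `GetBody`. The `send`, `naiveResend`, `safeResend`, and `newRequest` helpers and the URL are hypothetical, and `send` merely stands in for a network hop.

```go
package main

import (
	"errors"
	"fmt"
	"io"
	"net/http"
	"strings"
)

// newRequest builds a POST with a replayable body. http.NewRequest sets
// GetBody automatically when the body is a *strings.Reader.
func newRequest(body string) *http.Request {
	req, err := http.NewRequest(http.MethodPost, "http://cell.example/resource", strings.NewReader(body))
	if err != nil {
		panic(err) // only fails on a malformed method or URL
	}
	return req
}

// send stands in for one network hop: it drains the body and returns
// what the receiving side would have seen.
func send(req *http.Request) string {
	if req.Body == nil {
		return ""
	}
	defer req.Body.Close()
	b, _ := io.ReadAll(req.Body)
	return string(b)
}

// naiveResend sends the request to the ALB and then again to the redirect
// target, reusing the already-drained body.
func naiveResend(req *http.Request) (first, second string) {
	first = send(req)
	second = send(req) // body was consumed by the first hop
	return first, second
}

// safeResend rewinds the body with GetBody before the second hop.
func safeResend(req *http.Request) (first, second string, err error) {
	first = send(req)
	if req.GetBody == nil {
		return first, "", errors.New("body cannot be replayed")
	}
	body, err := req.GetBody()
	if err != nil {
		return first, "", err
	}
	req.Body = body
	second = send(req)
	return first, second, nil
}

func main() {
	_, lost := naiveResend(newRequest("payload"))
	fmt.Printf("naive second hop: %q\n", lost) // ""

	_, kept, err := safeResend(newRequest("payload"))
	fmt.Printf("safe second hop: %q, err: %v\n", kept, err) // "payload", <nil>
}
```

This also matches the symptom: requests without bodies survive the second hop, requests with bodies don’t.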

I didn’t fully convert then, but I started paying more attention. I began using LLMs for debugging and code reviews, things where being wrong was mostly harmless and I could verify the output easily. Over time, that expanded. Now we’re here.

The mistake I made early on

When I first tried AI coding tools, I treated them like code generators. Describe what you want, get code back, paste it in, repeat. This was the intuitive way to use them, and it’s wrong as far as I am concerned.

For those one-off scripts I mentioned before, I recognize now that I was “vibe coding” them, and that was fine because they were only going to be used by me. But I don’t let LLMs write unsupervised code that I need to ship for others. The problem is that generated code requires review. Review requires understanding. If you didn’t think through the implementation yourself, you’re now reading code you don’t fully understand, looking for bugs you can’t anticipate, in an approach you didn’t choose. You’re doing more cognitive work than if you’d just written it yourself, and the code is probably worse.

The mental shift that made everything click for me was that LLMs are tools, just like LSPs were tools, and pre-LLM autocomplete was a tool. They’re not a replacement, but a complement. Think of a junior engineer who has read everything but never built anything: lots of talent, but absolutely not trusted unsupervised.

My workflow

This is how I work with LLMs. I found that this works very well for me. I am aware that it is a much more involved workflow than a lot of people’s.

Phase 0: Reverse Rubber-ducking

I don’t start in an agent. I start in Claude, just chatting.

Before I write any code, I want to understand the domain. If I’m implementing auto-updates for a macOS app, I am asking Claude about how Sparkle works. Not “implement auto-updates for me”, but “how does Sparkle choose when to prompt the user?” or whatever. I want to know the concepts, gotchas, tradeoffs, etc. I often talk about some other app and ask “how does X do this?”

This is basically rubber-ducking in reverse. I’m building my own mental model through conversation. By the time I’m ready to touch the code, I actually understand what I’m about to do. This matters because it means I now can review what the LLM produces. I develop an intuition for what to expect, which in turn lets me quickly spot when something is wrong.

This phase gives me confidence, and that matters. And of course, this is mostly for areas I am not already familiar with. But even when I am familiar, I find that these conversations give me insights or show me what I need to ask when doing the plan.

Phase 1: Plan

Now I move to an agent. Lately I’ve been using Amp, but the specific tool matters less than the process. This could be Claude Code, Codex CLI, etc. My process is tool-agnostic.

I don’t say “build me X.” Instead, I start another conversation, mostly a Q&A. “How would you approach this?”, “What are the steps?”, etc. I challenge it when something sounds off. I often ask the LLM to push back on my ideas if it thinks they’re not good. I may still insist but it’s good to have some pushback here and there. We go back and forth until I’m satisfied with the approach.

Then I ask it to split the plan into the smallest self-contained, testable phases. This is critical. I want each phase to be something I can review, run, and validate before moving on. Those codebase-wide big changes are where things go off the rails.

Finally, I have it write everything to a spec.md file. This serves two purposes: (1) it’s a reference I can point the LLM to if context gets lost, and (2) it’s documentation of what we decided and why. For longer projects, this is how I resume after a break. I also make manual adjustments to this plan when needed, though this is getting more and more rare.

Phase 2: Implement each phase of the plan, one by one

Now the agent starts writing code, one phase at a time.

I watch the diffs as they flow in and because I was part of the planning and did my homework in Phase 0, I know what to expect. A quick glance usually is enough to tell me if it’s writing what we discussed or going off-script. That’s why the prep work matters: review is fast when you understand what you’re looking at.

I also give it context to save time. The agents nowadays are very smart and can find their way, but I can shortcut that by giving it hints “in internal/foo/foo.go there’s a function called DoFoo() and it does this and that and I want it to do that other thing before that” or whatever. Less tokens, faster iteration. This is probably astrology for nerds, pure superstition at this point, but I still do it. (Hi, it’s me, from the future: maybe it’s not astrology?)

Here’s a little trick I’ve started using: cross-agent reviews. Once Amp finishes a phase, I’ll ask Claude Code or Codex to review the diff. Different models and harnesses catch different things. It’s not foolproof, but it’s cheap and occasionally catches something I missed.

Phase 3: Validation, commit, and handoff

Once a phase looks good, I test it. I run it and do what I can to validate it. By this point I’ve mostly reviewed the code, both by myself and using an LLM.

If something is wrong, I iterate with the agent. I point out the problem and let it fix it. This usually works and only very occasionally do I have to take over and fix it myself.

When I’m happy, I commit. This is an easy rollback point if something goes wrong afterwards. At this point I use Amp’s /handoff command to start a fresh context for the next phase. This is a forced boundary: the agent will start clean (though it can reference the previous phase in Amp), it will re-read the spec and we continue. This helps prevent context rot, which is where long sessions start to drift.

Trust Boundaries

I rely on LLMs heavily but I don’t trust them.

These are the lines I don’t let them cross:

  • Nothing ships without my review. I read every line before it goes in. I am too anxious to ship something I don’t understand. That prep work from Phase 0 is not just about understanding, but about making review fast enough that this is sustainable.
  • Don’t let the LLM write tests unsupervised. I learned this one the hard way. When a test fails, LLMs often “fix” the test to make it pass. I’ve heard this is less likely nowadays but I’ve been burned and trust isn’t easily restored. So there. Now I’m extremely careful about letting them modify test code. The only thing I do like to use LLMs for in testing is asking them “do the tests cover the case where this, this, and this happen?” It helps find holes in the coverage.
  • Debugging is still mostly me. This is ironic, given that debugging a bug was my entry point into using LLMs more and more, but I’ve found that for my day-to-day debugging, I’m usually faster on my own. I reach for an LLM if I’m stuck, not as a first resort. Maybe this is muscle memory or maybe the tooling is weaker here. Either way, I don’t force it.
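One cheap way to keep an eye on the test rule is to flag any test file an agent touched so it gets a careful manual read. This is a sketch of the idea, not a tool from my workflow: the `touchedTests` helper and the sample paths are hypothetical, and in practice the list could come from something like `git diff --name-only`.

```go
package main

import (
	"fmt"
	"strings"
)

// touchedTests returns the Go test files among the changed paths.
func touchedTests(changed []string) []string {
	var tests []string
	for _, path := range changed {
		if strings.HasSuffix(path, "_test.go") {
			tests = append(tests, path)
		}
	}
	return tests
}

func main() {
	// Hypothetical list of files changed during an agent session.
	changed := []string{
		"internal/foo/foo.go",
		"internal/foo/foo_test.go",
		"README.md",
	}
	for _, path := range touchedTests(changed) {
		fmt.Println("review by hand:", path)
	}
}
```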

What still doesn’t work well

I want to be honest about the limitations, because the hype around these tools is exhausting.

I don’t think they’re good at complex refactoring across many files. The agent loses the thread. It will make changes that are locally correct but globally inconsistent. For big refactors, I still do a lot of manual work. I feel like the quality of code after an LLM-assisted refactor is not great.

Also, anything requiring deep context about the codebase’s history. Why is this weird workaround here? What’s the implicit contract this function has with its callers? The agent doesn’t know, heck most people don’t either, but whereas a human might be reluctant, LLMs will happily remove that code that seemed inconsequential but that now breaks some contract with a client.

And the final one can be controversial, but I think they’re bad at novel architecture decisions. Don’t get me wrong, ask an LLM to design something and it will, but then you ask it “oh but what if…” and it will immediately “yes good point” and redesign it all. It just goes along with whatever you last said. It doesn’t know how to make decisions. It shouldn’t be surprising given how LLMs work, but our brains tend to anthropomorphize everything and then these things become counterintuitive. So I still have to think about architecture myself.

The Real Lesson

These tools have changed a lot — GPT 5.2 and Opus 4.5 are watershed moments IMO — but not as much as my own approach did. I stopped trying to skip the thinking part and started using LLMs to enhance it. The agent participates in discovery, planning, obviously implementation, and also reviews, but I am still driving.

If you bounced off these tools, it might be worth trying again with a different approach. That’s all I’m saying.

I’ve found that my workflow is more work upfront, but dramatically less work overall. More importantly, it lets me focus on the interesting parts and helps me with the drudgery.