LLMs and core software: human driven

It’s clear LLMs are one of the biggest changes in technology ever. The rate of progress is astounding: recently due to a configuration mistake I accidentally used Claude Sonnet 3.5 (released ~2 years ago) instead of Opus 4.6 for a task and looked at the output and thought “what is this garbage”?

But daily now: Opus 4.6 is able to generate reasonable PoC level Rust code for complex tasks for me. It’s not perfect – it’s a combination of exhausting and exhilarating to find the 10% absolutely bonkers/broken code that still makes it past subagents.

So yes I use LLMs every day, but I will be clear: if I could push a button to “un-invent” them I absolutely would because I think the long term issues in larger society (not being able to trust any media, and many of the things from Dario’s recent blog etc.) will outweigh the benefits.

But since we can’t un-invent them: here’s my opinion on how they should be used. As a baseline, I agree with a lot from this doc from Oxide about LLMs. What I want to talk about is especially around some of the norms/tools that I see as important for LLM use, following principles similar to those.

On framing: there’s “core” software vs “bespoke”. An entirely new capability of course is for e.g. a nontechnical restaurant owner to use an LLM to generate (“vibe code”) a website (excepting hopefully online orderings and payments!). I’m not overly concerned about this.

Whereas “core” software is what organizations/businesses provide/maintain for others. I work for a company (Red Hat) that produces a lot of this. I am sure no one would want to run for real an operating system, cluster filesystem, web browser, monitoring system etc. that was primarily “vibe coded”.

And while I respect people and groups that are trying to entirely ban LLM use, I don’t think that’s viable for at least my space.

Hence the subject of this blog is my perspective on how LLMs should be used for “core” software: not vibe coding, but using LLMs responsibly and intelligently – and always under human control and review.

Agents should amplify and be controlled by humans

I think most of the industry would agree we can’t give responsibility to LLMs. That means they must be overseen by humans. If they’re overseen by a human, then I think they should be amplifying what that human thinks/does as a baseline – intersected with the constraints of the task of course.

On “amplification”: Everyone using a LLM to generate content should inject their own system prompt (e.g. AGENTS.md) or equivalent. Here’s mine – notice I turn off all the emoji etc. and try hard to tune down bulleted lists because that’s not my style. This is a truly baseline thing to do.

Now most LLM generated content targeted for core software is still going to need review, but just ensuring that the baseline matches what the human does helps ensure alignment.

Pull request reviews

Let’s focus on a very classic problem: pull request reviews. Many projects have wired up a flow such that when a PR comes in, it gets reviewed by a model automatically. Many projects and tools pitch this. We use one on some of my projects.

But I want to get away from this because in my experience these reviews are a combination of:

  • Extremely insightful and correct things (there’s some amazing fine-tuning and tool use that must have happened to find some issues pointed out by some of these)
  • Annoying nitpicks that no one cares about (not handling spaces in a filename in a shell script used for tests)
  • Broken stuff like getting confused by things that happened after its training cutoff (e.g. Gemini especially seems to get confused by referencing the current date, and also is unaware of newer Rust features, etc)

In practice, we just want the first of course.

How I think it should work:

  • A pull request comes in
  • It gets auto-assigned to a human on the team for review
  • A human contributing to that project is running their own agents (wherever: could be local or in the cloud) using their own configuration (but of course still honoring the project’s default development setup and the project’s AGENTS.md etc)
  • A new containerized/sandboxed agent may be spawned automatically, or perhaps the human needs to click a button to do so – or perhaps the human sees the PR come in and thinks “this one needs a deeper review, didn’t we hit a perf issue with the database before?” and adds that to a prompt for the agent.
  • The agent prepares a draft review that only the human can see.
  • The human reviews/edits the draft PR review, and has the opportunity to remove confabulations, add their own content etc. And to send the agent back to look more closely at some code (i.e. this part can be a loop)
  • When the human is happy they click the “submit review” button.
  • Goal: it is 100% clear what parts are LLM generated vs human generated for the reader.

I wrote this agent skill to try to make this work well, and if you search you can see it in action in a few places, though I haven’t truly tried to scale this up.

I think the above matches the vision of LLMs amplifying humans.

Code Generation

There’s no doubt that LLMs can be amazing code generators, and I use them every day for that. But for any “core” software I work on, I absolutely review all of the output – not just superficially, and changes to core algorithms very closely.

At least in my experience the reality is still there’s that percentage of the time when the agent decided to reimplement base64 encoding for no reason, or disable the tests claiming “the environment didn’t support it” etc.

And to me it’s still a baseline for “core” software to require another human review to merge (per above!) with their own customized LLM assisting them (ideally a different model, etc).

FOSS vs closed

Of course, my position here is biased a bit by working on FOSS – I still very much believe in that, and working in a FOSS context can be quite different than working in a “closed environment” where a company/organization may reasonably want to (and be able to) apply uniform rules across a codebase.

While for sure LLMs allow organizations to create their own Linux kernel filesystems or bespoke Kubernetes forks or virtual machine runtime or whatever – it’s not clear to me that it is a good idea for most to do so. I think shared (FOSS) infrastructure that is productized by various companies, provided as a service and maintained by human experts in that problem domain still makes sense. And how we develop that matters a lot.

Thoughts on agentic AI coding as of Oct 2025

Sandboxed, reviewed parallel agents make sense

For coding and software engineering, I’ve used and experimented with various frontends (FOSS and proprietary) to multiple foundation models (mostly proprietary) trying to keep up with the state of the art. I’ve come to strongly believe in a few things:

  • Agentic AI for coding needs strongly sandboxed, reproducible environments
  • It makes sense to run multiple agents at once
  • AI output definitely needs human review

Why human review is necessary

Prompt injection is a serious risk at scale

All AI is at risk of prompt injection to some degree, but it’s particularly dangerous with agentic coding. All the state of the art today knows how to do is mitigate it at best. I don’t think it’s a reason to avoid AI, but it’s one of the top reasons to use AI thoughtfully and carefully for products that have any level of criticality.

OpenAI’s Codex documentation has a simple and good example of this.

Disabling the tests and claiming success

Beyond that, I’ve experienced multiple times different models happily disabling the tests or adding a println!("TODO add testing here") and claim success. At least this one is easier to mitigate with a second agent doing code review before it gets to human review.

Sandboxing

The “can I do X” prompting model that various interfaces default to is seriously flawed. Anthropic has a recent blog post on Claude Code changes in this area.

My take here is that sandboxing is only part of the problem; the other part is ensuring the agent has a reproducible environment, and especially one that can be run in IaaS environments. I think devcontainers are a good fit.

I don’t agree with the statement from Anthropic’s blog

without the overhead of spinning up and managing a container.

I don’t think this is overhead for most projects because Where it feels like it has overhead, we should be working to mitigate it.

Running code as separate login users

In fact, one thing I think we should popularize more on Linux is the concept of running multiple unprivileged login users. Personally for the tasks I work on, it often involves building containers or launching local VMs, and isolating that works really well with a full separate “user” identity. An experiment I did was basically useradd ai and running delegated tasks there instead. To log in I added %wheel ALL=NOPASSWD: /usr/bin/machinectl shell ai@ to /etc/sudoers.d/ai-login so that my regular human user could easily get a shell in the ai user’s context.

I haven’t truly “operationalized” this one as juggling separate git repository clones was a bit painful, but I think I could automate it more. I’m interested in hearing from folks who are doing something similar.

Parallel, IaaS-ready agents…with review

I’m today often running 2-3 agents in parallel on different tasks (with different levels of success, but that’s its own story).

It makes total sense to support delegating some of these agents to work off my local system and into cloud infrastructure.

In looking around in this space, there’s quite a lot of stuff. One of them is Ona (formerly Gitpod). I gave it a quick try and I like where they’re going, but more on this below.

Github Copilot can also do something similar to this, but what I don’t like about it is that it pushes a model where all of one’s interaction is in the PR. That’s going to be seriously noisy for some repositories, and interaction with LLMs can feel too “personal” sometimes to have permanently recorded.

Credentials should be on demand and fine grained for tasks

To me a huge flaw with Ona and one shared with other things like Langchain Open-SWE is basically this:

Sorry but: no way I’m clicking OK on that button. I need a strong and clearly delineated barrier between tooling/AI agents acting “as me” and my ability to approve and push code or even do basic things like edit existing pull requests.

Github’s Copilot gets this more right because its bot runs as a distinct identity. I haven’t dug into what it’s authorized to do. I may play with it more, but I also want to use agents outside of Github and I also am not a fan of deepening dependence on a single proprietary forge either.

So I think a key thing agent frontends should help do here is in granting fine-grained ephemeral credentials for dedicated write access as an agent is working on a task. This “credential handling” should be a clearly distinct component. (This goes beyond just git forges of course but also other issue trackers or data sources that may be in context).

Conclusion

There’s so much out there on this, I can barely keep track while trying to do my real job. I’m sure I’m not alone – but I’m interested in other’s thoughts on this!