Did you read the Claude Code Source for this? The timing is quite aligned. Hehehe
No comment, haha :)
Terrific and timely summary, thanks for continuing to do the great work that you do breaking down these models.
I'm curious if you've thought at all about what domains of activities would (or will) work well with the agentic architecture you've described here? My sense is that with coding in particular, it's relatively straightforward to "bind" (or harness!) the agent(s) due to the deterministic nature of coding itself. In contrast, OpenClaw presents a different application and my impression is that it's much less reliable, perhaps because the tasks involved are more open-ended.
There's a philosophical debate happening among AI researchers around whether R&D efforts should aim at "specialized intelligence" versus those who think we need truly general, universal models. (For background: https://arxiv.org/pdf/2602.23643v1) Knowing what you know about these tools, I'm curious where you find yourself in that debate.
Yeah, I think there are more degrees of freedom in OpenClaw, which makes it more chaotic / less reliable.
Besides coding, one other natural application is notetaking. I have a markdown knowledgebase and project planner (been an Obsidian user for many years), and I've been using agents recently to clean and maintain and filter it. Works great!
I had similar thoughts too. My impression is OpenClaw contains more use cases that the harness doesn’t capture yet. My peers and I all have slightly different things we work on in addition to code (GTM, outreach, social media, technical research, finance for a business).
Whenever we build out more of the harness, it leads to a jump in productivity. For instance, using a vector store to remember the content of deep research runs helps more than relying on OpenClaw's memory system.
Pretty similar to coding harnesses. Thanks Sebastian for the article, it helped clarify things.
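A minimal sketch of that vector-store memory idea (all names are made up, and a toy bag-of-words similarity stands in for a real embedding model):

```python
import math
from collections import Counter

def embed(text):
    # Toy "embedding": bag-of-words counts. A real setup would call an embedding model.
    return Counter(text.lower().split())

def cosine(a, b):
    # Cosine similarity between two sparse count vectors.
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

class ResearchMemory:
    """Stores deep-research reports and retrieves the most similar ones for a query."""

    def __init__(self):
        self.entries = []  # list of (embedding, report_text)

    def add(self, report):
        self.entries.append((embed(report), report))

    def query(self, question, k=1):
        q = embed(question)
        scored = sorted(self.entries, key=lambda e: cosine(q, e[0]), reverse=True)
        return [report for _, report in scored[:k]]

mem = ResearchMemory()
mem.add("Report on vector databases and embedding retrieval.")
mem.add("Report on social media outreach strategies.")
print(mem.query("how does embedding retrieval work?")[0])
```

The harness would call `query()` before a new run and splice the top hits into the context, instead of hoping the agent's built-in memory kept them around.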
Interesting, thanks for sharing. Actually, given that OpenClaw tries to be a jack of all trades, do you know how that's handled during OpenClaw development? I.e., do the devs run benchmarks against a common set of use cases to make sure that the capabilities improve on several/all of them? E.g., something like adding a JSON schema could make it better for coding contexts but worse at writing social media posts (in terms of tone and humor) or something like that.
At least in my local neighborhood it seems everyone comes up with a benchmark for each important case.
I'll say more about my research example: I have OpenClaw hit OpenAI's deep research API and save the reports to Obsidian. The first behavior I wanted was for it to know how to retrieve at all, then retrieval in, say, one turn. At some point I wanted it to pull those reports into context when I had related problems. Now I'm doing even fancier things, so it all builds in a cool way.
Oftentimes I'll be in the middle of working and I'll notice issues, so I'll switch to a separate channel and test different things until it works. Access to raw data via Obsidian makes it easy to pull a few more examples.
Bit of a stretch to call it benchmarking! But it does the job quite well.
Obsidian / markdown is actually a cool use case. I've been using Obsidian for years now (because of the exportable markdown format & backup purposes) and have built quite the knowledge base over the years in there, and it's kind of cool that this format comes in handy now. I.e., sometimes I ask Codex to re-organize and declutter certain note files or project todo lists, which is neat.
There’s a strong parallel here with trading systems people try to build. Everyone obsesses over the model—the signal, the indicator, the “edge.” But in practice, it’s the harness around it that determines whether it actually works.
A good trader already knows this intuitively. The idea isn’t enough. You need structure—rules for execution, memory of past mistakes, constraints on behavior, a way to filter noise and focus only on what matters. Without that, even a strong signal gets diluted by context, hesitation, or overreaction.
What you’re describing here is basically the same thing formalized. The model is the intuition. The harness is the discipline. And most of the real performance comes from how those two interact over time.
That’s why two people can look at the same market, the same level, the same setup—and one extracts consistency while the other churns. The difference isn’t what they see. It’s the system they’ve built around what they see.
A useful addition to session memory is 'flagging': user-specified or model-inferred flags along the lines of "this is (likely) going to be important in the future, flag it." Maintain a 'flags' file. Before each run, check the flags in case something flagged in the past is relevant for this cycle, and prompt the user for confirmation if necessary.
Yes, one could perhaps start with a flags.md file to keep track of prios
Yup.
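For what it's worth, a minimal sketch of that flags.md workflow (the file name, bullet format, and naive keyword matching are all assumptions, not anything OpenClaw actually does):

```python
from pathlib import Path

FLAGS_FILE = Path("flags.md")  # hypothetical location for the flags file

def add_flag(note):
    # Append a user- or model-flagged note as a markdown bullet.
    with FLAGS_FILE.open("a", encoding="utf-8") as f:
        f.write(f"- {note}\n")

def relevant_flags(task, flags_path=FLAGS_FILE):
    # Naive keyword overlap; a real harness might let the model judge relevance
    # and then prompt the user for confirmation.
    if not flags_path.exists():
        return []
    flags = [line[2:].strip()
             for line in flags_path.read_text(encoding="utf-8").splitlines()
             if line.startswith("- ")]
    words = set(task.lower().split())
    return [fl for fl in flags if words & set(fl.lower().split())]

add_flag("deploy scripts assume the staging database, not prod")
print(relevant_flags("update the deploy scripts"))
```

The point is just that the check happens at the start of every cycle, before the agent acts, rather than depending on the flag surviving in the context window.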
Software leaks memory. Models leak meaning. ;-D
Really liked the context bloat element — currently working on a blog about token compression and related developments so reading how you talk about context was very enriching!
thank you very much, great article!
The memory and delegation layer is where most agent implementations fall apart. Everyone focuses on the tool-calling loop, but the real differentiator is how the agent decides what context to carry forward and what to let go. It’s the same challenge humans face managing a project — the bottleneck isn’t doing the work, it’s knowing what matters right now.
I also have the same speculation that the best OSS model with Claude Code harness would do almost as well as the “native” paid version of Claude Code w/ Opus, although I have never systematically evaluated the coding quality differences between these two setups.
This makes me start to wonder: if all the coding LLMs eventually converge to the same quality, and the Claude Code harness is free to integrate with OSS models through 3rd-party API providers like MiniMax/GLM at a ridiculously lower price, what is the real moat of Anthropic?
I agree. At the same time, I think the models also benefit from harness-specific post-training to get the most out of the available tools and operate within the constraints. But yes, I think GLM-5 with a bit of Claude Code-specific post-training would probably be on par with the best Opus model right now in Claude Code.
I guess the moat is that it’s still expensive to train the next model, and it’s also not trivial to serve so many customers more or less reliably at that scale.
weird times where mistakes could be appreciated as proofs of work,
but I'd vote for feeding it back to a copywriting harness, typos like this break my reading flow a bit:
> handling context bloat beyond just cutting our summarizing information like regular chat UIs.
our→or
Thanks, fixed.
Great taxonomy!
I have two comments:
1. Context reduction vs. context engineering. The reduction framing (clipping, compression, dedup) follows naturally from using the full message history as the argument-passing mechanism between components. But it is not inevitable. If we start from optimising for context efficiency (which I believe is the core problem in agentic systems) we can imagine other solutions.
I have many notes in my llm knowledge base on this - for example: https://zby.github.io/commonplace/notes/session-history-should-not-be-the-default-next-context/
2. Compressing knowledge to fit a budget is different from narrowing interpretation space to increase reliability. I have also many notes on this - I even defined my own vocabulary: https://zby.github.io/commonplace/notes/definitions/distillation/ and https://zby.github.io/commonplace/notes/definitions/constraining/
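To make point 1 concrete, here is a toy sketch of the two conventions: handing a component the full message history versus passing a purpose-built context (the selector and the stub model are purely illustrative, not from the linked notes):

```python
def stub_llm(messages):
    # Stand-in for a real model call; just reports how much context it received.
    return f"context messages: {len(messages)}"

def run_subtask_full_history(llm, history, instruction):
    # Default convention: the entire session history becomes the next context.
    return llm(history + [{"role": "user", "content": instruction}])

def run_subtask_curated(llm, history, instruction, select):
    # Alternative: optimize for context efficiency up front by passing only
    # the messages a selector deems relevant to this subtask.
    relevant = [m for m in history if select(m, instruction)]
    return llm(relevant + [{"role": "user", "content": instruction}])

history = [
    {"role": "user", "content": "fix the parser bug"},
    {"role": "assistant", "content": "done, see commit abc123"},
    {"role": "user", "content": "now draft the release notes"},
]
print(run_subtask_full_history(stub_llm, history, "summarize the changes"))
print(run_subtask_curated(stub_llm, history, "summarize the changes",
                          select=lambda m, instr: "release" in m["content"]))
```

In the curated version there is nothing to clip or compress afterwards, because the bloated context was never constructed in the first place.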
Ah yes, there are intertwined topics. Maybe we can say “context engineering” is the main category, and “context reduction” and “context optimization” are the subcategories. And ideally we want to also optimize when reducing.
Hey 👋 Sebastian, great read. I have a request: I have hunted a lot for a book that teaches TTS/STT, and would love to see something from you on the topic as part of the “from scratch” series
Noted! But I can’t make any promises in terms of timeline. I have quite the list of topics I want to tackle. Thanks for suggesting it though!
The harness is everything. I've watched teams ship beautiful models that fail in production because the surrounding system is brittle—no memory across turns, no error recovery, no constraint enforcement.
What interests me: when memory is shallow (context window limits), the harness *has* to do the work of inference. It has to know what's actionable vs. noise, what to cache and what to discard. That's where most teams stumble.
The OpenClaw example makes sense—open-ended tasks expose weak harnesses fast. Coding is naturally bounded (compiler tells you what's wrong). But if your harness can't learn from that feedback, you're stuck.
Thanks for sharing. I've been following your blogs. Loved the way you break down concepts, easy to follow 🙂
Where does RAG fit into the picture?
Coding agent harnesses usually don't need RAG (Codex and Claude Code don't, afaik). Instead, you can just work with the repo info directly and use tools like grep or rg to retrieve the relevant info. However, if the current or prior session memory gets too long, it could make sense to add it.
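To illustrate the difference: what grep/rg give a coding agent is plain text search over the working tree, no index or embeddings required. A tiny Python stand-in (the file name and symbol below are made up for the demo):

```python
import tempfile
from pathlib import Path

def grep_repo(pattern, root, glob="*.py"):
    # Poor man's grep: scan files directly instead of embedding the repo.
    hits = []
    for path in sorted(Path(root).rglob(glob)):
        for lineno, line in enumerate(path.read_text(encoding="utf-8").splitlines(), 1):
            if pattern in line:
                hits.append(f"{path.name}:{lineno}:{line.strip()}")
    return hits

# Demo on a throwaway "repo" with one Python file.
with tempfile.TemporaryDirectory() as repo:
    Path(repo, "app.py").write_text("def load_config(path):\n    return open(path).read()\n")
    print(grep_repo("def load_config", repo))
```

Because the repo is the source of truth and search is exact, results never go stale the way a separately maintained vector index can.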
Thank you Prof!
I like to think of this as the systems-centric era. First was model-centric, where progress was thought to be only about model architectures. Then Andrew Ng coined data-centric, where, all else being equal, the training data was the differentiator. And now, building on both of those (they’re still important), comes the systems-centric era: the supporting systems that bring it all together and unlock these amazing capabilities.