Did you read the Claude Code Source for this? The timing is quite aligned. Hehehe
No comment, haha :)
Terrific and timely summary, thanks for continuing to do the great work that you do breaking down these models.
I'm curious if you've thought at all about what domains of activities would (or will) work well with the agentic architecture you've described here? My sense is that with coding in particular, it's relatively straightforward to "bind" (or harness!) the agent(s) due to the deterministic nature of coding itself. In contrast, OpenClaw presents a different application and my impression is that it's much less reliable, perhaps because the tasks involved are more open-ended.
There's a philosophical debate happening among AI researchers around whether R&D efforts should aim at "specialized intelligence" versus those who think we need truly general, universal models. (For background: https://arxiv.org/pdf/2602.23643v1) Knowing what you know about these tools, I'm curious where you find yourself in that debate.
Yeah, I think there are more degrees of freedom in OpenClaw, which makes it more chaotic / less reliable.
Besides coding, one other natural application is notetaking. I have a markdown knowledgebase and project planner (been an Obsidian user for many years), and I've been using agents recently to clean and maintain and filter it. Works great!
I had similar thoughts too. My impression is OpenClaw contains more use cases that the harness doesn’t capture yet. My peers and I all have slightly different things we work on in addition to code (GTM, outreach, social media, technical research, finance for a business).
Whenever we build out more of the harness, it leads to a jump in productivity. For instance, using a vector store to remember the content of deep research runs helps more than relying on OpenClaw's memory system.
Pretty similar to coding harnesses. Thanks Sebastian for the article, it helped clarify things.
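A minimal sketch of that vector-store memory idea (all names are made up, and a toy bag-of-words similarity stands in for a real embedding model):

```python
import math
from collections import Counter

def embed(text):
    # Toy "embedding": bag-of-words counts. A real setup would call an embedding model.
    return Counter(text.lower().split())

def cosine(a, b):
    # Cosine similarity between two sparse count vectors.
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

class ResearchMemory:
    """Stores deep-research reports and retrieves the most similar ones for a query."""

    def __init__(self):
        self.entries = []  # list of (embedding, report_text)

    def add(self, report):
        self.entries.append((embed(report), report))

    def query(self, question, k=1):
        q = embed(question)
        scored = sorted(self.entries, key=lambda e: cosine(q, e[0]), reverse=True)
        return [report for _, report in scored[:k]]

mem = ResearchMemory()
mem.add("Report on vector databases and embedding retrieval.")
mem.add("Report on social media outreach strategies.")
print(mem.query("how does embedding retrieval work?")[0])
```

The harness would call `query()` before a new run and splice the top hits into the context, instead of hoping the agent's built-in memory kept them around.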
Interesting, thanks for sharing. Actually, given that OpenClaw tries to be a jack of all trades, do you know how that's handled during OpenClaw development? I.e., do the devs run benchmarks against a common set of use cases to make sure that the capabilities improve on several/all of them? E.g., something like adding a JSON schema could make it better for coding contexts but worse at writing social media posts (in terms of tone and humor) or something like that.
At least in my local neighborhood it seems everyone comes up with a benchmark for each important case.
I'll say more about my research example: I have OpenClaw hit OpenAI's deep research API and save the reports to Obsidian. The first behavior I wanted was for it to know how to retrieve at all, then retrieval in, say, one turn. At some point I wanted it to pull those reports into context when I had related problems. Now I'm doing even fancier things, so it all builds in a cool way.
Oftentimes I'll be in the middle of working and I'll notice issues, so I'll switch to a separate channel and test different things until it works. Access to raw data via Obsidian makes it easy to pull a few more examples.
Bit of a stretch to call it benchmarking! But it does the job quite well.
Obsidian / markdown is actually a cool use case. I've been using Obsidian for years now (because of the exportable markdown format & backup purposes) and have built quite the knowledge base over the years in there, and it's kind of cool that this format comes in handy now. I.e., sometimes I ask Codex to re-organize and declutter certain note files or project todo lists, which is neat.
There’s a strong parallel here with trading systems people try to build. Everyone obsesses over the model—the signal, the indicator, the “edge.” But in practice, it’s the harness around it that determines whether it actually works.
A good trader already knows this intuitively. The idea isn’t enough. You need structure—rules for execution, memory of past mistakes, constraints on behavior, a way to filter noise and focus only on what matters. Without that, even a strong signal gets diluted by context, hesitation, or overreaction.
What you’re describing here is basically the same thing formalized. The model is the intuition. The harness is the discipline. And most of the real performance comes from how those two interact over time.
That’s why two people can look at the same market, the same level, the same setup—and one extracts consistency while the other churns. The difference isn’t what they see. It’s the system they’ve built around what they see.
A useful addition to session memory is 'flagging': user-specified or model-inferred flags along the lines of "this is (likely) going to be important in the future, flag it." Maintain a 'flags' file. Before each run, check the flags in case something flagged in the past is relevant for this cycle, and prompt the user for confirmation if necessary.
Yes, one could perhaps start with a flags.md file to keep track of prios
Yup.
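For what it's worth, a minimal sketch of that flags.md workflow (the file name, bullet format, and naive keyword matching are all assumptions, not anything OpenClaw actually does):

```python
from pathlib import Path

FLAGS_FILE = Path("flags.md")  # hypothetical location for the flags file

def add_flag(note):
    # Append a user- or model-flagged note as a markdown bullet.
    with FLAGS_FILE.open("a", encoding="utf-8") as f:
        f.write(f"- {note}\n")

def relevant_flags(task, flags_path=FLAGS_FILE):
    # Naive keyword overlap; a real harness might let the model judge relevance
    # and then prompt the user for confirmation.
    if not flags_path.exists():
        return []
    flags = [line[2:].strip()
             for line in flags_path.read_text(encoding="utf-8").splitlines()
             if line.startswith("- ")]
    words = set(task.lower().split())
    return [fl for fl in flags if words & set(fl.lower().split())]

add_flag("deploy scripts assume the staging database, not prod")
print(relevant_flags("update the deploy scripts"))
```

The point is just that the check happens at the start of every cycle, before the agent acts, rather than depending on the flag surviving in the context window.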
Software leaks memory. Models leak meaning. ;-D
Really liked the context bloat element — currently working on a blog about token compression and related developments so reading how you talk about context was very enriching!
thank you very much, great article!
The memory and delegation layer is where most agent implementations fall apart. Everyone focuses on the tool-calling loop, but the real differentiator is how the agent decides what context to carry forward and what to let go. It’s the same challenge humans face managing a project — the bottleneck isn’t doing the work, it’s knowing what matters right now.
I also have the same speculation that the best OSS model with Claude Code harness would do almost as well as the “native” paid version of Claude Code w/ Opus, although I have never systematically evaluated the coding quality differences between these two setups.
This makes me start to wonder: if all the coding LLMs eventually converge to the same quality, and the Claude Code harness is free to integrate with OSS models through 3rd-party API providers like MiniMax/GLM at a ridiculously lower price, what is the real moat of Anthropic?
I agree. At the same time, I think the models also benefit from harness-specific post-training to get the most out of the available tools and operate within the constraints. But yes, I think GLM-5 with a bit of Claude Code-specific post-training would probably be on par with the best Opus model right now in Claude Code.
I guess the moat is that it’s still expensive to train the next model, and it’s also not trivial to serve so many customers more or less reliably at that scale.
weird times where mistakes could be appreciated as proofs of work,
but I'd vote for feeding it back to a copywriting harness, typos like this break my reading flow a bit:
> handling context bloat beyond just cutting our summarizing information like regular chat UIs.
our→or
Thanks, fixed.
Great taxonomy!
I have two comments:
1. Context reduction vs. context engineering. The reduction framing (clipping, compression, dedup) follows naturally from using the full message history as the argument-passing mechanism between components. But it is not inevitable. If we start from optimising for context efficiency (which I believe is the core problem in agentic systems) we can imagine other solutions.
I have many notes in my llm knowledge base on this - for example: https://zby.github.io/commonplace/notes/session-history-should-not-be-the-default-next-context/
2. Compressing knowledge to fit a budget is different from narrowing interpretation space to increase reliability. I have also many notes on this - I even defined my own vocabulary: https://zby.github.io/commonplace/notes/definitions/distillation/ and https://zby.github.io/commonplace/notes/definitions/constraining/
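To make point 1 concrete, here is a toy sketch of the two conventions: handing a component the full message history versus passing a purpose-built context (the selector and the stub model are purely illustrative, not from the linked notes):

```python
def stub_llm(messages):
    # Stand-in for a real model call; just reports how much context it received.
    return f"context messages: {len(messages)}"

def run_subtask_full_history(llm, history, instruction):
    # Default convention: the entire session history becomes the next context.
    return llm(history + [{"role": "user", "content": instruction}])

def run_subtask_curated(llm, history, instruction, select):
    # Alternative: optimize for context efficiency up front by passing only
    # the messages a selector deems relevant to this subtask.
    relevant = [m for m in history if select(m, instruction)]
    return llm(relevant + [{"role": "user", "content": instruction}])

history = [
    {"role": "user", "content": "fix the parser bug"},
    {"role": "assistant", "content": "done, see commit abc123"},
    {"role": "user", "content": "now draft the release notes"},
]
print(run_subtask_full_history(stub_llm, history, "summarize the changes"))
print(run_subtask_curated(stub_llm, history, "summarize the changes",
                          select=lambda m, instr: "release" in m["content"]))
```

In the curated version there is nothing to clip or compress afterwards, because the bloated context was never constructed in the first place.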
Ah yes, there are intertwined topics. Maybe we can say “context engineering” is the main category, and “context reduction” and “context optimization” are the subcategories. And ideally we want to also optimize when reducing.
Hey 👋 Sebastian, great read. I have a request: I have hunted a lot for a book that teaches TTS/STT, and would love to see something from you on the topic as part of the “from scratch” series
Noted! But I can’t make any promises in terms of timeline. I have quite the list of topics I want to tackle. Thanks for suggesting it though!
The harness is everything. I've watched teams ship beautiful models that fail in production because the surrounding system is brittle—no memory across turns, no error recovery, no constraint enforcement.
What interests me: when memory is shallow (context window limits), the harness *has* to do the work of inference. It has to know what's actionable vs. noise, what to cache and what to discard. That's where most teams stumble.
The OpenClaw example makes sense—open-ended tasks expose weak harnesses fast. Coding is naturally bounded (compiler tells you what's wrong). But if your harness can't learn from that feedback, you're stuck.
Thanks for sharing. I've been following your blogs. Loved the way you break down concepts, easy to follow 🙂
Where does RAG fit into the picture?
Coding agent harnesses usually don't need RAG (Codex and Claude Code don't, afaik). Instead, you can just work with the repo info directly and use tools like grep or rg to retrieve the relevant info. However, if the current or prior session memory gets too long, it could make sense to add it.
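To illustrate the difference: what grep/rg give a coding agent is plain text search over the working tree, no index or embeddings required. A tiny Python stand-in (the file name and symbol below are made up for the demo):

```python
import tempfile
from pathlib import Path

def grep_repo(pattern, root, glob="*.py"):
    # Poor man's grep: scan files directly instead of embedding the repo.
    hits = []
    for path in sorted(Path(root).rglob(glob)):
        for lineno, line in enumerate(path.read_text(encoding="utf-8").splitlines(), 1):
            if pattern in line:
                hits.append(f"{path.name}:{lineno}:{line.strip()}")
    return hits

# Demo on a throwaway "repo" with one Python file.
with tempfile.TemporaryDirectory() as repo:
    Path(repo, "app.py").write_text("def load_config(path):\n    return open(path).read()\n")
    print(grep_repo("def load_config", repo))
```

Because the repo is the source of truth and search is exact, results never go stale the way a separately maintained vector index can.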
Thank you Prof!
I like to think of this as the systems-centric era. First was model-centric, where progress was thought to be only about model architectures. Then Andrew Ng coined data-centric, where, all else being equal, the training data was the differentiator. And now, building on both of those (they’re still important), comes the systems-centric era: the supporting systems that bring it all together and unlock these amazing capabilities.