Did you read the Claude Code Source for this? The timing is quite aligned. Hehehe
No comment, haha :)
Terrific and timely summary, thanks for continuing to do the great work that you do breaking down these models.
I'm curious if you've thought at all about what domains of activities would (or will) work well with the agentic architecture you've described here? My sense is that with coding in particular, it's relatively straightforward to "bind" (or harness!) the agent(s) due to the deterministic nature of coding itself. In contrast, OpenClaw presents a different application and my impression is that it's much less reliable, perhaps because the tasks involved are more open-ended.
There's a philosophical debate happening among AI researchers around whether R&D efforts should aim at "specialized intelligence" versus those who think we need truly general, universal models. (For background: https://arxiv.org/pdf/2602.23643v1) Knowing what you know about these tools, I'm curious where you find yourself in that debate.
Yeah, I think there are more degrees of freedom in OpenClaw, which makes it more chaotic / less reliable.
Besides coding, one other natural application is notetaking. I have a markdown knowledgebase and project planner (been an Obsidian user for many years), and I've been using agents recently to clean and maintain and filter it. Works great!
This sounds like a very good idea for a next book, "Building coding agents from scratch"
Yes, it’s a natural sequence from building LLMs → reasoning models → coding agents from scratch ☺️
Thank you. I just purchased the reasoning-models book; count me in if the agents-from-scratch book goes ahead. I'll take a look at your mini-agent repo. For me, even understanding how to go from a model outputting Python code to actually executing it will be a big help.
A repo for building agents from scratch without a framework would be pretty interesting :))
I built https://github.com/rasbt/mini-coding-agent with my reasoning model first, but the 0.6B size was a tad too small, hence the Ollama backend for experimentation. But yeah, I agree with you.
This is a very accurate and reliable approach to the problem of coding agents. The distinction between LLM, reasoning model, and agent is crucial to understanding their functionality. The concept of an "agent harness" and the six building blocks aligns with my axioms about the need for a hierarchical architecture and modularity.
The article confirms my own view that the effectiveness of an AI system depends not only on the model itself but also on the surrounding control layer, context management, tools, and feedback mechanisms. The absence of these elements leads to the emergent problems we observed in Claude's case.
Thanks! Regarding "emergent problems we observed in Claude's case" you mean when using Claude outside Claude Code?
Yes, I'm referring to both Claude's emergent behaviors (e.g., in the emotion research and the 'Agents of Chaos' report) and the vulnerabilities in Claude Code revealed after the leak. Both of these areas underscore the importance of a robust agent architecture, which you describe.
A useful addition to session memory is "flagging": user-specified or model-inferred flags along the lines of "this is (likely) going to be important in the future, flag it." Maintain a "flags" file; before each run, check it in case something flagged in the past is relevant for this cycle, and prompt the user for confirmation if necessary.
Do you think that tools like https://github.com/yamadashy/repomix are still necessary if you deal with a big project repository which itself includes several subprojects and has in total more than 1 million tokens?
Hm, I don't think that's necessary at all. I actually even worry that this would be harmful in terms of losing the file hierarchy. That's a task that should be done by the harness if necessary.
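Concretely, instead of flattening the repo into one million-token blob, the harness can expose navigation tools and let the model pull in only the files it needs, preserving the hierarchy. A rough sketch (tool names are illustrative, not from any specific harness):

```python
import os

def list_files(root: str = ".") -> list[str]:
    """Tool: show the directory structure instead of a flat text dump."""
    paths = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            paths.append(os.path.relpath(os.path.join(dirpath, name), root))
    return sorted(paths)

def read_file(path: str, max_chars: int = 20_000) -> str:
    """Tool: read one file on demand, truncated to keep the context small."""
    with open(path, encoding="utf-8", errors="replace") as f:
        return f.read()[:max_chars]
```

The model first calls `list_files()` to orient itself, then `read_file()` on whichever subproject matters, so the context stays within budget without ever losing the file hierarchy.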