Every company, big or small, currently has a mandate to lean into the AI future and provide some offering around agentic AI. Stripped down to the essentials, though, an agent is just a while loop with some tool calls. It can’t be that complicated, right? Given the simplicity, one might expect these agent offerings to be relatively homogeneous, sharing roughly the same abstractions under different brand names. After reading through the source code of a dozen-plus agent SDKs, I found that this is not the case. Every company carries its own biases and incentives, leading to a proliferation of different philosophies for designing an AI agent. Choosing the right SDK for your own purposes actually comes down to asking, “How AGI-pilled have you become?”
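To make the "while loop with some tool calls" claim concrete, here is a minimal sketch in Python. Everything in it is hypothetical: `TOOLS`, `fake_model`, and `run_agent` are illustrative names, and the scripted `fake_model` stands in for a real LLM call. It is not how any particular SDK implements its loop.

```python
from typing import Callable

# Hypothetical tool registry; the names and functions are illustrative only.
TOOLS: dict[str, Callable[[str], str]] = {
    "upper": lambda text: text.upper(),
    "reverse": lambda text: text[::-1],
}


def fake_model(history: list[str]) -> dict:
    """Stand-in for an LLM call: scripted decisions instead of a real model."""
    if len(history) == 1:
        return {"tool": "upper", "arg": history[-1]}
    if len(history) == 2:
        return {"tool": "reverse", "arg": history[-1]}
    return {"final": history[-1]}


def run_agent(task: str, max_steps: int = 10) -> str:
    """The whole 'agent': ask the model, run the tool it picks, repeat."""
    history = [task]
    steps = 0
    while steps < max_steps:
        decision = fake_model(history)
        if "final" in decision:
            return decision["final"]
        # Look up the requested tool and feed its result back into the history.
        result = TOOLS[decision["tool"]](decision["arg"])
        history.append(result)
        steps += 1
    raise RuntimeError("step budget exhausted without a final answer")


print(run_agent("abc"))
```

The loop itself is a dozen lines. What differs between SDKs is everything wrapped around it: how tools are declared, how history is managed, and when the loop is allowed to stop.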
A modern LLM recently disproved an Erdős conjecture that specialists had been unable to crack despite decades of effort. Mathematicians around the world sat in awe as their careers flashed before their eyes, wondering if their job might be the next one taken down by AI. Interestingly, that same model still struggles when asked how many times the letter r appears in the word “strawberries”, sometimes telling you there are ‘two’ with full confidence. Models which have been RLHF’ed to death have learned to always say ‘three’. In that case, you can ask for the number of r’s in ‘blueberry’, and the model will continue to (incorrectly) return ‘three’. Training with more parameters, more data, or more layers does not fix this. If anything, the failure mode tends to become worse, as the souped-up model simply wraps more elegant prose around the answer without actually changing its stance.
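For contrast, the ground truth here is a one-line deterministic check. A minimal sketch in Python, where `count_letter` is just an illustrative helper name:

```python
def count_letter(word: str, letter: str) -> int:
    """Count case-insensitive occurrences of a single letter in a word."""
    return word.lower().count(letter.lower())


for word in ("strawberries", "blueberry"):
    print(word, count_letter(word, "r"))
```

This prints 3 for “strawberries” and 2 for “blueberry”, which is why a memorized ‘three’ is right on the first question and wrong on the second.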
AI agents live on a jagged frontier: some days entire features are built while you’re on a coffee break, and other days they spin for an hour only to confidently delete a file you explicitly told them not to touch. Furthermore, modern agents are composed of many parts, each one expanding the surface area for compounding failure. While a 95%-reliable component sounds fine, chaining ten of them together succeeds only about 60% of the time if failures are independent (0.95^10 ≈ 0.60), which is closer to a coin flip than a finished product. Given such probabilistic components, is it possible to increase the trustworthiness of an agent when each part is so decidedly untrustworthy?
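The compounding arithmetic is easy to check. The sketch below assumes every step succeeds independently with the same probability, which is a simplification (real failures are often correlated); `chain_reliability` is an illustrative name.

```python
def chain_reliability(per_step: float, steps: int) -> float:
    """Probability that all steps succeed, assuming independent failures."""
    return per_step ** steps


for n in (1, 5, 10, 20):
    print(f"{n:>2} steps at 95% each -> {chain_reliability(0.95, n):.1%} end to end")
```

Under that assumption, ten steps land at roughly 60% and twenty steps at roughly 36%, so reliability erodes quickly as the chain grows.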
The market for AI agent toolkits in 2026 is booming. LangGraph, CrewAI, Agno, Claude Agent SDK, Google ADK: the list just keeps growing. Whether you call them agent frameworks, orchestration engines, or agent harnesses, they all share a common goal: to make it easy for developers to build agents. They compete on aspects such as the number of tool integrations, guardrails against prompt injection, and time-to-first-demo. The engineering is undoubtedly impressive. But what about optimizing for the end user? After a deep code review of the leading frameworks, I found a gap that none of them addresses. What happens when an agent faces requests that are vague, incomplete, or just plain confusing?
The prevailing narrative on social media is that AGI is right around the corner, but cracks begin to appear as soon as we consider the type of progress we have made and just how far we have to go to create something useful for most people. Let’s examine the evidence. Many researchers base their super-intelligence timelines on the compounding growth in benchmark scores. But benchmarks aren’t real life. OK, then what about AI systems from big (and small) labs winning IMO gold medals? Or what about coding agents deployed in production building entire applications in one shot? Surely these are pretty real, right?