The prevailing trend on social media is that AGI is right around the corner, but cracks begin to appear as soon as we consider the type of progress we have made and just how far we have to go to create something useful for most people. Let’s examine the evidence. Many researchers base their super-intelligence timelines on the compounding growth in benchmark progress. But benchmarks aren’t real life. OK, then what about AI systems from big (and small) labs winning IMO gold medals? Or what about coding agents deployed in production building entire applications in one shot? Surely, these are pretty real, right?
Before we dive deeper, let’s level set on definitions. My version of AGI is not human-level consciousness or a sci-fi superpower, but simply a practically useful AI agent that can complete junior employee tasks end to end. And even then, the idea that we will start to see such agents replacing entire industries in just 1-2 years feels preposterous. The progress we’ve made so far in verifiable domains is genuinely impressive and already impactful1, but I would argue that these gains will not generalize to other domains. Concretely, practical intelligence cannot be achieved by just turning a few more knobs or optimizing a few more parameters, but instead requires a different approach which embraces uncertainty by explicitly accounting for and dealing with the intrinsic ambiguity of the real world.
Math and coding have inherent advantages that make them easier to solve. To start, pre-training data is readily available with millions of solved problems, documented algorithms, and open-source repositories to draw from. Additionally, these domains offer verifiable rewards: code either runs or it doesn’t, mathematical proofs are either valid or they aren’t. Perhaps most importantly, the audience is forgiving. Developers and mathematicians are willing to fill in the gaps when the model fails because even partial success can be tremendously helpful. Junior programmers exist (for now) despite their constant mistakes because taking a first stab at a larger task still ends up saving time and effort that would otherwise have to be spent by someone more senior.
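To make the contrast concrete, here is a minimal sketch (my own illustration, not something pulled from a real training pipeline) of a verifiable reward for code generation, assuming the candidate defines a function named `solve`: the submission either passes every test or it scores zero.

```python
def verifiable_reward(candidate_src: str, test_cases: list) -> float:
    """Return 1.0 if the candidate `solve` function passes every test, else 0.0."""
    namespace = {}
    try:
        exec(candidate_src, namespace)      # define the candidate function
        solve = namespace["solve"]          # hypothetical entry point named `solve`
        for args, expected in test_cases:
            if solve(*args) != expected:
                return 0.0                  # wrong answer, no partial credit
    except Exception:
        return 0.0                          # crashes count as failure too
    return 1.0

# Example: a model-generated attempt at integer addition.
candidate = "def solve(a, b):\n    return a + b\n"
print(verifiable_reward(candidate, [((1, 2), 3), ((-5, 5), 0)]))  # -> 1.0
```

Most workplace tasks offer no equivalent of `test_cases`, which is exactly where these advantages start to break down.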
Dealing with the real world is significantly harder. We can see this empirically in that agents outside technical domains consistently underperform their hyped claims. This is due to the latent complexity of the real world, which is hard to overcome by simply adding more training data or compute. In other words, the nuances of reality are difficult to capture because reality behaves like an iceberg: most of the details lie beneath the surface, leading to fundamental ambiguity about what to do in any given situation. In theory, this ambiguity can be resolved by providing more data during training or additional context during inference. But in practice, the amount of data and context required is far more than anyone expects.
The problem with all this ambiguity is that it destroys all the advantages we have when building agents for coding. For most domains, there aren’t piles of training data just sitting around because unwritten rules and tacit knowledge drive the engine of productivity in human organizations. The blueprint to this social operating system doesn’t exist in any training corpus, which produces ongoing contextual ambiguity. Furthermore, most domains don’t have stable, verifiable answers which can be scalably generated and confirmed. Instead, the correct answer is often time and situation dependent, which we can categorize as temporal ambiguity. Finally, communication ambiguity arises as a result of poor transmission of intent when humans provide instructions. And the average consumer isn’t so forgiving when the AI inevitably makes mistakes due to the user’s own poor communication.
Contextual Ambiguity
Whenever a collection of humans gathers to form a group, a social contract soon appears which governs how members of the organization interact with each other and how they prioritize which activities to pursue. Within the modern workplace, some of this knowledge lives in Slack threads, some in Notion pages, some in email chains, and yet more in other tools. Bringing together these scattered pieces is already challenging, but the real problem is that most tribal knowledge is tacit. The majority of the norms and values of the group remain unwritten, and sometimes are simply unwritable in nature.
Consider interpersonal dynamics, the kind of common sense that exists on a personal level but never appears in any document. Brian doesn’t want to talk anymore because he’s hungry and lunch just arrived. Sarah seems off because her kid was sick all night, so you postpone difficult conversations. Jacob the CTO is not a morning person, so you never schedule meetings with him before 10am. Colleague moods determine how you should approach them, when to push forward with an idea, when to table a discussion. This information certainly isn’t written down anywhere, yet it’s crucial for effective workplace interaction.
Scale this up to company-level culture and the complexity multiplies. Professional norms that seem obvious are actually learned behaviors that vary dramatically between organizations. Meeting hierarchies differ by company — in some, junior employees speak first to avoid anchoring bias, while in others they defer until seniors have shared their thoughts. “End of day” means 5 PM sharp at law firms but stretches to midnight at startups, or really means “before tomorrow’s 9 AM standup.” Reply All might signal transparency at one company and obnoxiousness at another. The choice of medium (Discord vs. formal email) encodes urgency, hierarchy, and political sensitivity. When the CEO says “interesting idea,” it means “drop everything and work on this,” but that translation appears in no employee handbook.
The water we swim in provides context so pervasive we forget to mention it. Humans acquire these details through micro-interactions accumulated over months and years. We browse company websites absorbing tone and values. We attend all-hands meetings noting not just what was said, but how it was stated. We have lunch conversations, read facial expressions, and observe speaking patterns that surface someone’s real priorities. Even perfect user articulation can’t capture social context that’s so omnipresent people don’t recognize its existence. This invisible infrastructure that humans absorb through osmosis remains opaque to systems trained on text because you can’t post-train on what was never written down.
If we were to take a stab at a solution, a prerequisite is multimodal perception combined with persistent memory to build accurate social models. These contexts demand systems that observe meetings through audio, interpret body language through video, and parse unstructured data from PDFs to company memos. We need agents with persistent organizational memories that capture tacit knowledge and make sense of it even when users aren’t actively chatting. The agent must see Pamela’s eye roll at Friday afternoon meetings, hear the CEO’s tone shift when discussing certain topics, and remember that Brian gets hangry at 12:30. Even as we solve for static hidden context, we face further challenges.
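As a rough sketch of what that might look like (names like `OrgMemory` and `Observation` are hypothetical, invented purely for illustration), the persistent-memory piece boils down to continuously logging multimodal observations against the people and rituals they describe, so later decisions can be conditioned on them:

```python
from dataclasses import dataclass, field
from datetime import datetime

@dataclass
class Observation:
    """One piece of tacit context captured from audio, video, or documents."""
    subject: str        # e.g. "Brian", "Friday all-hands"
    modality: str       # "audio", "video", or "text"
    note: str           # e.g. "gets irritable around 12:30 before lunch"
    when: datetime = field(default_factory=datetime.now)

class OrgMemory:
    """Hypothetical persistent store of organizational tacit knowledge."""
    def __init__(self):
        self._log = []

    def record(self, obs: Observation) -> None:
        self._log.append(obs)

    def context_for(self, subject: str) -> list:
        """Everything we have learned about a person or ritual, oldest first."""
        return [o.note for o in self._log if o.subject == subject]

memory = OrgMemory()
memory.record(Observation("Brian", "video", "gets hangry around 12:30"))
memory.record(Observation("Jacob", "text", "CTO, avoids meetings before 10am"))
print(memory.context_for("Brian"))  # -> ['gets hangry around 12:30']
```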
Temporal Ambiguity
Ground truth in the real world is a moving target, and a fast one at that. Often the “correct” answer is unknowable because even human experts disagree — should we enter a new market, approve this candidate for the role, or add a particular product feature? How my particular company calculates conversion rates differs from yours, and no amount of extended reasoning, test-time compute, or web searches will uncover our specific formula. The correctness of these decisions cannot be verified because what works in one case does not apply in another. Reality shifts based on timing, circumstances, and the very act of decision-making itself.
Time destroys solutions in ways static models can’t handle. Market dynamics illustrate this perfectly: the ideal house you identify today is gone before the decision gets approved. Human preferences also evolve constantly: last year’s perfect birthday gift for my nephew (dinosaur encyclopedias) has now become this year’s embarrassment because space is cool now and dinosaurs are for babies. The system must track not just changing preferences but the social dynamics of maturation, which means understanding when interests become “uncool” and why that matters. Consider something as simple as choosing lunch. What I want to eat evolves throughout the week, conditional on what I ate the day before. A successful prediction for Tuesday invalidates itself for Wednesday because I already had that meal. Some predictions are ruined by timing issues, while others invalidate themselves by succeeding.
This self-invalidation problem extends far beyond personal preferences. Strategic decisions alter the landscape they analyze. Netflix’s algorithm correctly identifying demand for true crime documentaries could flood the platform with similar content, changing brand perception and driving away users who valued catalog diversity. Success changes the game itself. Teaching AI to direct Vogue photo shoots illustrates this perfectly. Fashion operates through counter-trend movements. If AI-generated covers become ubiquitous, the cutting edge shifts to deliberately amateur, hand-crafted aesthetics. The “correct” output inverts based on AI prevalence itself. Today’s innovation becomes tomorrow’s cliché precisely by succeeding.
Training on yesterday’s data to solve tomorrow’s problems is like using last year’s map to navigate a city under construction. This volatility means benchmarks measuring static capabilities miss the point entirely because success requires navigating uncertainty, not optimizing accuracy. An AI might correctly analyze market entry profitability, but the announcement triggers competitive responses that invalidate every calculation. Since there can never be a single “correct” answer to many real-world questions, writing a verifier is impossible.
What we’re dealing with is an ever-changing environment whereby the actions of the actor also influence the state of the world. The shift from a multi-armed bandit to full reinforcement learning precisely captures this changing paradigm. To put it more plainly, this complexity is something we’re already used to dealing with in machine learning. The performance gap appears, though, because true adaptability requires continual learning that updates understanding in real time, going beyond fixed, off-policy reward signals toward rewards that evolve with the environment. The system must fundamentally reconceptualize its approach as contexts shift. Adaptation means not just updating weights but developing new strategies that recognize when the rules of the game have changed.
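A toy example (my framing of the bandit-versus-RL point, not the author’s): in a stationary bandit, an arm’s payoff never depends on what you did yesterday, while in the lunch scenario each successful choice changes tomorrow’s state and devalues the same action.

```python
# Stationary bandit: pulling an arm never changes the payoff of the arms.
BASE_REWARDS = {"tacos": 0.9, "salad": 0.6}

def bandit_reward(action: str) -> float:
    return BASE_REWARDS[action]

# State-dependent environment: yesterday's successful choice lowers today's
# reward for the same action, because I already ate that meal.
def lunch_reward(action: str, yesterday: str) -> float:
    base = BASE_REWARDS[action]
    return base * 0.2 if action == yesterday else base

yesterday = ""
for day in ["Mon", "Tue", "Wed"]:
    action = "tacos"                            # greedy policy learned from static data
    static = bandit_reward(action)              # what the bandit view predicts: always 0.9
    actual = round(lunch_reward(action, yesterday), 2)
    print(day, action, static, actual)          # Mon 0.9 0.9 / Tue 0.9 0.18 / Wed 0.9 0.18
    yesterday = action                          # the action itself shifts the state
```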
Communication Ambiguity
Perhaps the biggest hindrance to general purpose agents isn’t a lack of technology, but rather a contradiction in how humans naturally operate. Whereas instructing a model requires spending time carefully crafting the perfect prompt, most humans won’t adapt their speaking style for friends or colleagues, much less for random machines they have never interacted with before. Despite what AI enthusiasts claim, prompt engineering is not natural language — it’s engineering. Natural language evolved for humans who share massive amounts of context, which creates fundamental uncertainty for systems that don’t.
This becomes an insurmountable barrier because humans are unwilling to change. Human inflexibility and stubbornness surround us every day. After 40 years of keyboards being ubiquitous, millions still hunt and peck with two fingers. Behavioral change at scale is nearly impossible. People regularly fall for phishing emails despite mandatory security training that explicitly warns against them. Office workers don’t know keyboard shortcuts like Ctrl+C, instead navigating through right-click menus for every copy-paste operation. Some print emails to read them. Others write passwords on sticky notes. If we can’t get people to learn basic computer skills after decades, expecting them to master prompt engineering is a non-starter.
But the problem runs deeper than unwillingness. Sometimes, users literally can’t provide the right information for agents to succeed. Even motivated users fail because they don’t know which details are relevant. When someone says “make it pop” in a design review, those three words carry entirely different meanings at a minimalist tech startup versus a children’s toy company. At the startup, it might mean subtle animation or increased white space. At the toy company, it means rainbow colors and cartoon characters. A non-designer lacks the vocabulary to articulate what they actually want.
This is the “unknown unknowns” problem that emerges whenever we expand into new domains. When asked to “run the numbers on this campaign,” a human expert intuitively knows whether to focus on ROI, engagement rates, or customer acquisition cost based on who’s asking and what meeting it’s for. An AI agent receiving the same request has no such context. Furthermore, natural language itself is lossy compression that assumes shared understanding. This is why we have so many words for similar concepts—ambiguity is baked into human communication.
Human nature represents the ultimate showstopper for AGI in the near future. Expecting average users to change their communication habits is futile. The gap between technical and non-technical domains creates asymmetric expectations that doom general-purpose agents. Unlike coding environments where partial solutions have value, real-world tasks often have binary success criteria. The email is sent to the right user or it’s not. The numbers presented in the dashboard are either correct or they’re not. This fundamental difference in forgiveness means that the successes in technical domains won’t translate to general use.
The only viable solution requires agents that proactively extract missing context through intelligent questioning. An agent must know to ask “Is this for the quarterly board review or the team standup?” without being prompted, and also know when to abstain from asking to avoid becoming annoying. Consequently, we need agents that exhibit agency toward figuring out what information is missing to successfully complete a task, using explicit dialogue states rather than probabilistic guesses. When we imbue agents with built-in priors about common scenarios, we enable proactive clarification rather than guessing or failing silently, because the agent can go into a situation with a point of view on what to expect. Building such systems requires a different mindset that embraces the human-in-the-loop rather than seeing it as a blocker to scale.
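A minimal sketch of the “explicit dialogue state” idea (my own assumptions; the slot names and the question cap are invented for illustration): the agent carries a prior over which details a task usually needs, asks only about the ones that are both missing and required, and limits how many questions it fires off so it doesn’t become annoying.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Slot:
    """One piece of task context the agent believes it needs."""
    name: str
    required: bool                  # will the task likely fail without it?
    value: Optional[str] = None     # filled in from the user or from memory

def clarifying_questions(slots: list, max_questions: int = 1) -> list:
    """Ask about missing required slots, capped to avoid pestering the user."""
    missing = [s for s in slots if s.required and s.value is None]
    return [f"Quick question: what should I use for {s.name}?" for s in missing[:max_questions]]

# "Run the numbers on this campaign" arrives with almost no context attached.
state = [
    Slot("audience (board review vs. team standup)", required=True),
    Slot("metric (ROI, engagement, or CAC)", required=True),
    Slot("chart color scheme", required=False),
]
print(clarifying_questions(state))
# -> ['Quick question: what should I use for audience (board review vs. team standup)?']
```

Real systems would also weigh the cost of asking against the risk of guessing wrong, but even this crude version separates an agent that asks the board-review question from one that silently ships the wrong dashboard.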
Achieving Useful AI Requires a Pivot
While progress is possible on all the problems above, we won’t get there without fundamentally changing our approach — mainly a recognition that there are still substantial missing pieces to the puzzle which require attention. We move closer to success as we integrate multiple capabilities that together address the three massive unsolved challenges. First, proactive clarification that goes beyond answering what’s asked to understanding what should be asked. Agents must know to ask “Is this presentation internal or client-facing?” unprompted, reconstructing context from minimal cues. Second, multi-modal perception with persistent memory capturing full work context. Systems must attend meetings, interpret body language, remember interactions, and model organizational culture over time. Third, RL-based agents must exhibit continual learning that enables real-time adaptation to shifting circumstances and changing contexts.
The integration challenge is that we need all three solutions working together, not separately. For AI agents to work, most enterprises will require a bridge between the AI and their specific workflows. This means tooling to connect different systems, pulling in the right amount of data rather than all of it, handling security and permissions properly for software development, and having deep domain knowledge tied to specific use cases. The last mile of making AI agents work in real, highly variable and hostile environments is incredibly hard, and it’s increasingly the most valuable part of the whole process. Going beyond technology, we must consider customer support tailored to each domain, service-level agreements, liability clauses, customized sales motions, and aligned partnerships. And every one of these steps faces uncertainty in an ever-changing environment that isn’t captured by benchmarks.
The path forward requires us to shift our focus from climbing leaderboards to navigating ambiguity. Instead of celebrating another percentage point on standardized tests, we should be building systems that can handle the messy, implicit, and frankly broken nature of human work. Until we acknowledge that there are always more unknown details in reality, we will fail to address this complexity, and AGI will remain a distant dream rather than an imminent reality. Progress is possible, though, when we realize that our goal is not to solve all tasks, but just to make helpful progress on specific tasks. Ultimately, real breakthroughs will come from those willing to grapple with the full complexity of the challenge.
-
1. Who among us isn’t using AI assistance for coding? Personally, I run up against credit limits at least once a week. ↩