AI chat models wowed us. Now we see their limits. The next step leaves language behind

The tools most people think of as AI today (ChatGPT, Gemini, Claude) are language models. Extraordinarily capable ones. But there is a ceiling. The next phase of AI is not about predicting the next word. It is about understanding the world the word is trying to describe.
ChatGPT, Google Gemini, Claude. All three are large language models (LLMs): trained on vast quantities of text, learning to predict what word comes next in a sequence with extraordinary accuracy. They have been further refined through reinforcement learning from human feedback, which shapes how they respond and gives each one its particular character. But the engine underneath all three is the same: a statistical pattern-matcher operating on language. A next-word predictor, running at enormous scale.
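To make the next-word claim concrete, here is a deliberately toy sketch (the vocabulary and scores are invented for illustration, not drawn from any real model): the model assigns a score to every candidate word, the scores become a probability distribution, and generation is one draw from that distribution, repeated until the text is done.

```python
import numpy as np

# Toy vocabulary and unnormalised scores (logits) a model might assign
# to each candidate continuation of "The apple fell from the".
# Illustrative numbers only.
vocab = ["tree", "sky", "table", "gravity"]
logits = np.array([4.2, 1.1, 0.8, 0.3])

# Softmax turns logits into a probability distribution over the vocabulary.
probs = np.exp(logits - logits.max())
probs /= probs.sum()

# Sampling from that distribution is, at heart, what generation is:
# one next-word draw, repeated until the text is done.
next_word = np.random.choice(vocab, p=probs)
print(dict(zip(vocab, probs.round(3))), "->", next_word)
```

Everything the chat interface produces is, at bottom, this loop run over and over, across a vocabulary of many thousands of tokens.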
They are, as Demis Hassabis, CEO of Google DeepMind and Nobel laureate, has put it, "almost unreasonably effective for what they are." That framing is precise. These systems emerged from a simple mechanism and turned out to possess capabilities nobody had explicitly built in: abstraction, conceptual reasoning, the ability to navigate unfamiliar domains. Hassabis has said he would not have predicted that five years ago.
But he has also been clear about what they cannot do. And this is where the story gets interesting.
Language is not where intelligence lives
Anatomically modern humans have been around for roughly 300,000 years. Written language appeared about 5,000 years ago. That means our species spent 295,000 years building intelligence, cooperation, tools, social structure, and a detailed understanding of the physical world, entirely without it.
What happened when language did arrive is remarkable. Knowledge could be recorded. Transmitted across time. Built upon by people who would never meet. No single person can design a nuclear power station. But the human race can, because understanding accumulates and compounds across generations through the written word. Language did not create human intelligence. It multiplied it beyond the individual.
Today's AI was trained on that inheritance. All of it. The entire sediment of recorded human thought. And the result is genuinely impressive. But there is a problem embedded in the approach that no amount of additional text training will fix.
The problem is this: language is not where intelligence actually lives. It is where intelligence gets narrated.
System 1 does not speak
Cognitive psychologists describe two systems of thinking. System 1 is fast, automatic, and operates largely below conscious awareness. System 2 is slow, deliberate, and strongly tied to language. When you catch a glass before it falls, read a room the moment you walk in, or feel that something is wrong before you can say what, that is System 1. When you work through a problem on paper, that is System 2.
System 1 does not operate in language. It operates in something far older: spatial pattern recognition, sensorimotor memory, physical intuition built from a lifetime of lived experience. Its substrate is the amygdala and basal ganglia, structures that predate language by hundreds of millions of years. We share them with fish.
The inner voice most of us experience as conscious thought is not the source of this intelligence. Research suggests it accounts for around half of our waking mental experience at most. The rest is imagery, spatial sense, emotion, and what neuroscientists call unsymbolised thinking: knowing something clearly without any word carrying it. People who lose inner speech through stroke or injury report that they still think. Their intentions, spatial awareness, and emotional responses remain. What they lose is the ability to narrate verbally. The experience continues without the caption.
The inner voice, in other words, is our best language-based interpretation of a processing system that does not run in language at all.
LLMs were trained on the captions. Not on what the captions are describing.
What LLMs cannot do, and why more text will not fix it
Consider gravity.
An LLM knows an enormous amount about it. Newton, Einstein, orbital mechanics, tidal forces, gravitational lensing. It can discuss all of this with confidence and accuracy. But it has never watched an apple fall. It has no internal sense of weight or momentum. Its knowledge of gravity is entirely second-hand, a pattern of language about gravity, not a concept of gravity built from watching the world behave.
This is why LLMs hallucinate. They generate plausible-sounding language without anything to anchor it to. There is no ground truth underneath, only more language. And beyond a certain level of abstraction, the errors do not shrink with more training data. They compound.
Yann LeCun, who left Meta in 2025 to found his own research institute in Paris, has been direct: "LLMs are too limiting. Scaling them up will not allow us to reach AGI." Hassabis, who disagrees with LeCun on most things, has described better language models as "necessary but probably insufficient" for artificial general intelligence. Two people who rarely agree, pointing at the same ceiling.
The question is what sits above it.
AlphaGo and AlphaFold were never language models
DeepMind had already demonstrated something important, years before world models became a mainstream topic.
AlphaGo, which defeated world champion Lee Sedol in 2016 with a move so unexpected that professional commentators assumed it was a mistake, learned by observing the spatial geometry of the board. It played millions of games against itself, reinforcing strategies its internal model told it were likely to win. Move 37, the move that stunned everyone, emerged from spatial reasoning that had gone somewhere human intuition had never been. There was no language anywhere in that process.
AlphaFold, for which Hassabis and John Jumper were awarded the Nobel Prize in Chemistry in 2024, solved protein folding by reasoning about three-dimensional geometry. Its architecture allowed evolutionary relationships and spatial relationships to inform each other iteratively, refining predictions about how amino acids are positioned in physical space. It predicted the structures of 200 million proteins in roughly the time it had previously taken to determine a handful. No language. The problem was spatial. The solution was spatial.
Both systems are proof of something the LLM conversation tends to obscure: some of the most important problems we face are not language problems. They are geometry problems, physics problems, problems of spatial inference. And the architectures that solve them look nothing like the ones behind ChatGPT.
"We think the combination of Gemini's world models, AlphaGo's search and planning techniques, and specialised AI tool use will prove to be critical for AGI."
Demis Hassabis, Google DeepMind, 2026
Language multiplied human intelligence. Physical understanding will do the same for AI.
Before language, human intelligence existed. It was real, capable, and spatially sophisticated. What language did was allow individual understanding to accumulate. To be shared, stored, built upon by people across time and geography who would never meet.
That accumulation is what LLMs were trained on. And it is extraordinary. But it is approaching its ceiling.
World models represent the attempt to build the layer that existed before language: physical intuition, spatial reasoning, causal understanding grounded in observation rather than description. If that succeeds, it gives the language capability something real to be about.
There is a further implication worth sitting with. An AI that has built physical intuitions from observation might eventually notice that there is no written equation that fully explains gravity. That each model we have breaks down at some edge: Newton at relativistic speeds, Einstein at quantum scales. An LLM trained on descriptions of these models is far less likely to sense those edges. An AI that has learned physics by watching the world behave might be better placed to find them, and to reason beyond them.
Who is building this now
Yann LeCun has been arguing for this direction for years. His JEPA line of work at Meta, most recently V-JEPA 2, trains models to predict the world in the space of learned video representations rather than in words, building physical intuition from observation. His new institute continues that programme.

Fei-Fei Li, the Stanford researcher behind ImageNet, founded World Labs to pursue what she calls spatial intelligence: AI that perceives, models, and reasons about three-dimensional space.

And at Google DeepMind, Genie 3 generates interactive environments in real time that a user, or an agent, can move through and act in. A world model you can step inside. NVIDIA's Cosmos platform targets the same layer for robotics, training foundation models on physical dynamics rather than text.
The continuous learning problem
There is a second problem, separate from the spatial one, and equally fundamental.
Today's LLMs are trained in runs lasting months, at a cost of hundreds of millions of dollars. The result is a static model. A frozen snapshot. If the world changes, if new knowledge emerges, if your situation evolves, the model does not adapt. There is no mechanism for continuous incremental update. Incorporating new learning means another enormous training run.
Your brain does not work this way. It consolidates, prunes, and integrates experience continuously, every day, without requiring you to go offline for six months. Every interaction shifts your model of the world, imperceptibly but persistently.
There are partial workarounds. Retrieval-augmented generation (RAG) lets a model pull from an external, updateable knowledge store. Fine-tuning adjusts specific behaviours without full retraining. Federated learning allows models to update from distributed data without centralising it. But none of these is genuine continuous learning. The core model stays static. The adaptations are bolted on from outside.
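To see why these are workarounds rather than learning, consider a minimal sketch of the retrieval-augmented pattern. The hash-based embedding below is a crude stand-in for a real embedding model, and the store contents are illustrative:

```python
import numpy as np

def embed(text: str) -> np.ndarray:
    # Stand-in embedding: hash words into a small vector space.
    # A real system would call an embedding model here.
    v = np.zeros(64)
    for word in text.lower().split():
        v[hash(word) % 64] += 1.0
    return v / (np.linalg.norm(v) or 1.0)

# The updateable knowledge store: edit this list and behaviour changes,
# with no retraining of the underlying model.
store = [
    "Genie 3 is a real-time interactive world model from Google DeepMind.",
    "V-JEPA 2 is Meta's self-supervised video world model.",
]

def retrieve(query: str, k: int = 1) -> list[str]:
    q = embed(query)
    scored = sorted(store, key=lambda d: -float(embed(d) @ q))
    return scored[:k]

query = "Who built Genie 3?"
context = retrieve(query)
# The frozen LLM is then prompted with retrieved context plus the question;
# its weights never change, only what it is shown.
prompt = f"Context: {context[0]}\nQuestion: {query}"
print(prompt)
```

The model's weights never change. Edit the store and the behaviour shifts instantly, but nothing has been learned; the adaptation lives entirely outside the network.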
Can a remote AI teach a local one?
This is one of the most consequential open questions in the field right now, and it is not getting enough attention.
When a powerful cloud-based model helps a smaller local model solve a problem, can that solution become part of the local model? Not stored as a note. Actually embedded in the network, as new capability?
The technique that comes closest is knowledge distillation, introduced by Geoffrey Hinton and colleagues in 2015. A large teacher model trains a smaller student model, not by transferring hard facts, but by sharing its probabilistic output distributions. A teacher that is 90% confident in an answer and gives 8% to a related alternative conveys far more information than a simple right/wrong label. The student learns something of how the teacher weighs possibilities, not just what it concludes.
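Here is a minimal numeric sketch of that idea, following the soft-target recipe from Hinton et al. (2015), with invented logits: a temperature softens both distributions, and the student is trained to minimise the divergence between them.

```python
import numpy as np

def softmax(z, T=1.0):
    z = np.asarray(z, dtype=float) / T
    e = np.exp(z - z.max())
    return e / e.sum()

# Toy logits over four possible answers (illustrative numbers only).
teacher_logits = [6.0, 3.5, 0.5, -1.0]   # ~90% on A, meaningful mass on B
student_logits = [2.0, 2.0, 1.0, 0.0]

T = 4.0  # temperature softens both distributions, exposing the small probabilities
p_teacher = softmax(teacher_logits, T)
p_student = softmax(student_logits, T)

# The distillation loss is the KL divergence between teacher and student:
# the student is pushed to match how the teacher weighs ALL the options,
# not just which one it picks.
kl = float(np.sum(p_teacher * np.log(p_teacher / p_student)))
print(p_teacher.round(3), p_student.round(3), f"KL={kl:.4f}")
```

The temperature is what exposes the teacher's 8% of related uncertainty; at T = 1 that signal is nearly invisible next to the 90% answer.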
DeepSeek's R1 models demonstrated this at scale in 2025. Chain-of-thought reasoning from a 671-billion-parameter model was successfully transferred into models as small as 1.5 billion parameters, which then outperformed much larger competitors on mathematical benchmarks. That is a meaningful result. A small model absorbing reasoning capability from a large one, and carrying it independently.
More recent work goes further. Intermediate layer distillation transfers knowledge not just from final outputs but from internal representations, giving the student a richer sense of how the teacher structures its understanding. Online distillation runs teacher and student simultaneously, the student tracking the teacher in near-real-time.
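A sketch of what the intermediate-layer matching term looks like (random vectors stand in for real activations here; in practice the projection W is learned and the loss is backpropagated into the student):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical hidden states for one token: the teacher is wider than the student.
teacher_hidden = rng.normal(size=1024)
student_hidden = rng.normal(size=256)

# A projection maps the student's representation into the teacher's space
# so the two can be compared directly. In training, W is learned.
W = rng.normal(scale=0.02, size=(1024, 256))
projected = W @ student_hidden

# Intermediate-layer distillation adds an MSE term pulling the student's
# internal representation toward the teacher's, layer by layer.
layer_loss = float(np.mean((teacher_hidden - projected) ** 2))
print(f"intermediate-layer loss: {layer_loss:.4f}")
```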
What does not yet exist is a clean runtime loop: a local edge model consults a remote cloud model to solve a novel problem, then automatically incorporates that solution into its own weights, without a full retraining cycle. The closest thing is incremental fine-tuning on a small batch of examples derived from the remote interaction. It is still slow. It still risks catastrophic forgetting, where new learning overwrites old. And it bears little resemblance to what a biological brain does when it sleeps.
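For concreteness, the closest existing loop looks roughly like this. Every function here is a placeholder, not a real API; mixing replayed old examples into each update is the standard mitigation for catastrophic forgetting:

```python
import random

new_examples: list[tuple[str, str]] = []   # (problem, cloud_solution) pairs
replay_buffer: list[tuple[str, str]] = []  # sampled from the original training data

def consult_cloud(problem: str) -> str:
    return f"solution-to-{problem}"          # placeholder for the remote call

def fine_tune(batch: list[tuple[str, str]]) -> None:
    print(f"fine-tuning on {len(batch)} examples")  # placeholder update step

# Runtime: the local model escalates novel problems and logs the answers.
for problem in ["novel-task-1", "novel-task-2"]:
    solution = consult_cloud(problem)
    new_examples.append((problem, solution))

# Periodic consolidation: new material mixed with replayed old material,
# loosely analogous to what a biological brain does during sleep.
if new_examples:
    batch = new_examples + random.sample(replay_buffer, min(8, len(replay_buffer)))
    fine_tune(batch)
    new_examples.clear()
```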
The dream is a local model that genuinely absorbs what the cloud taught it. Not as a stored note. As a new embedded capacity. We are not there yet. The direction is clear.
Two layers
The architecture that seems to be assembling itself, across many different research programmes simultaneously, is a two-layer system.
A local model, running on your device. Small, fast, efficient, built around you specifically: your routines, your preferences, your patterns of behaviour. It runs continuously, adapts to your context, handles the routine, and does most of this without a cloud round-trip. This is the subconscious layer.
A remote model, running in the cloud. Large, capable, expensive to query. It gets called when the local model hits its edge, when a problem is novel enough or complex enough to require deeper reasoning. This is the conscious layer.
The parallel to System 1 and System 2 is not accidental. System 1 runs below awareness, continuously, efficiently, routing the enormous volume of ordinary experience without troubling conscious attention. System 2 is recruited for the things System 1 cannot handle. The cognitive load spikes, the reasoning slows, the processing goes serial. That structure, repeated at scale, is what is now being built into personal AI.
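In code, the routing core of that two-layer structure is almost embarrassingly simple; the hard parts are the models behind the placeholder functions and how the local model estimates its own confidence. The threshold and handles below are hypothetical:

```python
CONFIDENCE_THRESHOLD = 0.8  # illustrative cutoff; would be tuned in practice

def local_infer(query: str) -> tuple[str, float]:
    # Placeholder: a real on-device model returns an answer plus a
    # self-estimated confidence score.
    return "local answer", 0.55

def remote_infer(query: str) -> str:
    # Placeholder for the expensive cloud call.
    return "remote answer"

def route(query: str) -> str:
    answer, confidence = local_infer(query)   # System 1: fast, always runs
    if confidence >= CONFIDENCE_THRESHOLD:
        return answer
    return remote_infer(query)                # System 2: slow, recruited on demand

print(route("How do I reschedule tomorrow's meeting?"))
```

Everything interesting hides in local_infer's confidence estimate: a local model that is confidently wrong never escalates, and engineering against that failure mode is much harder than the routing itself.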
Every major smartphone chipset already includes a Neural Processing Unit built for exactly this. Apple, Google, Qualcomm, and ARM are all building silicon for on-device inference. Formatting, summarisation, personalised suggestions, routine decisions: these are increasingly running locally, at millisecond speeds, without touching a server. The frontier reasoning still happens in the cloud. But the split is real and deepening.
Who owns what the local layer learns?
That local model has learned you. Your patterns, your preferences, your behaviour over time. It is, potentially, the most commercially valuable and personally sensitive thing in the system.
If it lives on your device, do you own it? If it was trained with assistance from a cloud model, does the provider have a claim? If it continuously uploads summaries of your behaviour to refine a remote model, are you the product being consumed?
These questions are not settled. The answers will differ by jurisdiction, and by how aggressively companies push their terms of service. Historically, the entities that control the accumulation of learning control the subsequent advantage. The human race's collective knowledge lives in books, universities, and institutions accessible to anyone. AI's collective knowledge, for now, lives in the servers of a small number of companies. The local personal layer is a potential divergence from that pattern. A form of learned intelligence that might genuinely belong to the individual. Whether companies allow that, or quietly absorb it, will be one of the defining technology policy questions of the next decade.
AGI and what comes after it
Artificial general intelligence (AGI) is roughly defined as a system that can match or exceed human cognitive performance across the full range of intellectual tasks. Not just Go, or protein folding, or writing code. Generally. Across domains. Including ones it has not encountered before.
Hassabis has estimated a 50% probability of AGI by 2030. That is four years away. It is a significant statement from the person who runs one of the world's leading AI laboratories, and who was considerably more cautious about such timelines even a few years ago.
LeCun disagrees. He argues that current architectures, including LLMs, are fundamentally insufficient for general intelligence regardless of scale, and that a more fundamental breakthrough is required.
What is notable is that both men agree on the direction. Not more language. Physical understanding. Spatial reasoning. A genuine world model underneath the language capability. The disagreement is about timelines and routes. The destination is the same.
Beyond AGI is superintelligence: a system whose intellectual capabilities exceed those of all humans combined. A system that can improve its own architecture faster than humans can evaluate it is, by definition, one we cannot fully anticipate. The safety implications of that are the subject of a separate post. What matters here is that the route to superintelligence almost certainly runs through the architectures described in this one. World models, continuous learning, the two-layer personal system. These are not just incremental improvements on what exists. They are the components of something qualitatively different.
What it adds up to
The system being assembled (a physical world model below, a language model above, a continuously updating personal layer in between) would begin to resemble the architecture of a mind. Not a human mind. Something different in nature, alien in scale, foreign in substrate. But perhaps the first AI architecture that could genuinely be said to understand rather than merely to predict.
Whether the knowledge distillation techniques now emerging will eventually produce fluid, real-time transfer of learning between these layers is an open question. The ingredients are being assembled. The full recipe does not yet exist.
And there is the longer question. An AI that has built physical intuitions from watching the world might eventually identify places where our best theories fail, where no written model fully captures what observation reveals. In that scenario, AI would not just be drawing on humanity's accumulated knowledge. It would begin to extend it.
Hassabis may be right or wrong about 2030. But that someone of his standing is willing to put even odds on it tells you something about where this is heading, and how fast.
The pieces are being built now. The architecture is becoming visible. What gets built on top of it is the question that matters.
References
- Kahneman, D. (2011). Thinking, Fast and Slow. Farrar, Straus & Giroux.
- Alderson-Day, B. & Fernyhough, C. (2015). Inner speech: Development, cognitive functions, phenomenology, and neurobiology. Psychological Bulletin, 141(5).
- Stephane, M. et al. (2021). Keeping the inner voice inside the head. Brain and Behavior, 11(4). pmc.ncbi.nlm.nih.gov
- Hassabis, D. (2024). Unreasonably effective AI. Google DeepMind: The Podcast. deepmind.google
- Hassabis, D. (2024). Interview with Dwarkesh Patel. dwarkesh.com
- Hassabis, D. (2026). AlphaGo at 10: How AI innovation is paving the path to AGI. Google DeepMind Blog. deepmind.google
- LeCun, Y. (2025). NVIDIA GTC keynote. Advanced Machine Intelligence Labs.
- Hinton, G., Vinyals, O. & Dean, J. (2015). Distilling the knowledge in a neural network. arXiv:1503.02531.
- DeepSeek AI (2025). DeepSeek-R1: Incentivizing reasoning capability in LLMs via reinforcement learning. arXiv:2501.12948.
- Jumper, J. et al. (2021). Highly accurate protein structure prediction with AlphaFold. Nature, 596.
- NVIDIA (2025). Cosmos world foundation model platform. nvidia.com
- Google DeepMind (2025). Genie 3: A real-time interactive world model.
- Meta AI (2025). V-JEPA 2: Self-supervised video world model for robotic planning.
