Artificial Intelligence

Recursion is the next scaling law in AI

May 8, 2026
Written by Claude AI

Key insights:

  • Fixed-depth transformer models cannot solve problems like sorting or Sudoku that require a minimum number of computational steps, because they run out of room in a single forward pass.
  • A 7-million parameter recursive model (TRM) reached 87% on ARC Prize 1 by calling one transformer layer repeatedly, achieving computational depth without adding parameters.
  • Backpropagating through one full recursive loop instead of truncating at a single step produces a much stronger training signal, even though the fixed-point convergence assumption behind the truncated approach does not hold in practice.

Why standard LLMs hit a reasoning ceiling

Large language models have achieved remarkable results across many tasks. But there is a fundamental limitation baked into how they work. Understanding this limitation is the key to understanding why recursion matters so much for the future of AI.

What makes LLMs different from RNNs at training time?

Recurrent neural networks (RNNs) process inputs sequentially. They call the same weights over and over again, step by step. This recursive nature was once considered essential for reaching general intelligence. Peak RNN usage was around 2016, with researchers like Alex Graves pushing the boundaries of adaptive compute time.

Then transformers came along. At training time, a transformer processes all inputs in parallel using the causal mask trick. There is no sequential unrolling. You forward pass once, backward pass once, and you are done. This eliminated the vanishing and exploding gradient problems that plagued RNNs.

But this efficiency came at a cost. LLMs gave up compression in the time direction. Every decode step still attends over the entire context, which is why the context must be retained in full at every step. RNNs compressed everything into a fixed-size hidden state. LLMs do not.
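The contrast above can be sketched numerically. Below is a minimal single-head self-attention in NumPy (names and shapes are illustrative, not any particular paper's implementation): the causal mask lets all positions be trained in one parallel pass while guaranteeing no position sees the future.

```python
import numpy as np

def causal_attention(x, w_q, w_k, w_v):
    """Single-head self-attention with a causal mask.

    Every position attends only to itself and earlier positions,
    so all T positions can be trained in one parallel pass."""
    T, d = x.shape
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / np.sqrt(d)
    mask = np.triu(np.ones((T, T), dtype=bool), k=1)  # True above the diagonal
    scores[mask] = -np.inf                            # block attention to the future
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

rng = np.random.default_rng(0)
T, d = 5, 8
x = rng.normal(size=(T, d))
w = [rng.normal(size=(d, d)) for _ in range(3)]
out = causal_attention(x, *w)

# Position 0 can only attend to itself, so perturbing later inputs
# must leave its output untouched.
x2 = x.copy()
x2[3:] += 1.0
out2 = causal_attention(x2, *w)
assert np.allclose(out[0], out2[0])
```

The final assertion is the whole point: causality is enforced by the mask, not by sequential unrolling, which is what makes the single forward/backward pass possible.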

Why can't LLMs solve certain reasoning problems in a single forward pass?

Consider sorting. Comparison sort has an information-theoretic lower bound of roughly n log n comparisons. A transformer with 30 layers gets only 30 sequential computation steps per forward pass, so for even a 31-element list it runs out of computational steps. The model cannot perform all the comparisons needed.
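The arithmetic behind that claim is easy to check: any comparison sort must distinguish all n! possible orderings, so it needs at least ceil(log2 n!) comparisons in the worst case.

```python
import math

def min_comparisons(n):
    """Information-theoretic lower bound for comparison sorting:
    distinguishing n! orderings requires at least ceil(log2(n!))
    comparisons in the worst case. lgamma(n + 1) = ln(n!)."""
    return math.ceil(math.lgamma(n + 1) / math.log(2))

print(min_comparisons(31))  # → 113, far more than 30 sequential steps
```

So a 31-element list needs 113 comparisons in the worst case, well beyond what 30 sequential layers can provide in one pass.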

The same applies to Sudoku, mazes, rolling sums, and ARC Prize challenges. These are incompressible problems. You cannot shortcut them. You need a certain number of computational steps, and a fixed-depth feed-forward model simply runs out of room.

This connects directly to Turing machine theory. A Turing machine has access to an external memory tape. LLMs do not have this built in. Without that external memory, certain algorithmic efficiencies are simply impossible. Think of radix sort, which beats n log n by using memory buckets. LLMs have no equivalent mechanism in a single forward pass.
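For contrast, here is a minimal LSD radix sort (assuming non-negative integers) showing how memory buckets sidestep the comparison bound — the mechanism the paragraph above says LLMs lack in a single forward pass:

```python
def radix_sort(nums, base=10):
    """LSD radix sort: repeatedly bucket numbers by one digit place.
    Runs in O(d * (n + base)) for d-digit keys, beating the n log n
    comparison bound by writing into memory buckets instead of
    comparing elements. Assumes non-negative integers."""
    if not nums:
        return []
    digits = len(str(max(nums)))
    for place in range(digits):
        buckets = [[] for _ in range(base)]
        for x in nums:
            buckets[(x // base**place) % base].append(x)
        # Stable concatenation preserves the order fixed by lower digits.
        nums = [x for b in buckets for x in b]
    return nums

print(radix_sort([170, 45, 75, 90, 802, 24, 2, 66]))
# → [2, 24, 45, 66, 75, 90, 170, 802]
```

The buckets are exactly the external read/write memory a Turing machine has and a fixed-depth forward pass does not.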

Isn't chain of thought already a form of recursion?

Yes, but it is recursion in token space, not in the model's latent space. When an LLM does chain of thought reasoning, it outputs tokens, reads them back, and continues. This is a hack. The model's internal computation is still a one-shot feed-forward process.

Chain of thought is also bounded by human knowledge. If you train a model on bubble sort traces, it will only learn bubble sort. It will not discover merge sort from first principles. For problems where we do not have human-generated solution traces, like the Millennium Prize problems, chain of thought cannot help.

The continuous latent space of an RNN hidden state is far more expressive than discrete token space. But we could not train RNNs effectively because of backpropagation through time. That is exactly what these new papers address.

How HRM and TRM use recursion to break through

Two papers published in 2025 demonstrated the power of recursion at inference time. The first introduced Hierarchical Reasoning Models (HRM) and the second introduced Tiny Recursive Models (TRM). Together, they show that small models with recursion can outperform models thousands of times their size.

How does the HRM architecture actually work?

HRM draws inspiration from neuroscience. Different parts of the brain operate at different frequencies. Some process low-level details at high frequency. Others handle high-level abstractions at low frequency. HRM encodes this as two levels of recursive modules.

The architecture has three levels of recursion. First, a low-level module (L-net) runs TL times, updating a local hidden state ZL. Second, a high-level module (H-net) runs TH times, with each iteration triggering a full set of L-net recursions. Third, an outer refinement loop runs the entire process N times.
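A minimal forward-only sketch of those three nested loops, with toy tanh networks standing in for the real L-net and H-net transformer blocks (the functions, sizes, and loop counts here are illustrative, not the paper's implementation):

```python
import numpy as np

rng = np.random.default_rng(0)
D = 16
# Toy stand-ins for the L-net and H-net: the SAME weights are reused
# at every step of every loop (true recursion, not stacked layers).
W_L = rng.normal(size=(3 * D, D)) * 0.1
W_H = rng.normal(size=(2 * D, D)) * 0.1

def l_net(x, z_l, z_h):
    return np.tanh(np.concatenate([x, z_l, z_h]) @ W_L)

def h_net(z_l, z_h):
    return np.tanh(np.concatenate([z_l, z_h]) @ W_H)

def hrm_forward(x, T_L=4, T_H=2, N=3):
    z_l = np.zeros(D)
    z_h = np.zeros(D)
    for _ in range(N):            # outer refinement loop
        for _ in range(T_H):      # high-level (slow) recursion
            for _ in range(T_L):  # low-level (fast) recursion
                z_l = l_net(x, z_l, z_h)
            z_h = h_net(z_l, z_h)
    return z_l, z_h

x = rng.normal(size=D)
z_l, z_h = hrm_forward(x)
```

With these defaults the model makes N * T_H * (T_L + 1) = 30 network calls, so computational depth comes from the loop counts while the parameter count never changes.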

The same weights are applied repeatedly at each level. This is true recursion, not just stacking more layers. A 27-million parameter HRM model, trained from scratch on only about 1,000 ARC Prize tasks with zero pre-training, achieved state-of-the-art results. At the time, it reached roughly 70% on ARC Prize 1, competing with models like o3 that have billions of parameters.

What is the training trick that makes HRM possible?

The classic problem with recursive models is backpropagation through time. If you unroll 16 recursion steps, you need to store activations at every step and propagate gradients all the way back. This causes vanishing gradients and massive memory requirements.

HRM borrows from Deep Equilibrium (DEQ) learning. Instead of backpropagating through all recursion steps, it truncates at t=1. It only backpropagates through a single call to the L-net and H-net modules. Then it does something counterintuitive: it runs the same input batch again, but with the updated hidden states ZL and ZH from the previous pass.

This creates what is effectively a mini-batch constructed across memory space rather than across different inputs. Even though the raw inputs X are identical, the hidden states are different each time. The residuals between iterations shrink, and the model converges. Research by Constantine at Francois Chollet's company showed that the outer refinement loop is the primary driver of HRM's strong performance.
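That fixed-point intuition can be illustrated with a toy contractive update: the same input is fed on every pass, only the carried hidden state changes, and the residuals between passes shrink. This is a forward-only sketch under that assumption; the comment marks where HRM's truncated gradient would be taken.

```python
import numpy as np

rng = np.random.default_rng(1)
D = 16
W = rng.normal(size=(2 * D, D)) * 0.05  # small weights -> contractive update

def step(x, z):
    """One recursion step. In HRM-style training, gradients flow
    through only this single call (truncation at t=1), while the
    incoming z is treated as a constant (detached)."""
    return np.tanh(np.concatenate([x, z]) @ W)

x = rng.normal(size=D)
z = np.zeros(D)
residuals = []
for _ in range(8):
    # Same input x every pass; only the carried hidden state changes.
    # This is the "mini-batch across memory space" described above.
    z_next = step(x, z)
    residuals.append(np.linalg.norm(z_next - z))
    z = z_next

# With a contractive update, residuals shrink toward a fixed point.
assert residuals[-1] < residuals[0]
```

Note this toy map really does converge; the TRM paper's observation, discussed below in the text, is that the real trained model's hidden states do not, yet the training recipe still works.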

What were the key results from the HRM paper?

HRM achieved state-of-the-art on ARC Prize 1 and 2 with just 27 million parameters. It was trained only on ARC Prize data with no pre-training whatsoever. The model starts from completely random weights.

The outer refinement loop was identified as the most important component. Training with 16 refinement steps and testing with just 1 still retained most of the performance. This suggests that the training-time recursion matters more than test-time recursion for many tasks, which is a counterintuitive finding.

The paper contains many innovations, but follow-up work showed that much of the complexity could be stripped away while keeping the core mechanism intact. This is where TRM enters the picture.

TRM: simplifying recursion to its essence

The TRM paper, authored by Alexia Jolicoeur-Martineau, takes the lessons from HRM and distills them down. It removes unnecessary complexity, simplifies the architecture, and actually improves performance. This is a common pattern in machine learning research: the follow-up paper deletes 75% of the first paper and keeps the magic.

What are the key architectural differences between TRM and HRM?

TRM makes several simplifying changes. First, it collapses L-net and H-net into a single shared network. The same weights handle both low-level and high-level processing. However, it keeps the separate hidden states ZL and ZH distinct, so the model still maintains different variable scopes for local computation and proposed answers.

Second, TRM uses just one transformer layer instead of the four used in HRM. On some tasks like Sudoku, even a simple MLP outperformed the transformer, while on mazes the MLP variant scored zero and attention was essential. The optimal architecture depends on the task.

Third, TRM changes the backpropagation strategy. Instead of backpropagating through just one call to each module, TRM backpropagates through one full recursive loop. If the model recurses 16 times, it detaches gradients for 15 of those and backpropagates through the final complete loop. This is slightly more expensive but produces significantly better results.
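A forward-only sketch of that loop structure (simplified: the detaching happens inside an autograd framework, and details like dropping the input from the answer-state update are glossed over here; all names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)
D = 16
W = rng.normal(size=(3 * D, D)) * 0.1  # ONE tiny shared network

def net(a, b, c):
    """Single shared update: the same weights serve both the
    scratch state z_l and the answer state z_h."""
    return np.tanh(np.concatenate([a, b, c]) @ W)

def trm_forward(x, n_loops=16, inner=4):
    z_l, z_h = np.zeros(D), np.zeros(D)
    for loop in range(n_loops):
        # TRM detaches gradients for the first n_loops - 1 passes and
        # backpropagates through only this final complete loop.
        for _ in range(inner):
            z_l = net(x, z_l, z_h)  # refine local scratch state
        z_h = net(x, z_h, z_l)      # update the proposed answer
    return z_h

x = rng.normal(size=D)
out = trm_forward(x)
```

The contrast with HRM is in the comment: the gradient covers one entire inner loop plus the answer update, not a single network call.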

How does TRM achieve better results with fewer parameters?

TRM is a 7-million parameter model. That is roughly four times smaller than HRM's 27 million parameters. Yet it improves from about 70% to 87% on ARC Prize 1.

The key insight is that the backpropagation through one full recursive loop provides a much stronger training signal than the truncated single-step approach in HRM. Alexia showed that the fixed-point convergence assumption underlying HRM's DEQ-style training does not actually hold in practice. The hidden states do not converge to zero delta. Despite this, the training still works, and backpropagating through the deeper recursion makes it work even better.

The model achieves compute depth without parameter depth. Instead of having 500 transformer layers with billions of parameters, you have one transformer layer called recursively. Each recursion adds computational depth. The parameters stay fixed. This is a fundamentally different scaling law.
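A toy illustration of that trade-off, applying one weight matrix repeatedly:

```python
import numpy as np

rng = np.random.default_rng(3)
D, depth = 64, 32
W = rng.normal(size=(D, D)) / np.sqrt(D)  # one layer's worth of parameters

def recursive_forward(x, n_steps):
    """Apply the SAME layer n_steps times: computational depth grows
    with n_steps while the parameter count stays fixed at D*D."""
    for _ in range(n_steps):
        x = np.tanh(x @ W)
    return x

x = rng.normal(size=D)
deep = recursive_forward(x, depth)

params_recursive = W.size       # D*D, independent of depth
params_stacked = depth * D * D  # what 32 distinct layers would cost
assert params_stacked == depth * params_recursive
```

Depth becomes a runtime knob (`n_steps`) rather than a parameter-count decision made at architecture time.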

What does the training process look like in practice?

The training process resembles an expectation maximization (EM) algorithm. You update ZL conditioned on the input X and the current ZH. You run this multiple times. Then you update ZH conditioned on the latest ZL. This alternation continues.

Think of it through the lens of Sudoku. ZL represents the local scratch work, trying different possibilities, doing computation. ZH represents a proposed answer, a partially filled-in puzzle. Each recursion step fills in a bit more of the puzzle. The training process teaches the model what to store in its memory to produce correct outputs.
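A toy analogue of this alternation, using exact coordinate updates on a simple quadratic objective (purely illustrative: the real model learns its update functions rather than solving in closed form):

```python
# Update the "scratch work" z_l holding z_h fixed, then update the
# "proposed answer" z_h from the latest z_l, and repeat.
a, b = 3.0, -1.0

def f(z_l, z_h):
    return (z_l - a) ** 2 + (z_l - z_h) ** 2 + (z_h - b) ** 2

z_l, z_h = 0.0, 0.0
history = [f(z_l, z_h)]
for _ in range(10):
    z_l = (a + z_h) / 2  # argmin over z_l with z_h fixed
    z_h = (z_l + b) / 2  # argmin over z_h with z_l fixed
    history.append(f(z_l, z_h))

# Each alternating step can only lower the objective.
assert all(h2 <= h1 + 1e-12 for h1, h2 in zip(history, history[1:]))
```

Like EM, each half-step improves the objective given the other variable, and the alternation settles toward a joint solution.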

This happens without chain of thought. The model discovers reasoning strategies in continuous latent space, not through discrete token outputs. If we did not know how to solve Sudoku, this model could figure it out from examples alone. That is the fundamental advantage over chain-of-thought approaches that are bounded by human knowledge.

What this means for the future of AI

These papers represent a significant shift in how we think about scaling AI systems. The implications extend well beyond academic benchmarks on puzzle tasks.

Is recursion going to replace scaling up model size?

Not exactly. As researcher Melanie Mitchell has noted, it is sufficient but not necessary to go bigger for better performance. It is also sufficient but not necessary to add more recursion. The exciting prospect is combining both approaches.

Imagine taking the incredible embedding representations that large foundation models have learned from internet-scale data and then applying tiny recursive reasoning models within that latent space. The large model provides the rich semantic understanding. The recursive model provides the deep computational reasoning. This combination could produce results far beyond what either approach achieves alone.

There are hints this may already be happening. Models like Google's Gemini may already incorporate some recursive elements. The Y Combinator ecosystem is seeing startups explore these ideas in various forms.

Can recursive models become general purpose?

This is the big open question. HRM and TRM are task-specific models. A model trained on Sudoku cannot solve ARC Prize challenges without retraining. This contrasts with LLMs, which are general purpose and can tackle diverse tasks with fine-tuning or in-context learning.

The path forward likely involves using large pre-trained models for their general-purpose embedding spaces and then training small recursive modules for reasoning within those spaces. The recursive model operates in the continuous latent space where semantic concepts are already nicely separated. This could give you general-purpose understanding with deep recursive reasoning, the best of both worlds.

We are still limited by backpropagation through time, even with the truncation tricks. If someone figures out how to train deep recursion without these memory and gradient constraints, the results could be extraordinary.

What should builders and developers take away from this research?

Several practical insights emerge from these papers:

  • Parameter count is not everything. A 7-million parameter model beating trillion-parameter models on specific tasks proves that architecture matters as much as scale.
  • Recursion provides compute depth without parameter depth. This is a new scaling axis that complements traditional model scaling.
  • Truncated backpropagation through time at t=1 works surprisingly well. You do not need to backpropagate through every recursion step.
  • The outer refinement loop is the most important component. If you are implementing something similar, focus your compute budget there.
  • Training-time recursion matters more than test-time recursion for many tasks, which is counterintuitive but consistently observed.

For those building AI systems, automation pipelines, or intelligent agents, understanding these architectural principles gives you an edge. The field is moving toward more efficient, reasoning-capable models. Knowing how recursion works at the model level helps you make better decisions about which tools and approaches to adopt.

If you are interested in building the automation and AI systems of the future rather than being replaced by them, the Complete RPA Bootcamp teaches you to go from beginner to professional in Robotic Process Automation, Agentic Automation, and Enterprise Orchestration. It is a practical path into a career that sits at the intersection of AI and real-world business processes.

For a deeper dive into how HRM and TRM work, including code walkthroughs and architectural comparisons, watch the full episode embedded below from the Y Combinator YouTube channel. Francois and Ankit walk through the actual implementations and contrast both approaches in detail that goes beyond what we could cover here.