Dec 19, 2025 · 6 min read

Scaling Laws and AGI

What scaling laws do—and do not—tell us about general intelligence

LLM · Scaling Laws · AGI · Objectives

Over the past few years, scaling laws and AGI have become two of the most charged terms in AI. Vast amounts of capital and talent have been devoted to scaling language models under the implicit belief that sufficient scale will eventually yield general intelligence.

Recently, however, even some strong proponents of scaling have begun to express doubts. For example, Ilya Sutskever has publicly questioned whether scaling alone is enough to reach AGI. This raises a natural set of questions:

  • Are scaling laws breaking down?
  • Are we simply hitting diminishing returns?
  • Or have we been overly confident about what scaling actually buys us?

My view is that scaling laws remain largely valid—but that optimizing next-token prediction loss alone may not be sufficient to justify confidence in achieving general intelligence.


What do we mean by AGI?

Before discussing scaling, it helps to clarify what I mean by artificial general intelligence.

Here I will use a common operational definition:

An AGI is a system that can perform any task that a smart human can perform, given appropriate training.

This definition emphasizes task execution, not just answering questions. Many human tasks—such as scientific research, long-term planning, or learning a new domain from scratch—are not single prompt–response problems. They require deciding what to work on, decomposing vague goals into subproblems, revising plans over time, and persisting under uncertainty.

This distinction matters because a purely reactive "oracle" system—one that only responds when prompted—may still fall short. While such a system could provide high-quality answers, some human tasks are inherently self-directed. If a human were forced to behave purely as an oracle, only responding to questions chosen by someone else, they would not be able to perform those tasks either.

This definition does not rule out language models in principle. A sufficiently capable model, embedded in an agentic loop with memory, tools, and interaction, could still meet it. The question is whether the training objective we currently rely on provides sufficient pressure toward such capabilities.
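To make "embedded in an agentic loop" a bit more concrete, here is a hypothetical sketch of such a scaffold. Everything in it—`call_model`, `run_tool`, and the `memory` object—is a placeholder of my own, not any particular framework's API:

```python
def agent_loop(goal, call_model, run_tool, memory, max_steps=50):
    """Hypothetical outer loop wrapping a prompt-response model.

    The model itself still only maps text to text; goal decomposition,
    memory, and tool use come entirely from this surrounding scaffold.
    """
    # Ask the model to break the vague goal into an initial plan.
    plan = call_model(f"Break this goal into concrete steps: {goal}")
    memory.store("plan", plan)

    for step in range(max_steps):
        context = memory.recall(goal)
        action = call_model(f"Goal: {goal}\nContext: {context}\nNext action?")
        if action.strip() == "DONE":
            break
        # Execute the proposed action and record what happened.
        observation = run_tool(action)
        memory.store(f"step_{step}", observation)

    return memory.recall(goal)
```

The open question in this post is whether next-token training alone produces a model that behaves well inside such a loop, not whether the loop can be built.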


What scaling laws for LLMs actually say

Scaling laws for language models are empirical regularities, not theoretically derived laws. Like Moore’s law, they summarize observed trends rather than establish fundamental guarantees.

Work such as Scaling Laws for Neural Language Models (Kaplan et al., 2020) and Training Compute-Optimal Large Language Models (Hoffmann et al., 2022, often referred to as the “Chinchilla” paper) shows that training loss decreases predictably as we scale model size, data size, and compute.

A simplified form often used to describe this behavior is:

L(N, D) \approx L_{\infty} + a N^{-\alpha} + b D^{-\beta}

Where:

  • N is model size
  • D is dataset size
  • L_∞ is an irreducible loss floor
  • a, b depend on architecture, training setup, and data quality
  • α, β are empirically fitted constants

The key takeaway is straightforward: as long as we increase scale, training loss continues to decrease in a smooth and predictable way, albeit with diminishing returns.
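To see the shape of that curve, here is a minimal sketch that evaluates the fitted form above. The default constants are illustrative, roughly in the ballpark of published Chinchilla-style fits; any real use would re-fit them to training runs for a given setup:

```python
def scaling_law_loss(N, D, L_inf=1.69, a=406.4, alpha=0.34, b=410.7, beta=0.28):
    """Evaluate L(N, D) ≈ L_inf + a·N^(-alpha) + b·D^(-beta).

    Constants here are illustrative placeholders, not authoritative values.
    """
    return L_inf + a * N ** (-alpha) + b * D ** (-beta)

# Scaling parameters and tokens by 10x per step keeps lowering the loss,
# but each step buys less improvement than the one before it.
for N, D in [(1e8, 2e9), (1e9, 2e10), (1e10, 2e11), (1e11, 2e12)]:
    print(f"N={N:.0e}, D={D:.0e} -> predicted loss ≈ {scaling_law_loss(N, D):.3f}")
```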

This empirical success is what motivated the belief that scaling could eventually lead to AGI.


Why scaling loss is treated as a proxy for intelligence

The implicit reasoning behind the scaling-to-AGI argument usually goes something like this:

Human language encodes knowledge, reasoning, planning, and abstractions about the world. To predict human language well, a model must internalize many of the same structures humans rely on. Therefore, sufficiently good next-token prediction should imply general intelligence.

There is real force to this argument. Modern language models exhibit emergent capabilities such as in-context learning, multi-step reasoning, and solving problems that do not appear verbatim in their training data (Wei et al., 2022). These behaviors are not trivial, and they demonstrate that scaling next-token prediction goes far beyond surface-level mimicry.

But the step from impressive emergent behavior to general intelligence is still an inference—one that scaling laws themselves do not establish.


Prediction versus task execution

Next-token prediction is an extraordinarily powerful objective. It encourages models to learn:

  • linguistic fluency
  • factual regularities
  • statistical structure of the world as reflected in text
  • patterns of human reasoning

What it does not directly optimize for is:

  • initiating goals
  • deciding what to explore
  • managing long-horizon uncertainty
  • decomposing ill-defined tasks
  • persisting without external prompting

As a result, a model can become extremely good at predicting the outputs of intelligent agents without being able to perform the same class of tasks those agents perform.
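For concreteness, the quantity that scaling laws track is just token-level cross-entropy on static text. A minimal sketch of that objective, assuming PyTorch and treating the model outputs as given:

```python
import torch.nn.functional as F

def next_token_loss(logits, token_ids):
    """Average cross-entropy of predicting each token from its prefix.

    logits:    (batch, seq_len, vocab_size) model outputs
    token_ids: (batch, seq_len) input token ids

    This is the entire pretraining signal: how well the next token is
    predicted on fixed text. Nothing in it rewards choosing goals,
    exploring, or acting over long horizons.
    """
    # Shift so that position t is scored on predicting token t+1.
    pred = logits[:, :-1, :].reshape(-1, logits.size(-1))
    target = token_ids[:, 1:].reshape(-1)
    return F.cross_entropy(pred, target)
```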

The distinction here is not metaphysical. It is behavioral and operational. A system trained purely on static prediction may reason well when prompted, yet still fail under distribution shift, adversarial framing, or open-ended task settings where success criteria are not predefined.

Under the task-based definition of AGI above, this gap matters.


Why scaling laws can hold without guaranteeing AGI

Scaling laws tell us how performance improves when we optimize a particular loss function. They do not tell us whether that loss function is aligned with general intelligence.

It is therefore possible for training loss to decrease indefinitely while:

  • behavior remains reactive
  • self-directed learning does not emerge
  • long-horizon task competence remains brittle

In that sense, scaling laws may describe a path toward increasingly accurate simulations of intelligent behavior, without guaranteeing the ability to perform the full range of tasks humans can.

To be clear, this is not a claim that scaling cannot lead to AGI—only that scaling laws alone do not justify confidence that it will.


What might come next

If progress in machine learning is driven largely by objective functions, then the limitation may not be scale itself, but what we choose to optimize.

Several directions attempt to move beyond pure next-token prediction, including reinforcement learning from human feedback (Ouyang et al., 2022), constitutional AI (Bai et al., 2022), world-model-based training, and embodied or interactive learning. Each introduces important ingredients, yet most still rely on externally defined rewards or human preferences, and it remains unclear whether they provide a scalable objective for open-ended intelligence rather than increasingly refined imitation.

The uncomfortable reality is that we do not yet know what a loss function for general intelligence should look like. How should a system be incentivized to:

  • choose its own goals
  • explore efficiently
  • build and revise internal models
  • act robustly under uncertainty

These remain open problems.


Closing thought

Scaling laws describe how well we can optimize next-token prediction. They tell us nothing definitive about whether next-token prediction alone is sufficient for general intelligence.

The difference matters. If we have been climbing the wrong hill—one that plateaus before reaching AGI—no amount of additional scaling will close the gap. And if that is the case, the sooner we acknowledge the possibility, the sooner we can search for better objectives.