Artificial General Intelligence (AGI) means something different to everyone who defines it. Every month some hyperscaler claims we have reached consciousness or AGI, but how can we define artificial general intelligence if we cannot even define intelligence in the first place?

There is a body of literature in psychology and philosophy (Spearman, 1904; Thurstone, 1938, to name but a few) that has, over the years, converged on the conclusion that human intelligence is multifaceted. Why, then, would artificial intelligence be any different? For example, I’m decent at mathematics and computer science, but horrendous at chemistry.

This raises the question: how do we evaluate it empirically? For humans, we test abilities from primary school to university across many domains. These range from the general (most educated members of society should be knowledgeable in history, science, language, etc.) to the specific (being able to solve a math problem or write a poem).

For LLMs we have benchmark datasets, with more published every year.

Some of the most popular benchmarks include GLUE, SuperGLUE, MMLU, BIG-bench, GSM8K, HumanEval, MBPP, TruthfulQA, and HellaSwag.

These benchmarks, too, are designed to probe different dimensions of intelligence — from general language understanding to domain-specific reasoning and code generation.

Towards a better definition

While plotting humans on a graph feels ethically uncomfortable, plotting AI systems across different dimensions of intelligence is a useful way to visualise their capabilities and compare them to one another.

None of this is novel — we already have benchmarks. What I find lacking is a unified way to define AGI across all of these dimensions. Such a definition would give us a principled answer to the question: how close to AGI is a given system?

Consider a 2-dimensional graph where each axis represents a domain of intelligence (e.g. mathematics, reasoning, language, knowledge, …).

The point at the top-right corner — scoring maximally on every axis — represents the theoretical ceiling: a system that performs optimally across all dimensions.

In $N$ dimensions the same idea holds: AGI is the vertex at $(1, 1, \dots, 1)$ in an $N$-dimensional unit hypercube.

This is our hypothetical most intelligent system — the system that can solve any problem, across any domain, with the highest accuracy.

Let us say we have $N$ task domains. Assigning each domain to its own axis, we end up with a graph over $N$ dimensions.

As a result, any AI system — architecture-agnostic, whether a GPT model (Radford et al., 2018), an energy-based model (LeCun et al., 2006), or otherwise — can be empirically evaluated across these dimensions and plotted as a point in this $N$-dimensional space.

Ideally, we would want a single metric that captures how close a given system is to that top-right corner — a scalar between $0$ and $1$, where $1$ represents the hypothetical most intelligent system across all dimensions.

Why not, then, define such a measure? A score of $0.0$ would denote no capability whatsoever, and $1.0$ would denote perfect artificial general intelligence.

To reduce that down to one dimension, we can project every system onto the main diagonal of the $N$-dimensional unit hypercube — the line from the origin $(0, 0, \dots, 0)$ to the AGI point $(1, 1, \dots, 1)$. The scalar projection of a system’s score vector $\mathbf{d} = (d_1, d_2, \dots, d_N)$ onto the unit diagonal $\frac{1}{\sqrt{N}}\mathbf{1}$ is $\frac{1}{\sqrt{N}}\sum_i d_i$; dividing by the diagonal’s length $\sqrt{N}$ normalises this to:

$$s = \frac{1}{N}\sum_{i=1}^{N} d_i$$

This is simply the arithmetic mean of the dimension scores. It satisfies our requirements: AGI at $(1, 1, \dots, 1)$ maps to $1.0$, the origin maps to $0.0$, and every dimension carries equal weight. Any system in between receives a score that linearly reflects its average position across the space.
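As a minimal sketch in Python (the domain names and scores below are hypothetical, not real benchmark results), the whole computation is a single mean:

```python
import numpy as np

# Hypothetical per-domain scores, each normalised to [0, 1].
# The domains and values below are illustrative, not benchmark results.
domain_scores = {
    "mathematics": 0.81,
    "reasoning":   0.74,
    "language":    0.92,
    "knowledge":   0.65,
}

d = np.array(list(domain_scores.values()))

# Normalised projection onto the main diagonal = arithmetic mean.
s = d.mean()
print(f"AGI score: {s:.3f}")  # AGI score: 0.780
```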

Not all facets of intelligence are tied to a single domain. Cattell’s distinction between fluid and crystallised intelligence (Cattell, 1963) implies that learning velocity — how quickly a system acquires a new capability — is a core facet of intelligence in its own right. Likewise, power efficiency — the computational cost of reaching a given performance level — matters: two systems with identical accuracy are not equally intelligent if one requires orders of magnitude more energy.

We do not create additional axes for these properties because they are not capability domains in themselves. Instead, we introduce $M$ domain-agnostic scalars $\lambda_1, \lambda_2, \dots, \lambda_M$, each normalised to $[0, 1]$.

The question is how to fold them into the AGI score while preserving our boundary condition: $s = 1$ if and only if the system is maximally capable across every domain and every agnostic scalar.

To do that we simply extend the score vector. Define:

$$\mathbf{v} = (d_1, \dots, d_N, \lambda_1, \dots, \lambda_M)$$

This vector lives in an $(N + M)$-dimensional unit hypercube. The AGI vertex is still $(1, 1, \dots, 1)$, now with $N + M$ ones. Projecting onto the main diagonal exactly as before gives:

$$s = \frac{1}{N + M}\left(\sum_{i=1}^{N} d_i \;+\; \sum_{j=1}^{M} \lambda_j\right)$$

All the same properties hold: $s = 1$ at the AGI vertex, $s = 0$ at the origin, and $s \in [0, 1]$ everywhere. Every component — whether a domain score or an agnostic scalar — contributes equally.

If we want to keep the two groups conceptually separate, we can rewrite this equivalently as a weighted combination of their respective means:

$$s = \frac{N}{N+M}\,\bar{d} \;+\; \frac{M}{N+M}\,\bar{\lambda}$$

where $\bar{d} = \frac{1}{N}\sum_{i=1}^{N} d_i$ is the domain mean and $\bar{\lambda} = \frac{1}{M}\sum_{j=1}^{M} \lambda_j$ is the agnostic-scalar mean. The weights are simply the proportion of components in each group — no free parameters, no arbitrary choices.
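Here is a sketch of the extended score, again with hypothetical values, verifying that the flat mean over the concatenated vector agrees with the weighted combination of group means:

```python
import numpy as np

# Illustrative values, all normalised to [0, 1].
d = np.array([0.81, 0.74, 0.92, 0.65])  # N = 4 domain scores
lam = np.array([0.40, 0.55])            # M = 2 agnostic scalars, e.g.
                                        # learning velocity, power efficiency
N, M = len(d), len(lam)

# Form 1: mean over the concatenated (N + M)-dimensional vector v.
v = np.concatenate([d, lam])
s_flat = v.mean()

# Form 2: weighted combination of the domain mean and the scalar mean.
s_weighted = N / (N + M) * d.mean() + M / (N + M) * lam.mean()

assert np.isclose(s_flat, s_weighted)
print(f"s = {s_flat:.3f}")  # s = 0.678
```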

Can you reach AGI by chance?

A natural objection to any scoring framework is: couldn’t a system stumble into a high AGI score by randomly performing well on some axes? The answer turns out to be a resounding no — and the mathematics behind this actually strengthens the case for the framework.

Suppose each of the $K = N + M$ components is drawn independently from $\text{Uniform}(0, 1)$. The AGI score $s = \frac{1}{K}\sum_i v_i$ is then the sample mean of $K$ i.i.d. uniform random variables:

  • $\mathbb{E}[s] = 0.5$, always, regardless of $K$
  • $\text{SD}(s) = \frac{1}{\sqrt{12K}}$

By the Central Limit Theorem, for large $K$:

$$s \approx \mathcal{N}\!\left(0.5,\; \frac{1}{12K}\right)$$

The probability of randomly achieving $s \geq t$, for some AGI threshold $t$ (say $0.9$), is:

$$P(s \geq t) \approx 1 - \Phi\!\left((t - 0.5)\sqrt{12K}\right)$$

For $t = 0.9$: with $K = 3$ dimensions the probability is roughly $0.82\%$; at $K = 10$ it drops to $0.0006\%$; by $K = 50$ or more it is effectively zero. The expected score is always $0.5$, and as dimensions are added the variance shrinks, concentrating the random score ever more tightly around that midpoint. This is the concentration of measure phenomenon in high-dimensional spaces.
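As a sanity check, here is a small simulation sketch using the same threshold and dimension counts. Note that the normal approximation is loose in the tail for small $K$, so the percentages above are best read as order-of-magnitude figures; the Monte Carlo estimate is the more faithful number:

```python
import math
import numpy as np

def phi(x: float) -> float:
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

rng = np.random.default_rng(0)
t = 0.9  # hypothetical AGI threshold

for K in (3, 10, 50):
    # Tail probability under the normal (CLT) approximation.
    p_clt = 1.0 - phi((t - 0.5) * math.sqrt(12 * K))

    # Monte Carlo: sample means of K i.i.d. Uniform(0, 1) draws.
    means = rng.uniform(size=(200_000, K)).mean(axis=1)
    p_mc = (means >= t).mean()

    print(f"K={K:>2}  CLT: {p_clt:.2e}  MC: {p_mc:.2e}")
```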

The implication is threefold. First, AGI cannot be fluked — systematically high performance across all axes is required; luck will not get you there. Second, more dimensions means more robustness — each axis added makes the definition harder to satisfy by chance, so a high score becomes increasingly meaningful. Third, lopsided systems are penalised — a system that excels in a few random areas but fails in others will still land near $0.5$.

If we adopt an even stricter criterion — requiring every dimension to score $\geq t$ — the probability decays exponentially: $P(\min_i v_i \geq t) = (1 - t)^K$. At $t = 0.9$ and $K = 10$, that gives $0.1^{10} = 10^{-10}$.
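Tabulating this decay (a quick sketch with the same illustrative $t$ and $K$ values):

```python
# P(min(v_i) >= t) = (1 - t)^K: every component must clear the threshold.
t = 0.9
for K in (3, 10, 50):
    print(f"K={K:>2}  P(all components >= {t}) = {(1 - t) ** K:.1e}")
# K= 3  P(all components >= 0.9) = 1.0e-03
# K=10  P(all components >= 0.9) = 1.0e-10
# K=50  P(all components >= 0.9) = 1.0e-50
```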

AGI and intelligence

ChatGPT:

Artificial General Intelligence (AGI) is a type of artificial intelligence that can understand, learn, and apply knowledge across a wide range of tasks at a level comparable to human intelligence. Unlike today’s AI systems, which are designed for specific tasks, AGI would be able to generalize knowledge and adapt to new situations without needing task-specific training.

I believe AGI and intelligence are two closely related but distinct concepts.

General intelligence is the spectrum of capabilities across different domains, while AGI is its artificial counterpart — a machine that exhibits general intelligence.

Intelligence, by contrast, is the ability to perform well within a single domain. Colloquially, the two are often conflated, but if we are to arrive at a more pragmatic definition we need to draw a clear line between them.

AGI and consciousness

Consciousness is often conflated with AGI, but they are not the same thing.

ChatGPT:

Consciousness is the subjective experience of awareness—the fact that you can feel, perceive, think, and experience things from a first-person perspective.

We do not fully know what consciousness is for humans; therefore, we cannot define its artificial counterpart. However, intelligence is more empirically measurable, while consciousness is more philosophical and subjective. That is why I purposefully omitted any mention of consciousness in the discussion on AGI.

References

Spearman, C. (1904). General Intelligence, Objectively Determined and Measured. American Journal of Psychology, 15(2), 201–293.
Thurstone, L. L. (1938). Primary Mental Abilities. University of Chicago Press.
Radford, A., Narasimhan, K., Salimans, T. & Sutskever, I. (2018). Improving Language Understanding by Generative Pre-Training. OpenAI. https://cdn.openai.com/research-covers/language-unsupervised/language_understanding_paper.pdf
LeCun, Y., Chopra, S., Hadsell, R., Ranzato, M. & Huang, F. J. (2006). A Tutorial on Energy-Based Learning. In Bakir, G., Hofmann, T., Schölkopf, B., Smola, A. J. & Taskar, B. (Eds.), Predicting Structured Data. MIT Press.
Cattell, R. B. (1963). Theory of Fluid and Crystallized Intelligence: A Critical Experiment. Journal of Educational Psychology, 54(1), 1–22.