Artificial General Intelligence (AGI) today tends to be defined differently by everyone. Every month some hyperscaler claims we have reached consciousness or AGI, but the truth is: how can we define artificial general intelligence if we cannot even define intelligence in the first place?
There is a body of literature in psychology and philosophy (Spearman, 1904; Thurstone, 1938, to name but a few) that has over the years come to the conclusion that human intelligence is multifaceted. Therefore, why would artificial intelligence be any different? For example, I’m decent at mathematics and computer science, but I’m horrendous at chemistry.
This raises the question: how do we empirically evaluate it? For humans, we tend to test abilities from primary school to university over many domains. The domains range from general (as in most educated members of society should be knowledgeable in history, science, language, etc.) to specific (as in being able to solve a math problem or write a poem).
For LLMs we have benchmark datasets, with more published every year.
Some of the most popular benchmarks include GLUE, SuperGLUE, MMLU, BIG-bench, GSM8K, HumanEval, MBPP, TruthfulQA, and HellaSwag.
These benchmarks, too, are designed to probe different dimensions of intelligence — from general language understanding to domain-specific reasoning and code generation.
Towards a better definition
While plotting humans on a graph feels ethically uncomfortable, plotting AI systems across different dimensions of intelligence is a useful way to visualise their capabilities and compare them to one another.
None of this is novel — we already have benchmarks. What I find lacking is a unified way to define AGI across all of these dimensions. Such a definition would give us a principled answer to the question: how close to AGI is a given system?
Consider a 2-dimensional graph where each axis represents a domain of intelligence (e.g. mathematics, reasoning, language, knowledge, …).
The point at the top-right corner — scoring maximally on every axis — represents the theoretical ceiling: a system that performs optimally across all dimensions.
In $n$ dimensions the same idea holds: AGI is the vertex at $(1, 1, \dots, 1)$ in an $n$-dimensional unit hypercube $[0, 1]^n$.
This is our hypothetical most intelligent system — the system that can solve any problem, across any domain, with the highest accuracy.
Let us say we have $n$ task domains. Assigning each domain to its own axis, we end up with a graph over $n$ dimensions as displayed below.
As a result, any AI system — architecture-agnostic, whether a GPT model (Radford et al., 2018), an energy-based model (LeCun et al., 2006), or otherwise — can be empirically evaluated across these dimensions and plotted as a point in this $n$-dimensional space.
Ideally, we would want a single metric that captures how close a given system is to that top-right corner — a scalar between $0$ and $1$, where $1$ represents the hypothetical most intelligent system across all dimensions.
Why not, then, define such a measure? A score of $0$ would denote no capability whatsoever, and $1$ would denote perfect artificial general intelligence.
To reduce that down to one dimension, we can project every system onto the main diagonal of the $n$-dimensional unit hypercube — the line from the origin to the AGI point $\mathbf{1} = (1, 1, \dots, 1)$. The normalised scalar projection of a system’s score vector $\mathbf{x} = (x_1, \dots, x_n)$ onto this diagonal is:

$$A(\mathbf{x}) = \frac{\mathbf{x} \cdot \mathbf{1}}{\|\mathbf{1}\|^2} = \frac{1}{n} \sum_{i=1}^{n} x_i,$$
which is simply the arithmetic mean of its dimension scores. This satisfies our requirements: AGI at $(1, 1, \dots, 1)$ maps to $1$, the origin maps to $0$, and every dimension carries equal weight. Any system in between receives a score that linearly reflects its average position across the space.
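The mean-as-projection score can be sketched in a few lines. This is an illustrative snippet, not part of any benchmark suite; the function name and example score vectors are hypothetical.

```python
# Sketch of the projection-based AGI score: the normalised scalar
# projection of a score vector onto the hypercube's main diagonal
# reduces to the arithmetic mean of the per-domain scores.

def agi_score(scores):
    """Arithmetic mean of per-domain scores, each assumed to lie in [0, 1]."""
    assert all(0.0 <= s <= 1.0 for s in scores), "scores must be normalised"
    return sum(scores) / len(scores)

print(agi_score([1.0, 1.0, 1.0]))  # AGI vertex -> 1.0
print(agi_score([0.0, 0.0, 0.0]))  # origin -> 0.0
print(agi_score([0.9, 0.4, 0.7]))  # somewhere in between
```

The boundary cases behave exactly as the definition requires: the AGI vertex scores $1$, the origin scores $0$, and everything else falls linearly in between.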
Not all facets of intelligence are tied to a single domain. Cattell’s distinction between fluid and crystallised intelligence (Cattell, 1963) implies that learning velocity — how quickly a system acquires a new capability — is a core facet of intelligence in its own right. Likewise, power efficiency — the computational cost of reaching a given performance level — matters: two systems with identical accuracy are not equally intelligent if one requires orders of magnitude more energy.
We do not create additional axes for these properties because they are not capability domains in themselves. Instead, we introduce domain-agnostic scalars $s_1, \dots, s_k$, each normalised to $[0, 1]$.
The question is how to fold them into the AGI score while preserving our boundary condition: the score equals $1$ if and only if the system is maximally capable across every domain and every agnostic scalar.
To do that we simply extend the score vector. Define:

$$\mathbf{z} = (x_1, \dots, x_n, s_1, \dots, s_k).$$

This vector lives in an $(n + k)$-dimensional unit hypercube. The AGI vertex is still $(1, 1, \dots, 1)$, now with $n + k$ ones. Projecting onto the main diagonal exactly as before gives:

$$A(\mathbf{z}) = \frac{1}{n + k} \left( \sum_{i=1}^{n} x_i + \sum_{j=1}^{k} s_j \right).$$
All the same properties hold: $A = 1$ at the AGI vertex, $A = 0$ at the origin, and $0 \le A \le 1$ everywhere. Every component — whether a domain score or an agnostic scalar — contributes equally.
If we want to keep the two groups conceptually separate, we can rewrite this equivalently as a weighted combination of their respective means:

$$A(\mathbf{z}) = \frac{n}{n + k} \bar{x} + \frac{k}{n + k} \bar{s},$$

where $\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$ is the domain mean and $\bar{s} = \frac{1}{k} \sum_{j=1}^{k} s_j$ is the agnostic-scalar mean. The weights $\frac{n}{n+k}$ and $\frac{k}{n+k}$ are simply the proportion of components in each group — no free parameters, no arbitrary choices.
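A quick numerical check makes the equivalence concrete. The domain scores and agnostic scalars below are made-up illustrative values, not measurements of any real system.

```python
import math

# Hypothetical scores: n = 4 capability domains, k = 2 domain-agnostic
# scalars (e.g. learning velocity, power efficiency), all in [0, 1].
domain_scores = [0.8, 0.6, 0.9, 0.7]
agnostic_scalars = [0.5, 0.3]
n, k = len(domain_scores), len(agnostic_scalars)

# Plain mean of the extended (n + k)-dimensional vector z.
extended_mean = sum(domain_scores + agnostic_scalars) / (n + k)

# Equivalent weighted combination of the two group means.
x_bar = sum(domain_scores) / n
s_bar = sum(agnostic_scalars) / k
weighted = (n / (n + k)) * x_bar + (k / (n + k)) * s_bar

assert math.isclose(extended_mean, weighted)
print(round(extended_mean, 4))  # 0.6333
```

The two formulations agree because the group weights are exactly the component counts divided by the total, so the weighted sum reassembles the plain mean.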
Can you reach AGI by chance?
A natural objection to any scoring framework is: couldn’t a system stumble into a high AGI score by randomly performing well on some axes? The answer turns out to be a resounding no — and the mathematics behind this actually strengthens the case for the framework.
Suppose each of the $m = n + k$ components is drawn independently from $\mathrm{Uniform}(0, 1)$. The AGI score is then the sample mean of $m$ i.i.d. uniform random variables:

$$A = \frac{1}{m} \sum_{i=1}^{m} U_i, \qquad U_i \sim \mathrm{Uniform}(0, 1).$$
- $\mathbb{E}[A] = \frac{1}{2}$, always, regardless of $m$
By the Central Limit Theorem, for large $m$:

$$A \approx \mathcal{N}\!\left(\frac{1}{2}, \frac{1}{12m}\right).$$
The probability of randomly achieving $A \ge \tau$ (some AGI threshold, say $\tau = 0.9$) is:

$$P(A \ge \tau) \approx 1 - \Phi\!\left( \left(\tau - \tfrac{1}{2}\right) \sqrt{12m} \right),$$

where $\Phi$ is the standard normal CDF.
For $\tau = 0.9$: with $m = 10$ dimensions the probability is roughly $10^{-6}$; at $m = 20$ it drops to roughly $10^{-10}$; by $m = 50$ or more it is effectively zero. The expected score is always $\frac{1}{2}$, and as dimensions are added the variance shrinks, concentrating the random score ever more tightly around that midpoint. This is the concentration of measure phenomenon in high-dimensional spaces.
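A small Monte Carlo simulation makes the concentration effect visible without any normal approximation. The trial count, seed, and dimension values are arbitrary choices for illustration.

```python
import random

# Simulate random score vectors: each of m components ~ Uniform(0, 1),
# and the AGI score is their mean. As m grows, random scores pile up
# around 1/2 and the chance of clearing a 0.9 threshold vanishes.
random.seed(0)
TRIALS = 100_000
TAU = 0.9

results = {}
for m in (2, 10, 20):
    scores = [sum(random.random() for _ in range(m)) / m for _ in range(TRIALS)]
    mean_score = sum(scores) / TRIALS
    hit_rate = sum(s >= TAU for s in scores) / TRIALS
    results[m] = (mean_score, hit_rate)
    print(f"m={m:2d}  mean~{mean_score:.3f}  P(A>=0.9)~{hit_rate:.5f}")
```

Even at $m = 2$ the threshold is rarely cleared; by $m = 20$ not a single random vector out of 100,000 reaches it, matching the $\sim 10^{-10}$ tail probability from the normal approximation.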
The implication is threefold. First, AGI cannot be fluked — systematically high performance across all axes is required; luck will not get you there. Second, more dimensions means more robustness — each axis added makes the definition harder to satisfy by chance, so a high score becomes increasingly meaningful. Third, lopsided systems are penalised — a system that excels in a few random areas but fails in others will still land near $\frac{1}{2}$.
If we adopt an even stricter criterion — requiring $x_i \ge \tau$ in every dimension — the probability decays exponentially: $P = (1 - \tau)^m$. At $\tau = 0.9$ and $m = 20$, that gives $0.1^{20} = 10^{-20}$.
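The strict criterion has a closed form, since each component independently clears $\tau$ with probability $1 - \tau$. A minimal sketch, with the function name being my own:

```python
def strict_pass_probability(tau: float, m: int) -> float:
    """P(all m independent Uniform(0, 1) components are >= tau).

    Each component clears tau with probability (1 - tau); by
    independence, all m of them do with probability (1 - tau)**m.
    """
    return (1.0 - tau) ** m

print(strict_pass_probability(0.9, 20))  # about 1e-20
print(strict_pass_probability(0.9, 50))  # about 1e-50
```

The exponential decay in $m$ is what makes the strict criterion so much harsher than the mean-based score: adding ten more dimensions multiplies the fluke probability by another $10^{-10}$.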
AGI and intelligence
ChatGPT:
Artificial General Intelligence (AGI) is a type of artificial intelligence that can understand, learn, and apply knowledge across a wide range of tasks at a level comparable to human intelligence. Unlike today’s AI systems, which are designed for specific tasks, AGI would be able to generalize knowledge and adapt to new situations without needing task-specific training.
I believe AGI and intelligence are two closely related but distinct concepts.
General intelligence is the spectrum of capabilities across different domains, while AGI is its artificial counterpart — a machine that exhibits general intelligence.
Intelligence, by contrast, is the ability to perform well within a single domain. Colloquially, the two are often conflated, but if we are to arrive at a more pragmatic definition we need to draw a clear line between them.
AGI and consciousness
Consciousness is often conflated with AGI, but they are not the same thing.
ChatGPT:
Consciousness is the subjective experience of awareness — the fact that you can feel, perceive, think, and experience things from a first-person perspective.
We do not fully know what consciousness is for humans; therefore, we cannot define its artificial counterpart. However, intelligence is more empirically measurable, while consciousness is more philosophical and subjective. That is why I purposefully omitted any mention of consciousness in the discussion on AGI.