Why There Will Be No AGI

August 4, 2023



There is a lot of debate about the capabilities and limitations of large language models like GPT-3, GPT-4, Claude, and Llama. Do they display emergent capabilities? Do they merely display memorization but not generalization powers? Is it correct to imply that they have reasoning abilities? Do they display human-level natural language understanding? How do we even define human-level natural language understanding? Will it ever be possible to get rid of the hallucination problem? Is the Natural Language Processing field obsolete (in a Fukuyama End of History style)?

The November surprise

First, let me say something that is not so original, because it echoes many of the thoughts I have seen around on LLMs. It is incredible to see what next-word prediction models can do. No one anticipated that we would be able to do so many useful things with them, until OpenAI opened Pandora’s box and released ChatGPT in November 2022. It is certainly a testament to the power of language, and specifically of context (given that attention is all we need), that we can do summarization, translation, and question answering, among other things, with the same GPT model.

There is so much power in language, so much wisdom, knowledge. There is such immense power in written language. It’s beautiful that we are able to distill a kind of inner structure of human knowledge by merely attempting to predict the next word in a sentence, based on the context. This implies that there is a considerable amount of standardization in the way we say or write things. Therefore, true novelty is quite rare, unpredictable, and precious. The space created by language, with its words and grammar, is infinite, yet simultaneously constrained by logic and reality, by human experience and what we’ve collectively gleaned from it since the beginning of time. This implies that our collective “semantic space” (the space of all the meaningful sentences or texts we can create) is more structured and predictable than we previously thought.
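To make the idea of next-word prediction concrete, here is a deliberately tiny sketch: a bigram model that predicts the next word purely from co-occurrence counts in a made-up corpus. Real LLMs learn from billions of tokens with deep neural networks rather than raw counts, so this is only an illustration of the prediction objective, not of how GPT works internally.

```python
from collections import Counter, defaultdict

# Hypothetical toy corpus; any text would do for this illustration.
corpus = "the cat sat on the mat the cat ate the fish".split()

# Count how often each word follows each other word (a bigram model).
followers = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    followers[prev][nxt] += 1

def predict_next(word):
    """Return the most frequent continuation of `word` in the corpus."""
    return followers[word].most_common(1)[0][0]

print(predict_next("the"))  # "cat", since "the cat" occurs most often here
```

Even this crude model captures a sliver of the regularity discussed above: because we phrase things in standardized ways, the next word is often predictable from context.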

I find all of this fascinating. It’s as if there are certain arithmetic or algebraic laws of language that we don’t fully understand yet, but which make some sequences of words meaningful and human-like, while others are not. This is a testament to the power of mathematics and computer science. We are able to construct a system that generates human-like text, based on advanced algorithms and a numerical representation of language.

Data challenges

LLMs are trained on an enormous amount of text data fetched from the internet. The training data thus contains beautiful poems and ugly prose, factual information and lies, thoughtful philosophical inquiries and conspiracy theories, humanist texts and racist pamphlets. Although it is possible to filter out some of the filth, it is virtually impossible to perfectly “sanitize” the data used to train LLMs. Doing so would require a robust way to evaluate what is true and what is false, what is beautiful and what is ugly. And even we humans can’t do that well and reliably, since there is no consensus about truth and beauty.

Since we can’t always agree with each other or determine the truth, it’s preposterous to believe we could build a system (LLM) that would always produce truthful or beautiful sentences. That’s why I’m sometimes uncomfortable with the expectations that people place on LLMs. People expect LLMs to never be wrong about anything, even though this is not even feasible for humans. They take one instance where the LLM failed to answer correctly (usually in a zero-shot case) and use it to belittle the capabilities of LLMs, despite the fact that time and time again, in test after test, machine learning systems, including LLMs, have proven to surpass human abilities.


Machine Learning Systems vs Human Abilities

Some of these criticisms are valid. It’s true that LLMs aren’t reasoning machines per se; they exhibit a semblance of reasoning that’s simply embedded in lexical cues.

But if the LLM is 65% or 70% correct on the complex task it has to accomplish, and if I can even boost that accuracy to 80% or 90% through clever prompt engineering strategies, doesn’t that make the LLM a powerful tool? Perhaps even more proficient than most humans at natural language understanding and question answering across several topics?
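One simple prompt-engineering strategy of the kind alluded to above is few-shot prompting: prepending a handful of worked examples to the question often lifts accuracy compared to asking zero-shot. The sketch below assembles such a prompt; the task, example sentences, and labels are entirely hypothetical.

```python
# Hypothetical demonstration pairs for a made-up extraction task.
examples = [
    ("The invoice total is $1,200, due in 30 days.", "amount=1200; terms=net-30"),
    ("Please pay $450 within 15 days of receipt.", "amount=450; terms=net-15"),
]

def build_prompt(query):
    """Assemble a few-shot prompt from (input, output) demonstration pairs."""
    lines = ["Extract the amount and payment terms from each sentence."]
    for text, label in examples:
        lines.append(f"Sentence: {text}\nAnswer: {label}")
    # The model is expected to complete the final "Answer:" line.
    lines.append(f"Sentence: {query}\nAnswer:")
    return "\n\n".join(lines)

print(build_prompt("Remit $99 no later than 60 days from today."))
```

The string produced here would be sent to the LLM as-is; the demonstrations prime the model to follow the same format and improve its odds of answering correctly.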


I find it bizarre that people often criticize LLMs’ capabilities by stating they don’t exhibit human-level performance because they sometimes fail, when occasionally failing is, in fact, a quintessentially human trait. Moreover, how do we reconcile the fact that LLMs perform far better than most humans on all the standardized tests they’ve passed? I recently wrote an article about using GPT-4 for the CFA Level I exam, and the results were impressive. Most junior Financial Analysts do not reach that level. Take a look:


“Your Score” corresponds to GPT-4’s score

I believe the issue arises when people use their conception of what AGI (Artificial General Intelligence) should be to judge LLMs. However, LLMs are not AGIs, so let’s refrain from using AGI’s criteria, such as the ability to adapt to ever-changing circumstances by learning new skills autonomously (as humans do), to evaluate LLMs. Additionally, expecting LLMs to be 100% correct seems a bit far-fetched when even we humans are not. Furthermore, LLMs generate language, and since language, by definition, is rife with ambiguities, truths and falsehoods, beauty and ugliness, it’s impossible to have a ‘perfect’ LLM.

This has led me to question the feasibility of AGI. How will we recognize AGI when we achieve it? Is it supposed to be a system that always provides the correct answers to all questions? If so, that will be impossible, because there’s no way to know the correct answers to all questions all the time. There will always be questions that are subject to debate. Truth is, and always will be, a quest. Is AGI supposed to always make the right decisions? If so, then that’s impossible too. The world is too complex for any system to perfectly discern the best decision for every situation.

The problem is that we expect AGI to be human-like but without the flaws of humans. Yet, we can’t even agree on what we consider to be flaws, as there’s no consensus. Another definition of AGI might suggest that the system should simply be able to acquire new skills on the fly, as humans do, but faster, I presume. After all, who would want a system that takes 10 years to master algebra as humans do? Yet even people who hold Ph.D.s in algebra can make errors in algebra, so should we expect an AGI to be any different? This is only achievable if we suppose that errors and mistakes are purely human flaws, and not inherent to the process of acquiring and creating knowledge.

I notice a tendency to anthropomorphize AGI, reflecting a seemingly inherent human desire to create a superhuman or god-like entity. At its core, AGI will just be a tool. Since the invention of the first stone tools by Homo habilis 2.6 million years ago, we humans have always relied on increasingly complex and powerful tools to improve our lives and realize our dreams. AGIs, robots, or whatever we invent next will always be tools and, as such, will require human control. We won’t invent a cyborg-like AGI, and if we do, we had better know how to control it, because it will be as imperfect as any human creation. We wouldn’t want to grant autonomous killing capability to something that can make mistakes but is extremely efficient at certain specific tasks.

I view LLMs as databases of human knowledge. While they may not resemble your typical SQL or NoSQL database, they are, in essence, semantic databases. However, as human knowledge is imperfect, these databases are also imperfect, prone to occasionally outputting incorrect information, yet they remain overall powerfully useful.

It’s incredible that now, when I want to understand something, I simply open ChatGPT (GPT-4 in my case) and ask questions to further delve into the topic of my inquiry. This is the ultimate Socratic experience, one which necessitates critical thinking, given that I know ChatGPT can produce incorrect responses. So the experience is truly akin to having a knowledgeable companion with whom to discuss any topic, bearing in mind that this companion might err in its utterances, just as I might be wrong in my assumptions.

Do LLMs display emergent capabilities?

In my opinion, they do not. People often perceive emergent capabilities in LLMs largely because these systems are black boxes, and no one truly comprehends why they behave the way they do in detail. That being said, while LLMs may occasionally produce unexpected or surprisingly accurate results, it’s essential to remember that these instances are generally the result of a vast amount of underlying data and complex algorithms at work, rather than any emergent, self-developed capability. Therefore, while fascinating and undeniably powerful, it’s critical not to overstate or misinterpret what LLMs are capable of.

Do LLMs merely exhibit memorization but lack generalization capabilities?

It’s challenging to make a definitive statement. I believe there’s a degree of memorization involved, but I also suspect that LLMs possess some ability to generalize. If they didn’t, they wouldn’t be able to handle such a vast array of human scenarios so effectively. So, while LLMs certainly capitalize on their massive training data to generate responses, there seems to be an element of pattern recognition and extrapolation that enables them to apply learned concepts in a broader context.

Do they display human-level natural language understanding?

I believe that LLMs don’t, and never will, truly understand human language in the way humans do. However, I also think that LLMs mimic natural language understanding in a way that surpasses the average human’s capability. It’s essential to differentiate between a deep, inherent understanding, which is tied to human experience and consciousness, and a sophisticated mimicking of understanding, which is based on patterns, data, and algorithms.

Will it ever be possible to entirely eliminate the hallucination problem?

I doubt we’ll be able to entirely eradicate the hallucination problem, as it is fundamentally linked to the fact that our written texts often contain contradictory facts and beliefs. It seems somewhat unrealistic to expect LLMs to circumvent the inconsistencies that we ourselves embed in our texts. Furthermore, nothing in the current training process of LLMs is specifically designed to enhance factuality. I believe the factuality demonstrated by current LLMs is largely a derivative of memorization — less noise about a certain subject enables the LLM to more effectively remember the correct facts.

Is the Natural Language Processing field obsolete (in a Fukuyama End of History style)?

We all know what happened after Fukuyama declared the end of History. I think the same is true for NLP. By no means is NLP obsolete. In fact, we’re continually finding new challenges, methods, and perspectives in the field. The ongoing advancements in NLP mirror history’s progression, where new eras unfold and paradigms shift. So, far from reaching its endpoint, NLP, like history, continues to evolve and advance.

Key concepts about GPT models

GPT (Generative Pre-trained Transformer) models like GPT-4 are designed on the principles of transformer architectures, which use deep learning and a probabilistic understanding of language rather than formal grammatical rules. Here are the key concepts underlying the magic of ChatGPT:

  1. Embeddings: The first step in GPT’s operation involves translating words into a dense vector representation. This is typically achieved using a method called word embeddings, where each unique word is associated with a high-dimensional vector. The positions of these vectors in the space are learned such that words with similar meanings are close together.
  2. Transformers: GPT uses a transformer architecture, which involves mathematical operations like matrix multiplication, scaling, addition, and application of activation functions (such as the softmax function). The transformer model operates on sequences of embeddings and uses a mechanism called attention to weigh the importance of different words in the sequence when predicting the next word.
  3. Probabilistic Language Modeling: GPT models language as a sequence of probabilistic events, where each word is chosen based on the probability distribution conditioned on the preceding words. It uses the concept of maximum likelihood estimation to adjust its internal parameters during training. The model’s prediction for the next word is the word that has the highest probability given the preceding words.
  4. Backpropagation and Gradient Descent: GPT learns its internal parameters (the weights in the neural network) using backpropagation and gradient descent. These are mathematical algorithms for iteratively improving the model’s predictions. Backpropagation computes the gradient (a measure of change) of the loss function with respect to the model’s parameters, and gradient descent then uses these gradients to update the parameters.
  5. Loss Functions: GPT uses a loss function to quantify how far off its predictions are from the actual outcomes. A common choice is the cross-entropy loss, which compares the model’s predicted probability distribution with the true distribution.
  6. Attention Mechanisms: GPT uses a mechanism called “scaled dot-product attention”, which determines how much ‘attention’ should be paid to each word in the input when generating the next word. This involves taking dot products (a measure of similarity) of the input word vectors, applying a scaling factor, and then applying a softmax function to get a probability distribution. The model learns patterns in the data it’s trained on, which can include syntactic (grammar-based) and semantic (meaning-based) patterns, but it does so in a fundamentally statistical and probabilistic way.
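The scaled dot-product attention described in point 6 can be sketched in a few lines of NumPy. The shapes and random inputs below are illustrative, not GPT-4’s actual dimensions, and real models add learned projections, many attention heads, and causal masking on top of this core operation.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row max before exponentiating for numerical stability.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)     # dot products measure similarity
    weights = softmax(scores, axis=-1)  # each row is a distribution over positions
    return weights @ V, weights

rng = np.random.default_rng(0)
Q = rng.normal(size=(3, 4))  # 3 query positions, key dimension d_k = 4
K = rng.normal(size=(3, 4))
V = rng.normal(size=(3, 8))  # value dimension 8
out, w = scaled_dot_product_attention(Q, K, V)
print(out.shape)  # (3, 8): one attention-weighted value vector per query
```

Each output row is a weighted average of the value vectors, with the weights, a probability distribution produced by the softmax, saying how much “attention” each position receives.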