This is an accessible introduction to what ChatGPT is and how it works. It is written in plain language for a general audience without any technical background in machine learning.

OpenAI launched a new chatbot service called ChatGPT in November 2022, and it immediately attracted tremendous attention all over the world. ChatGPT is said to have reached one million users in only five days, faster than any service or product ever created before. Meanwhile, praise for its uncanny smartness and amazing capabilities has flooded social media. Chatbots are not new to us at all, as many of us have previously used other such products, like Apple’s Siri and Amazon’s Alexa. This time, however, people are astonished because ChatGPT is clearly different from all preceding AI products. In my opinion, ChatGPT is a revolutionary product in technological development and it is destined to change our daily lives in many fundamental ways. ChatGPT is a tipping point in our long-time pursuit of human-level artificial intelligence (AI), and it has been significantly reshaping the landscape of AI research and development, both in academia and industry.

As a long-time AI researcher and an early user of ChatGPT, I shall write several blog posts to talk about ChatGPT and its impact on our society, particularly focusing on how it may change our current post-secondary education, as well as its influence on AI research, especially in the areas of machine learning and natural language processing.

Here, as the first in a series of blog posts, I will use plain language to explain what ChatGPT is and how it works. The technology behind ChatGPT is not particularly innovative on its own, as all of its core techniques were invented by many people in the field over the past decades (most of whom are not even with OpenAI). In this post, I will explain how OpenAI has managed to take advantage of some previously known techniques and successfully deliver a revolutionary and influential AI product that is destined to change the world.

Language Models

First of all, at its core ChatGPT relies on a colossal deep neural network model called GPT-3, which works as a large language model (LLM). Language models are not a new technique at all; they have been around for at least 60 years. The basic idea of a language model is to have a computer predict what word would follow a given partial sentence (or paragraph). This partial sentence (or paragraph) is given as input, and the language model infers which word (or words) have a good chance of following it to eventually form a meaningful sentence. We know that all the words appearing in a fluent sentence, a coherent paragraph, or a logical document are highly correlated. Any legible sentence must use relevant words, and it is impossible to convey a purposeful message using random words from a dictionary. Moreover, these correlated words must follow a certain order to keep the sentence legible. Of course, there is often more than one way to order these words into a meaningful message, but the order cannot be arbitrary.

Language models are built using a typical machine learning approach: we first collect a training corpus consisting of a large number of sentences or documents, and then let the language model read through these sentences one by one to learn both the order of the words and the correlations between them. After this training process is done, the learned language model will be able to predict what word may follow any given text prompt, because it has learned some universal information about how words are organized in natural language. More specifically, for any given text prompt (no matter how short or long), a language model computes a probability score (between 0 and 1) for each word in a dictionary to indicate how likely that word is to appear after the given prompt. For example, given any partial English sentence, if we go through all the words in a dictionary, it is easy to tell that most of them cannot appear right after the partial sentence and still yield a grammatically correct, logical, and meaningful sentence. A good language model will give probability zero (0) to these words. Of course, we can usually find a much smaller number of words (compared to the total number in the dictionary) that could appear right after the partial sentence and still lead to a meaningful sentence. A good language model should give non-zero probability scores to these words, e.g. 0.1, 0.2, 0.35, etc. The value of each probability should indicate how well the word fits in this context - the larger the better.
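
To make this concrete, here is a minimal sketch in Python that asks a small, openly available language model (GPT-2, a predecessor of GPT-3) which words it considers most likely to follow a prompt. It assumes the Hugging Face transformers library and PyTorch are installed; the prompt itself is just an illustrative example.

    import torch
    from transformers import GPT2LMHeadModel, GPT2Tokenizer

    tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
    model = GPT2LMHeadModel.from_pretrained("gpt2")

    prompt = "The weather today is very"
    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits            # a score for every word in the vocabulary
    probs = torch.softmax(logits[0, -1], dim=-1)   # turn scores into probabilities between 0 and 1

    top = torch.topk(probs, k=5)                   # the five most likely next words
    for p, idx in zip(top.values, top.indices):
        print(f"{tokenizer.decode(int(idx)):>10s}  {p.item():.3f}")

Running this prints a handful of plausible continuations with their probabilities, while the vast majority of dictionary words receive probabilities close to zero.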

In the past, due to the limitations of computing resources as well as the amount of training data available to us, traditional language models made some strong simplifications (a.k.a. inductive biases). For example, no matter how long the partial sentence is, we only look at the few most recent words (e.g., only 1-3 words) that precede the current prediction position, namely the last 1-3 words at the end of the given partial sentence within a hypothetical look-up window. In other words, these traditional language models only use the words within this look-up window to predict the next word, rather than the full context if the context is longer than the window. The window is slid to the right one position at a time to predict subsequent words. The look-up window in traditional language models is often constrained to no more than three words in order to keep the model size manageable. Generally speaking, it is a sensible choice to use only a small number of the most recent words for prediction when computing resources are limited, since the most recent words show the strongest correlation with the current prediction position on average over many different partial sentences and contexts. However, the fact that the average correlation is the strongest does not mean the correlation is strong in every case. For instance, when we read a long document, we can often see that the reason a particular word appears in a certain place lies in another word or phrase in some far-away context (sometimes even a few sentences apart) that is semantically linked to it, rather than in its immediate neighbors. Because of the assumptions adopted in the first place with regard to the look-up window size, traditional language models are completely unable to capture this type of long-distance dependency, which is abundant in natural language. Some earlier-generation neural network language models improved the situation slightly, but empirical studies have shown that even they cannot effectively make use of more than the 7-8 most recent words in each context for prediction. In other words, their look-up windows usually cannot exceed 7-8 words on average.
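
As a toy illustration of such a look-up window, here is a minimal sketch of a count-based trigram model in Python: it predicts the next word from only the two most recent words, and everything outside that tiny window is ignored. The two-word window and the miniature corpus are purely illustrative.

    from collections import Counter, defaultdict

    corpus = "the cat sat on the mat . the dog sat on the rug .".split()

    counts = defaultdict(Counter)
    for i in range(len(corpus) - 2):
        window = (corpus[i], corpus[i + 1])    # only the 2 most recent words are kept
        counts[window][corpus[i + 2]] += 1     # count which word follows this window

    def predict(w1, w2):
        """Probability of each possible next word, given just the 2-word window."""
        following = counts[(w1, w2)]
        total = sum(following.values())
        return {word: c / total for word, c in following.items()}

    print(predict("sat", "on"))   # {'the': 1.0} -- any context before the window is invisible to the model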

Transformer

GPT-3, the language model used by ChatGPT, adopted a newer neural network architecture, called the transformer, which was initially proposed by Google researchers in 2017. There is no doubt that the transformer was a new architecture when it was proposed, but many similar or related architectures had been proposed in the literature before it.

There is no breakthrough or revolutionary idea behind the transformer architecture. The underlying math essentially involves multiplications and manipulations of large matrices. Transformers pack a large number of trainable model parameters into these large matrices, and moreover they are very computationally intensive because the computational complexity of this architecture is quadratic in both the input size and the model size. That is, when the model size (or input data dimension) grows N times, the number of computation steps in a transformer grows N² times. In the past, it was unwise to propose or use any quadratically-complex model, and researchers usually endeavored to simplify the model structure to make it at least linear, so that the total number of computation steps would grow only N times in the above cases. However, the researchers and engineers at Google probably had plenty of CPUs and GPUs to abuse in 2017. They first proposed this quadratically-complex model structure for some benchmark language translation tasks and reported impressive performance gains. Immediately after that, the architecture became widely adopted by researchers and engineers all over the world for more and more language-related tasks. At present, transformers are considered the dominant neural network model in almost all AI tasks, not just in language but also in speech- and vision-related applications.
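
A back-of-the-envelope illustration of what "quadratic" means here: the attention part of a transformer compares every position in its window with every other position, so the number of pairwise comparisons grows with the square of the window size. The window sizes below are just examples.

    for window in (256, 512, 1024, 2048):
        print(window, window * window)   # doubling the window quadruples the pairwise comparisons
    # 256 -> 65,536 comparisons; 2048 -> 4,194,304 comparisons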

The large language model behind ChatGPT, GPT-3, adopts the same structure as Google’s original transformer, but it is way bigger. Compared with GPT-3, Google’s original transformer model is a peanut next to a watermelon. Over the past few years, OpenAI has boldly scaled up the size of the transformer in all possible dimensions. After several generations of development (GPT in 2018 and GPT-2 in 2019), OpenAI initially released the gigantic GPT-3 in 2020 and kept updating it until 2022. Recall that traditional language models only look at a small window of 1-3 preceding words to predict the next word. GPT-3 extends its look-up window to cover at most 2048 words [a]. That is, GPT-3 can make use of up to 2048 words preceding the current position to predict the next word that may appear. A sequence of 2048 words is long, and it can easily span many sentences, many paragraphs, or even several documents. Of course, GPT-3 does not treat all 2048 words in the look-up window equally: the transformer architecture allows the prediction under each different context to focus on certain words or phrases while completely ignoring all the other irrelevant words within the window. This is often called the attention mechanism, which is so flexible in the transformer architecture that it can automatically adjust the language model to focus on different words or different combinations of words inside its long look-up window for different contexts (a small numerical sketch of this idea follows the list below). Of course, the costs associated with this powerful adaptive attention mechanism are huge, including:

  • High computational complexity, which is quadratic in the input data dimension, window size, and model size;
  • A large number of parameters are necessary to cope with a rich set of contexts in natural language.
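
Here is a minimal numerical sketch of the attention idea, using numpy: each position computes a score against every word in the window, turns those scores into weights, and takes a weighted average of the window, so that near-zero weights effectively mean "ignore that word". The tiny 4-word window, the 8-dimensional vectors, and the random numbers are purely illustrative; a real transformer learns these matrices during training.

    import numpy as np

    rng = np.random.default_rng(0)
    L, d = 4, 8                       # a window of 4 words, each represented by an 8-dimensional vector
    Q = rng.standard_normal((L, d))   # "queries": what each position is looking for
    K = rng.standard_normal((L, d))   # "keys": what each word in the window offers
    V = rng.standard_normal((L, d))   # "values": the content carried by each word

    scores = Q @ K.T / np.sqrt(d)     # an L-by-L table: every position scored against every word (hence quadratic)
    weights = np.exp(scores) / np.exp(scores).sum(axis=1, keepdims=True)   # softmax: each row becomes weights summing to 1
    output = weights @ V              # each position becomes a weighted mix of the whole window

    print(weights.round(2))           # rows sum to 1; near-zero entries are the words being ignored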

Large Language Models: GPT-3 and ChatGPT

The final version of GPT-3 contains 175 billion parameters, requiring roughly 1 terabyte (10¹² bytes) of space simply to store it. If anyone wants to make use of GPT-3 in any way, the entire model needs to be loaded into computer memory. Even today, not many computers or servers across the world can hold GPT-3 in memory. Obviously, even if GPT-3 were freely available to anyone, only a small fraction of users in the world could actually load it into memory and take advantage of it.

To train this gigantic GPT-3, OpenAI used a huge quantity of text scraped from the Internet: more than one trillion words of raw text collected over the past decades, including almost all online books, articles, blogs, tweets, advertisements, reviews, comments, and so on. It covers pretty much everything people have ever posted on the Web. It contains text in over 40 different languages, as well as a large quantity of computer programs/code posted on the Web along with their questions, explanations, and comments. One can imagine that these raw crawls contain lots of noise, garbage, and toxic content. OpenAI has done an excellent job of filtering and cleaning up these raw crawls to eventually produce a high-quality subset for training GPT-3. This filtered subset is still very large, consisting of 300 billion words [a]. A rough estimate shows that these words could easily fill about 2-3 million books, almost the entire collection of one of the largest libraries in the world. As of today, all the Wikipedia pages ever created on the Web contain only about 3 billion words, which accounts for only 1% of this subset.

During the training process, in principle, we need to make the GPT-3 model look at each word in this subset of 300 billion words, along with its preceding words (up to 2048 of them), one by one. At the same time, we must adjust all 175 billion parameters so that GPT-3 remembers each context and makes a good prediction in each case. It is not hard to imagine that this training process is extremely demanding in computing resources. OpenAI reports that each training cycle takes several months to finish on a large cluster of high-end servers consisting of thousands of top-of-the-line CPUs and GPUs. We can easily estimate that the electricity bill alone to run these machines for several months can reach 1-2 million US dollars. This does not include many other costs, such as purchasing or renting the machines, preparing and storing the one terabyte of training data, operational costs, and salary expenses. It is also common to run several training cycles to fine-tune various settings in the training procedure in order to deliver the best possible model. It has been reported that the total cost for OpenAI to train GPT-3 once amounts to about 12 million US dollars.
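
For readers curious what that next-word training objective looks like in code, below is a minimal sketch of a single training step, shown with the small GPT-2 model as a stand-in for GPT-3 (the real training loop repeats such steps billions of times across clusters of GPUs). It assumes the Hugging Face transformers library and PyTorch; the example sentence and learning rate are arbitrary.

    import torch
    from transformers import GPT2LMHeadModel, GPT2Tokenizer

    tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
    model = GPT2LMHeadModel.from_pretrained("gpt2")
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

    batch = tokenizer("A sentence taken from the training corpus.", return_tensors="pt")
    outputs = model(**batch, labels=batch["input_ids"])   # predict every word from its preceding words
    outputs.loss.backward()     # measure how wrong the predictions were and compute corrections
    optimizer.step()            # nudge all the parameters so the predictions improve slightly
    optimizer.zero_grad()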

Once a large language model like GPT-3 is trained, a natural use of it is to generate text given a text prompt. In this case, the given prompt is fed to GPT-3 as input to compute probability scores for all possible words that could follow the prompt as the first word of a reply. It normally produces probabilities of zero for all incoherent words and non-zero probabilities only for a handful of words that could potentially fit there, e.g. 0.3 for word A, 0.25 for B, 0.13 for C, etc. According to these probabilities, we randomly sample one as the first word of the reply, i.e. a 30% chance of A, a 25% chance of B, a 13% chance of C, and so on. Once the first word of the reply is determined, we feed the original prompt plus that first word to GPT-3 and repeat the same computation to determine the second word of the reply. This process continues until a special termination symbol is finally sampled. This explains why ChatGPT can generate different replies in different trials even under the same prompt. As long as the large language model is well trained on a large enough corpus, all these replies will be fluent and look plausible for the provided prompt.
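
The word-by-word sampling loop described above can be sketched in a few lines of Python, again using the small GPT-2 model as a stand-in for GPT-3. Because each new word is drawn at random according to the model's probabilities, repeated runs produce different continuations; the prompt and the 20-word limit are arbitrary choices for illustration.

    import torch
    from transformers import GPT2LMHeadModel, GPT2Tokenizer

    tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
    model = GPT2LMHeadModel.from_pretrained("gpt2")

    ids = tokenizer("Once upon a time", return_tensors="pt").input_ids
    for _ in range(20):                                     # generate up to 20 more words
        with torch.no_grad():
            probs = torch.softmax(model(ids).logits[0, -1], dim=-1)
        next_id = torch.multinomial(probs, num_samples=1)   # sample the next word by its probability
        if next_id.item() == tokenizer.eos_token_id:        # stop at the special termination symbol
            break
        ids = torch.cat([ids, next_id.view(1, 1)], dim=1)   # append it and repeat with the longer prompt

    print(tokenizer.decode(ids[0]))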

In addition to GPT-3, ChatGPT also requires a further training step that learns another, smaller model to guide GPT-3 to generate the best possible responses for many common prompts. This step requires hiring many human annotators to read responses generated by GPT-3 and manually label whether each particular response is sufficiently good for its prompt. All these human labels are then used to train another model that guides ChatGPT to produce only the best responses for each prompt, which is often referred to as aligning language models to follow instructions.
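
As a highly simplified sketch of this idea: generate several candidate replies, score each one with a separate model trained on human judgments, and keep the reply that the human-trained model likes most. Both functions below are made-up placeholders purely for illustration; the actual procedure (known as reinforcement learning from human feedback) also updates GPT-3's own parameters using such a learned reward model.

    def reward_model(prompt, reply):
        """Placeholder: a small model trained on human 'good/bad' labels would
        return a score for how good this reply is; here we fake one with word overlap."""
        return len(set(prompt.lower().split()) & set(reply.lower().split()))

    def generate_candidates(prompt):
        """Placeholder for sampling several different replies from the language model."""
        return ["It depends on the context.",
                "I am not sure about that.",
                "The capital of France is Paris."]

    prompt = "What is the capital of France?"
    best = max(generate_candidates(prompt), key=lambda reply: reward_model(prompt, reply))
    print(best)   # the reply the human-trained scoring model prefers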

The staggering cost involved has essentially pushed academic teams and smaller industry players out of the large language model business. Instead, this has become a game for a few billion-dollar industry giants. In addition to OpenAI (teamed up with Microsoft), several other large tech companies, such as Google and Meta, have been actively exploring large language models over the past few years. However, for the first time, ChatGPT has convinced many AI researchers and practitioners (including myself) that large language models are a promising direction towards artificial general intelligence (AGI). For the first time, many serious AI practitioners (not only journalists and sci-fi writers) have started to believe that AGI is not only feasible but also just around the corner.

[a] To be exact, these counts refer to tokens, or word fragments, rather than regular words, but let’s ignore this minor technical detail for convenience.