How do large language models like ChatGPT work? They might seem like magic, but at the core of the technology is a simple idea: predicting the next word using probability.
If we examined a large number of books and recorded how often each letter follows another, we would start to see some patterns. Pairs like TH and SH are often seen together, and the letter Q is almost always followed by U. If we went on to calculate all of these probabilities, simply knowing the current letter would give us a reasonable chance of predicting the next one.
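The counting described above can be sketched in a few lines of Python. This is a minimal illustration, not how a production model is built: the function name `bigram_probabilities` and the sample sentence are made up for the example, and spaces and punctuation are simply dropped, so letters on either side of a word boundary are treated as neighbours.

```python
from collections import Counter, defaultdict

def bigram_probabilities(text):
    """Estimate P(next letter | current letter) by counting letter pairs."""
    # Simplification: keep letters only, so word boundaries are ignored.
    letters = [ch for ch in text.lower() if ch.isalpha()]
    counts = defaultdict(Counter)
    for current, following in zip(letters, letters[1:]):
        counts[current][following] += 1
    probs = {}
    for current, followers in counts.items():
        total = sum(followers.values())
        probs[current] = {nxt: n / total for nxt, n in followers.items()}
    return probs

# Tiny made-up sample; a real estimate would need far more text.
sample = "the quick queen quietly quit the quiz"
probs = bigram_probabilities(sample)

most_likely_after_q = max(probs["q"], key=probs["q"].get)
print(most_likely_after_q, probs["q"][most_likely_after_q])  # prints: u 1.0
```

With only one sentence of input the table is tiny, but the same counting over many books would produce the kind of letter statistics described here.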
While this approach would work to a degree, it would get many things wrong. We could improve our ability to predict the next letter if we had more information. By knowing the previous two letters, our ability to predict the third letter would increase considerably. In general, the more preceding letters we know, the more accurately we can predict the next letter and, eventually, the whole word.
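To see why extra context helps, here is a small sketch that counts followers for contexts of different lengths. The helper names `build_model` and `predict_next` and the sample text are assumptions made for illustration. Unlike a letters-only count, this version keeps spaces and punctuation as characters.

```python
from collections import Counter, defaultdict

def build_model(text, context_len):
    """Count which character follows each run of `context_len` characters."""
    model = defaultdict(Counter)
    for i in range(len(text) - context_len):
        context = text[i:i + context_len]
        model[context][text[i + context_len]] += 1
    return model

def predict_next(model, context):
    """Return the most frequent follower of `context`, or None if it was never seen."""
    followers = model.get(context)
    if not followers:
        return None
    return followers.most_common(1)[0][0]

text = "the cat sat on the mat. the hat sat on the cat."
one_char = build_model(text, 1)
three_chars = build_model(text, 3)

# After a single "t" several different characters are possible...
print(len(one_char["t"]), "different followers of 't'")
# ...but after " th" only one character was ever observed in this sample.
print(predict_next(three_chars, " th"))
```

The trade-off is that longer contexts are seen less often, so a purely count-based model needs much more text as the context grows.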
Up until now, this process appears to rely on probabilities rather than artificial intelligence. You might imagine how we could extend this method to include more letters and then predict words, but this would still be a matter of probabilities. So how do we transition to something that appears intelligent?
To understand this, we need to understand something called “attention.” The idea of attention is that each sequence of letters can generate a query of things it’s looking for.
For example, if the sequence is “the cat _”, the query might be looking for a verb that would fit well in the context of the sentence. The attention mechanism allows the model to “attend to” or focus on relevant parts of the input sequence, weighing their importance and using this information to generate the most probable next word.
To implement this attention mechanism, large language models use an architecture called the Transformer. The Transformer is designed to handle long-range dependencies in text, allowing the model to better understand context and generate more coherent and contextual responses.
If the start of a story mentioned that the cat has a “French sounding name,” the model would give more consideration to words relating to French ideas and to cats.
The attention mechanism works by computing a set of attention scores for each word in the input sequence. These scores represent the relevance or importance of each word in relation to the current word being processed. The higher the attention score, the more relevant the word is considered to be.
To compute these attention scores, the attention mechanism uses three sets of vectors: the query, key, and value vectors. The query vector represents the current word being processed, whereas the key and value vectors represent the words in the input sequence. Comparing the query with each key produces the attention scores, which determine how much each earlier word’s value contributes when predicting the next word.
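The query, key, and value idea can be sketched with scaled dot-product attention, the form commonly used in Transformers. Everything concrete here is an assumption for illustration: the two-dimensional vectors are invented by hand, and the function names are hypothetical. In a real model these vectors come from learned projections and many attention heads run in parallel.

```python
import math

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def attend(query, keys, values):
    """Scaled dot-product attention for one query over a sequence of words."""
    scale = math.sqrt(len(query))
    scores = [dot(query, key) / scale for key in keys]  # one score per word
    weights = softmax(scores)
    dim = len(values[0])
    # The output is the weighted average of the value vectors.
    output = [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]
    return weights, output

# Invented vectors standing in for the words "the", "cat", "sat".
keys = [[0.1, 0.0], [1.0, 0.2], [0.2, 1.0]]
values = [[0.0, 0.1], [1.0, 0.0], [0.0, 1.0]]
query = [1.0, 0.0]  # made-up query that lines up best with the key for "cat"

weights, output = attend(query, keys, values)
print([round(w, 3) for w in weights])
print([round(x, 3) for x in output])
```

The largest weight falls on the second word because its key points in nearly the same direction as the query, so its value dominates the output.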
The attention/transformer architecture was a major breakthrough in AI, and perhaps most surprisingly, the performance of these models continues to improve as they grow. One way to think about the size of a model is to understand vectors. The core idea is that each concept in language can have a unique identifier.
Consider the concepts of Love, Height, Friendliness, and Animal, for example. Each of these attributes can range between 0 and 10, and the list of scores for a word forms a vector. If you select a word or concept, you can then score that word on each attribute. For instance, Dog may be represented as Love:5, Height:4, Friendliness:7, Animal:10. You could encode Cat as Love:4, Height:3, Friendliness:5, Animal:10. You can encode any word or concept and get back its vector. This process is called “embedding.” You can embed individual words, sentences, or even entire documents. Very large models don’t just have 4 attributes; their vectors can have thousands of dimensions, and the models themselves can have billions of parameters. Every concept a large language model works with is represented by such a vector. Most of the attributes are learned during training rather than named by a person: some you could imagine, and many more that you can’t.
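As a toy illustration of embedding, the sketch below stores the Dog and Cat scores from the example above in a lookup table. The `table` row, the function names, and the idea of embedding a sentence by averaging its word vectors are all assumptions made for the example. Real models learn their vectors and combine them in far more sophisticated ways.

```python
# Attribute order shared by every vector; the four names come from the example above.
DIMENSIONS = ["love", "height", "friendliness", "animal"]

# Hand-written toy embedding table. Dog and Cat use the scores from the text;
# the "table" row is an invented value added only for contrast.
EMBEDDINGS = {
    "dog": [5, 4, 7, 10],
    "cat": [4, 3, 5, 10],
    "table": [1, 3, 0, 0],
}

def embed_word(word):
    """Look up the vector for a single word."""
    return EMBEDDINGS[word.lower()]

def embed_sentence(sentence):
    """Embed a sentence as the average of its known word vectors (a crude assumption)."""
    vectors = [EMBEDDINGS[w] for w in sentence.lower().split() if w in EMBEDDINGS]
    if not vectors:
        raise ValueError("no known words in sentence")
    return [sum(column) / len(vectors) for column in zip(*vectors)]

print(dict(zip(DIMENSIONS, embed_word("Dog"))))
print(embed_sentence("dog cat"))
```

The point of the sketch is only that a word, or a group of words, ends up as a fixed-length list of numbers that can be stored and compared.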
The amazing thing is that once you’ve done these embeddings and trained the large language model, the vectors become comparable. As in the example above, the Cat and Dog vectors are similar, but not identical. With more attributes, you get more dimensions to compare. What’s also interesting is that you can do math on these vectors. The vector Cat + French may give you something close to the vector for chat, the French word for cat. The vector Tree + small may give you bonsai tree. As these models are trained on more language, the equations for language start to appear.
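Comparing vectors and doing arithmetic on them can be sketched as follows. Cosine similarity is one common way to compare embeddings. The three-dimensional vectors below are invented by hand so that the Tree + small example works out, and the helper names are hypothetical. In a trained model the vectors are learned, and the nearest neighbour of a sum is not guaranteed to be so tidy.

```python
import math

def cosine_similarity(a, b):
    """Return 1.0 for vectors pointing the same way, 0.0 for unrelated directions."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def add(a, b):
    """Element-wise vector addition."""
    return [x + y for x, y in zip(a, b)]

def nearest(vector, vocabulary, exclude=()):
    """Find the vocabulary word whose vector is most similar to `vector`."""
    candidates = {w: v for w, v in vocabulary.items() if w not in exclude}
    return max(candidates, key=lambda w: cosine_similarity(vector, candidates[w]))

# Invented vectors with made-up dimensions: (animal, plant, small).
VOCAB = {
    "tree": [0.0, 1.0, 0.0],
    "small": [0.0, 0.0, 1.0],
    "bonsai": [0.0, 0.9, 0.8],
    "cat": [1.0, 0.0, 0.3],
    "dog": [1.0, 0.0, 0.5],
}

# Cat and Dog are similar but not identical: similarity is high, yet below 1.
print(round(cosine_similarity(VOCAB["cat"], VOCAB["dog"]), 3))

# Tree + small lands closest to bonsai among the remaining words.
combined = add(VOCAB["tree"], VOCAB["small"])
print(nearest(combined, VOCAB, exclude=("tree", "small")))
```

Excluding the input words from the search is a common convention, since a sum usually stays close to the vectors it was built from.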
Although the actual implementation of large language models has more detail and subtlety, what began as simply predicting the next word now shows how concepts, words and attention weave together to form the foundation of intelligence.
However, LLMs are still limited. They currently require massive initial training and, unlike human brains, they lack the fluidity of learning, synthesizing new knowledge, and updating themselves dynamically. Current LLMs also lack long-term memory. Current techniques are constrained to a rather limited context window, but like all things, this is sure to evolve in the coming months and years.