What the heck is an LLM (and how are they made)?
A very brief explainer of LLMs - from a non-technical background.
Happy Monday!
Today we’re discussing the core building block of most AI tech we interact with today.
ChatGPT, Claude, Gemini, you name it, it’s built on an LLM - or Large Language Model. (Which, sadly, means they’re not magic oracles, but rather highly advanced prediction machines trained on pretty much all of the content of the internet).
But what is it exactly? And why can it hallucinate? And should I worry about it becoming sentient and evil in the AI revolution?
I got you - but it’s a bit of a long one, so stay with me.
Note: The best explainer of this concept that I have found is Andrej Karpathy's 3.5-hour video - so most of the credit here goes to him. What I’ve done is watch the video, break down the parts I think are most important for the everyday user, and put them here in a digestible format (but I highly recommend watching the video if you have the time to spare).
Section 1: What is an LLM?
At their core, Large Language Models are super advanced prediction machines.
Basically, they learn the statistical patterns of how chunks of text (referred to as “tokens”) show up in human language. They are trained to interact with humans as a useful assistant, drawing on their knowledge base (all the text they were fed) and using those prediction abilities to determine which token would be the most logical next step.
Ok - so how does an LLM get created?
Step 1: Acquiring Knowledge:
Basically, all of the publicly available sources of text on the internet are downloaded and filtered to get rid of undesirable content (malware, spam, etc). This text is broken down to its most basic form - just words in a file.
At this stage, duplicate content is removed, along with any “Personally Identifiable Information” (i.e. don’t worry - your social security number isn’t floating around in the final data set).
Just for context, the resulting data set can be around 45 terabytes (though this number is rapidly changing).
Ok so now we have a ton of information to input into the LLM - but this is where it gets a bit tricky. The LLM doesn’t read the raw text the way you and I do. Instead, the text needs to be converted into the “tokens” (numbers) we mentioned earlier. LLMs “think” in tokens - not words.
A token could be made up of a word, a piece of a word, or even just a character depending on the word and the tokenizer used. GPT-4, for example, has a vocabulary of about 100k tokens.
Sounds complicated - but really, it’s just the model's way of taking human language and turning it into a language it understands. For most applications, you really don’t need to worry about this at all - though understanding that LLMs operate from a different language can help you understand some of their limitations and quirks.
Summary so far - all the usable and safe content on the internet gets downloaded in a basic text format and then turned into tokens (number sequences) that the LLM can understand.
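To make this concrete, here’s a toy sketch of what tokenization does. The vocabulary and the greedy matching rule below are made up purely for illustration - real tokenizers (like GPT-4’s ~100k-token one) use a cleverer scheme called byte-pair encoding - but the core idea is the same: text goes in, a sequence of numbers comes out.

```python
# Made-up mini vocabulary: each known chunk of text maps to a token ID.
vocab = {"un": 1, "believ": 2, "able": 3, "!": 4}

def tokenize(text, vocab):
    """Greedily match the longest known chunk at each position."""
    tokens = []
    while text:
        for chunk in sorted(vocab, key=len, reverse=True):
            if text.startswith(chunk):
                tokens.append(vocab[chunk])
                text = text[len(chunk):]
                break
        else:
            raise ValueError(f"no token for: {text!r}")
    return tokens

print(tokenize("unbelievable!", vocab))  # -> [1, 2, 3, 4]
```

Notice that one word (“unbelievable”) became three tokens - which is why an LLM doesn’t really “see” words the way we do.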
Ok great! Now the model needs to use this data to actually become useful.
Time for, you guessed it … Neural Network Training! Which sounds pretty cool - but is actually pretty much just a lot of math.
Basically, the model needs to learn which token (word, part of word, or letter) is most likely to come next in a given sequence. There’s a whole bunch of complicated computer stuff that goes on here, but in essence, the system is using the data inputs and programmed parameters to get increasingly more accurate at predicting sequences of tokens.
Which means - it goes from writing gibberish to semi-coherent to fully natural sounding language, converting the tokens back into language that we understand.
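If you want a feel for what “predicting the next token” means, here’s a drastically simplified stand-in: a model that just counts which word follows which in a tiny made-up corpus and predicts the most frequent follower. Real LLMs learn these statistics with neural networks over trillions of tokens, but the goal is the same - given what came before, score every possible next token.

```python
from collections import Counter, defaultdict

# Tiny made-up "training data" - a real model sees trillions of tokens.
corpus = "the cat sat on the mat the cat ate the fish".split()

# Count, for each word, what tends to come next.
follower_counts = defaultdict(Counter)
for current, nxt in zip(corpus, corpus[1:]):
    follower_counts[current][nxt] += 1

def predict_next(word):
    """Return the most frequent next word seen in training."""
    return follower_counts[word].most_common(1)[0][0]

print(predict_next("the"))  # -> "cat" (it followed "the" twice; "mat" and "fish" only once)
```

Neural network training is, at heart, a much fancier version of this counting - with billions of adjustable parameters instead of a lookup table.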
Just a note that this is a super expensive process demanding huge processing power (this is one of the reasons why Nvidia, a key provider of this processing power, has hit a $4T market cap).
Anyways. At this stage, the LLM has converted all texts into tokens and used some complicated programmed parameters and lots of trial and error to become a master language prediction machine. In this form it is referred to as a “base model.”
Step 2: Post-Training
Sadly, the base model LLM isn’t actually that useful to the end user. It needs to be trained into an assistant!
There are specific guidelines set here - generally, the LLM needs to be trained to provide helpful, truthful, and harmless answers to questions.
In order to teach it, humans (and now other already-trained LLMs) come up with prompts and are asked to provide their ideal responses to the prompts. This information gets fed into the new LLM so that it starts to learn the right way to respond to questions.
This process is becoming increasingly automated, with humans there to do quality control and checks - but in the original LLMs, this required a massive amount of manpower to generate hundreds of thousands of prompts and responses to train the LLM.
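Just to make this tangible, here’s roughly what one of those training examples might look like (the exact format is invented for illustration - real labs use their own formats): a prompt plus the “ideal” response a human labeler wrote. Feed the model enough of these and it starts to learn the assistant’s tone and behavior.

```python
# A made-up example of a post-training prompt/response pair.
# Real datasets contain hundreds of thousands of these.
training_example = {
    "prompt": "What's the capital of France?",
    "ideal_response": "The capital of France is Paris.",
}

print(training_example["ideal_response"])
```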
There is another step of training here called Reinforcement Learning that basically goes a level deeper, teaching the LLM the right and wrong way to solve problems and understand human preferences - but we don’t need to go too far into the weeds there.
When that’s done, the LLM steps into its role as a helpful assistant. Armed with the text of the internet, the ability to search the web and use tools, and the processing power to analyze it all, it goes from a base model to an Assistant Model (e.g. ChatGPT).
So what happens when you’re talking to it?
- You give it a prompt (a “prefix of tokens”) - remember, you’ll prompt it in English, and it’ll convert that to tokens it understands.
- It uses its probability predictions to determine what the next token should be.
- A token is selected based on these probabilities and converted back into words - repeated token by token until you have your answer.
This explains why asking the same question twice to a chatbot like ChatGPT can yield different answers - it samples one of the more probable tokens rather than always picking the single most likely one, and feeds the result back to you.
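That sampling step can be sketched in a few lines. The candidate answers and their probabilities below are invented for illustration - the point is that the model rolls weighted dice over its options rather than always taking the top one, which is why the same prompt can produce different replies.

```python
import random

# Made-up candidate next tokens and their (invented) probabilities.
candidates = ["Paris", "France's capital", "the city of Paris"]
probabilities = [0.7, 0.2, 0.1]

# Sample three times - usually "Paris", but not always.
for _ in range(3):
    token = random.choices(candidates, weights=probabilities, k=1)[0]
    print(token)
```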
And there we have it! A verryyy simplified version of the LLM creation process. There’s obviously a lot more nuance than this, and the training and advancement of these models is continuous, but now you at least have a base level understanding of the magic behind the tools.
So what’s the deal with hallucinations?
To be blunt - LLMs sometimes just make shit up when asked about information that either isn’t in their training data (or they weren’t taught how to answer).
E.g. if an LLM was never given any examples of responding “I don’t know” when asked a question, it will do its best to synthesize something from the data it does have into an answer it thinks “sounds right.” (Like a person who can’t admit when they don’t know something, and instead confidently makes up something that sounds plausible.)
A few ways hallucination can be mitigated:
- Proper training (showing it that it can say “I’m sorry, I don’t have the answer”).
- Strong reinforcement training (teaching it to work through calculations and problems step by step, spreading its “thinking” across tokens).
- Giving it access to real-time information (the internet) and tools (code execution, etc.).
There are some other quirks in LLMs to be aware of (like how they’re not very good at counting or spelling, and have inconsistent common sense) - but just knowing to look out for these shortfalls will save you a lot of trouble.
They will continue to advance - but at this stage, fact-checking and critical thinking need to remain a big part of any interaction with an LLM.
Finally … do I need to worry about them becoming sentient?
In short, no. LLMs do not have any “knowledge of self” - they boot up, process tokens, and then shut off for each interaction - so at least for now, you are not under threat from a sentient LLM Agent. (Though I do always say please and thank you to them……..just in case).
That’s all for today - thanks for sticking with me. Shoot me a response if you made it this far, I’d love to hear what you thought, what questions you have, and if you now have a better understanding of LLMs!
I’ll be back in your inbox Thursday! Until then, have a killer week.