Several people asked me to dive a bit deeper into large language model (LLM) jargon and explain some of the more technical terms we nowadays take for granted. This includes references to "encoder-style" and "decoder-style" LLMs. What do these terms mean?
This LLM encoder/decoder stuff messes with my mind! There is something fundamental here that I'm not getting. HELP... 🤔 I have been fascinated with autoencoders, which take an example from feature space and ENCODE it into a point in latent space, then DECODE it back into a reconstructed example in feature space, thus allowing a reconstruction loss to be calculated. [ref: Python ML 3Ed, Chap 17]
1) Should LLM decoders be called 'generators' like in GANs?
2) That single line that connects an LLM's encoder to its decoder... Is that the same data that one receives as an embedding from the LLM API?
3) For a decoder-only LLM, is its input always an embedding vector? Or, where do the model weights come from?
4) Is it possible to take an LLM embedding, reconstruct its initial input, and calculate the reconstruction loss? If so, this would enable us to map the fine (manifold) structures in these mysterious LLM latent spaces. Loved your old examples of putting/removing smiles on celebrity faces. I'd love to find a few hallucinations lurking in LLM latent spaces! 😮
Hey Richard, this is a good point:
> 1) Should LLM decoders be called 'generators' like in GANs?
You could think of them like that. That's also why people refer to LLMs as a form of "generative AI". But I think the term "decoder" is quite fitting here. E.g., if you think of autoencoders, there's an encoder and a decoder module: the encoder encodes given inputs into an embedding space, and the decoder maps them back into the original space. It's similar for the original transformer architecture, and decoder-only LLMs are essentially built from that decoder module.
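To make the analogy concrete, here is a minimal autoencoder sketch in PyTorch; the layer sizes are made up for illustration:

```python
import torch
import torch.nn as nn

# Minimal autoencoder: the encoder maps an input to a point in latent
# space; the decoder maps that point back to the original feature space.
class AutoEncoder(nn.Module):
    def __init__(self, num_features=784, latent_dim=32):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(num_features, 128), nn.ReLU(),
            nn.Linear(128, latent_dim))
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 128), nn.ReLU(),
            nn.Linear(128, num_features))

    def forward(self, x):
        z = self.encoder(x)       # point in latent space
        return self.decoder(z)    # reconstruction in feature space

model = AutoEncoder()
x = torch.rand(16, 784)           # a toy batch
x_hat = model(x)
loss = nn.functional.mse_loss(x_hat, x)  # reconstruction loss
```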
> 2) That single line that connects an LLM's encoder to its decoder... Is that the same data that one receives as an embedding from the LLM API?
It depends on the API, of course, but for GPT-4-like LLMs I think that would be more like the output of the last layer, right before the embeddings are converted back into words.
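Roughly sketched (the sizes below are GPT-2-like and purely illustrative, not any specific API's internals):

```python
import torch
import torch.nn as nn

vocab_size, emb_dim = 50257, 768     # GPT-2-style sizes, for illustration

# Stand-in for the transformer body (token embeddings + attention blocks);
# pretend these are the final hidden states for a 10-token input:
hidden = torch.rand(1, 10, emb_dim)

# The output head projects each hidden state to logits over the vocabulary:
out_head = nn.Linear(emb_dim, vocab_size, bias=False)
logits = out_head(hidden)            # shape: (1, 10, 50257)

# An embeddings API would typically return something like `hidden`
# (often just the last token's vector), not `logits`:
embedding = hidden[:, -1, :]         # shape: (1, 768)
```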
> 3) For a decoder-only LLM, is its input always an embedding vector? Or, where do the model weights come from?
Yes, but the embedding layer is only one of many layers containing weights. There are also many fully connected (i.e., multilayer perceptron) layers, plus the self-attention modules themselves.
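As a rough sketch of where those weights sit, using PyTorch built-ins and illustrative GPT-2-like sizes (a real model stacks many such blocks):

```python
import torch.nn as nn

vocab_size, emb_dim, num_heads = 50257, 768, 12   # illustrative sizes

# The learned weights are spread across (at least) these layer types:
tok_emb = nn.Embedding(vocab_size, emb_dim)        # embedding layer
attn = nn.MultiheadAttention(emb_dim, num_heads)   # self-attention weights
mlp = nn.Sequential(                               # fully connected (MLP)
    nn.Linear(emb_dim, 4 * emb_dim),               # layers
    nn.GELU(),
    nn.Linear(4 * emb_dim, emb_dim))

for name, module in [("embedding", tok_emb), ("attention", attn), ("mlp", mlp)]:
    n = sum(p.numel() for p in module.parameters())
    print(f"{name}: {n:,} parameters")
```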
> 4) Is it possible to take an LLM embedding, reconstruct its initial input, and calculate the reconstruction loss?
Kind of, yes! If these embeddings are the outputs of the final layer, then they are the logits for each word. If you then apply a softmax function, you can think of them as class-membership probabilities for each word. To calculate the cross-entropy loss, though, you'd need to know the vocabulary that was used to train the LLM.
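A minimal sketch of that last step, assuming you have the final-layer logits and know the right vocabulary size:

```python
import torch
import torch.nn.functional as F

vocab_size = 50257                      # you'd need the LLM's actual vocabulary
logits = torch.rand(5, vocab_size)      # final-layer outputs for 5 positions
probs = F.softmax(logits, dim=-1)       # class-membership probabilities

# Given the true token ids (which requires knowing the tokenizer/vocabulary),
# you can compute the cross-entropy "reconstruction" loss:
targets = torch.randint(0, vocab_size, (5,))
loss = F.cross_entropy(logits, targets)  # note: expects logits, not probs
```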
Actually, all these points will be answered in my "Build a Large Language Model From Scratch" book where I implement and explain each part step-by-step so that you can follow the data through the LLM exactly. (Nothing against using LLM APIs or libraries -- they are actually great for productivity. But they abstract so much away that LLMs become black boxes; I'm hoping to address that in my book.)
I agree the term encoder/decoder is overloaded, since almost all architectures essentially perform encoding/decoding as a function. Engineers aren't good at naming things, and not only variables, after all 🤣
Thanks a lot for this. I'm actually at the point where you left off in your comment below: I'm using an open-source API layer on top of GPT to piece together how it all works, and to get some short-term gratification by building my own components on top of it.
But I got this far without even knowing that GPT is decoder-only, until today!
My first steps into machine learning were with the encoder-decoder architecture of face-swapping models, so I'd assumed LLMs were built with the same architecture.
Nice, glad this was useful! And yeah, it's fascinating that a decoder-only architecture works so well, even for Seq2Seq tasks like language translation, for which people originally employed encoder-decoder models.
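For example, with a decoder-only model, translation becomes plain next-token prediction once you frame it as a prompt (the format below is hypothetical):

```python
# A Seq2Seq task recast as next-token prediction for a decoder-only model:
prompt = "Translate German to English:\nGerman: Es regnet.\nEnglish:"
# model.generate(tokenize(prompt))  ->  " It is raining."
```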