This year has felt distinctly different. I've been working in, on, and with machine learning and AI for over a decade, yet I can't recall a time when these fields were as popular and rapidly evolving as they have been this year. To conclude an eventful 2023 in machine learning and AI research, I'm excited to share 10 noteworthy papers I've read this year. My personal focus has been more on large language models, so you'll find a heavier emphasis on large language model (LLM) papers than computer vision papers this year.
I'd add the recent Medprompt paper that demonstrated how effective prompting strategies can enable a generalized model like GPT-4 to outperform a specialized fine-tuned model such as Google's Med-PaLM https://arxiv.org/abs/2311.16452
It shows the potential we have yet to explore with such LLMs that can be applied to smaller models as well, substantially boosting their performance at a fraction of the size, cost, and latency.
Thanks for sharing! I wish there were more details available on GPT-4 and Med-PaLM 2 training data, architecture, and model sizes. I think it's going to continue to be a back-and-forth between finetuning and using larger general models.
On the Bloomberg piece... It was confusing to me why Option 3 was different from Option 5. I sense that I missed a key contrast, perhaps between full from-scratch training and fine-tuning. Good practical point about $100 versus $millions. 👍
PS: SUPER!!! Another most-excellent textbook from SR. I got it! Minor note... Your 45% discount was not accepted since Manning already discounts the ebook by 50%.
PPS: You are missing an opportunity with this new textbook. What about a chapter on 'Beyond Language To Multi-Modal'? The term LLM is aging; it should be LxM for both pretraining inputs and generative outputs.
Thanks for the feedback, Richard! The difference is that Option 3 is a more sequential procedure: you basically take a pretrained model (like the Llama 2 base model) and train it further on the domain-specific dataset. In Option 5, which Bloomberg used, you train on a mixed dataset from scratch.
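To make the contrast concrete, here is a minimal sketch of the two data setups (toy tensors and a stand-in module, not Bloomberg's actual pipeline):

```python
import torch
from torch import nn
from torch.utils.data import ConcatDataset, DataLoader, TensorDataset

VOCAB = 1000  # toy vocabulary size for this sketch

# Toy stand-ins for a general-purpose corpus and a domain-specific corpus of token IDs.
general_ds = TensorDataset(torch.randint(0, VOCAB, (512, 64)))
domain_ds = TensorDataset(torch.randint(0, VOCAB, (128, 64)))

# Option 3 (sequential): reuse an already-pretrained model and continue
# training it on the domain-specific data only (further pretraining).
pretrained_llm = nn.Embedding(VOCAB, 32)    # stand-in for, e.g., a Llama 2 base model
option3_loader = DataLoader(domain_ds, batch_size=8, shuffle=True)

# Option 5 (what Bloomberg did): initialize a fresh model and pretrain it
# from scratch on a mixture of general and domain data in a single run.
from_scratch_llm = nn.Embedding(VOCAB, 32)  # random init, no pretrained weights
option5_loader = DataLoader(ConcatDataset([general_ds, domain_ds]),
                            batch_size=8, shuffle=True)
```

The only real differences are which weights you start from and what the data loader mixes.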
Multimodal is an interesting topic but I’d say this is out of scope for this book. I have it on my idea list for potential bonus material or future editions though. Thanks for suggesting!
Good point regarding the coupon code. I think they currently have 50% off due to the holidays. That’s a nice thing though, so it’s even cheaper for interested readers.
Re: multimodal... Agree that full treatment is beyond scope, deserving a text of its own. But your existing/planned topics for this book must deal with LLM training on data other than language text. True? Recently I have been amazed by the combo ChatGPT-4 + DALL-E. Was GPT-4 trained with any image data? Or are these images all DALL-E (with good prompting)? Ugh... I just uploaded to ChatGPT-4 a clear MNIST image of '5'. Its response was: "It appears that you have uploaded an image of the number "5". This image shows the numeral in a handwritten style with a distinct cursive-like stroke, suggesting a personal touch or an individual style of writing. The number is in white against a black background, which provides a high contrast, making the number stand out clearly." How can this LLM do this if it was not trained on image data? Does the upload do special processing?
Unfortunately, I don't know exactly how GPT-4 was trained, but image support in LLMs is an interesting topic. The two general approaches are:
1) taking a pretrained LLM and retrofitting it via finetuning to also support images (e.g., LLaMA-Adapter V2: Parameter-Efficient Visual Instruction Model, https://arxiv.org/abs/2304.15010)
2) pretraining an LLM with text & image data; examples include LLaVA (https://magazine.sebastianraschka.com/p/ai-and-open-source-in-2023) and MiniGPT-V (https://github.com/Vision-CAIR/MiniGPT-4).
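To make approach (1) a bit more concrete, here is a minimal sketch of the general recipe: freeze the pretrained LLM and train only a small projection that maps image features into its embedding space. The dimensions and names are made up for illustration; this is not the actual LLaMA-Adapter code.

```python
import torch
from torch import nn

emb_dim = 128       # toy embedding size; real LLMs use something like 4096
img_feat_dim = 64   # toy size for the features from a pretrained image encoder

# Stand-in for the pretrained LLM backbone; it stays frozen during retrofitting.
llm_backbone = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=emb_dim, nhead=4, batch_first=True),
    num_layers=2,
)
for p in llm_backbone.parameters():
    p.requires_grad = False

# Only this small projection (plus any adapter layers) is trained.
img_proj = nn.Linear(img_feat_dim, emb_dim)

img_features = torch.randn(1, 16, img_feat_dim)  # 16 patch features from a frozen image encoder
text_embeds = torch.randn(1, 6, emb_dim)         # already-embedded text tokens

x = torch.cat([text_embeds, img_proj(img_features)], dim=1)  # (1, 22, emb_dim)
out = llm_backbone(x)                                        # (1, 22, emb_dim)
```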
But yeah, image support is out of scope for this book. The reason is that this book is about building an LLM from scratch. There will be 8 chapters, which build on each other, to implement everything from the ground up without using external libraries (except general frameworks like PyTorch).
Pretraining an LLM with image support requires an additional encoder for the image data, which would have to be carried through all the chapters and would make everything more complicated. Hence, I think it's a better topic for a future book, e.g., "Build a Multimodal LLM from Scratch".
WOW, this is getting complex! 😵 What happened to the simpler days of Deep Learning, when one balanced 5 levels of linear hidden layers?
Yes, I too miss those days!
Oh, and I forgot to mention Fuyu-8B (https://www.adept.ai/blog/fuyu-8b) which was released 2 months ago. Fuyu passes the input patches directly into a linear projection (or embedding layer) to learn its own image patch embeddings rather than relying on an additional pretrained image encoder like other models and methods do.
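Roughly, the Fuyu-style idea looks like this (a toy sketch with made-up sizes, not Fuyu's actual implementation): split the image into patches, flatten them, and feed them through a single trainable linear projection instead of a pretrained vision encoder.

```python
import torch
from torch import nn

patch = 16     # patch height/width in pixels
emb_dim = 128  # toy embedding size; the real model uses a much larger one

# Split a toy image into non-overlapping 16x16 patches and flatten each patch.
img = torch.randn(1, 3, 64, 64)                                # (batch, channels, H, W)
patches = img.unfold(2, patch, patch).unfold(3, patch, patch)  # (1, 3, 4, 4, 16, 16)
patches = patches.permute(0, 2, 3, 1, 4, 5).reshape(1, -1, 3 * patch * patch)  # (1, 16, 768)

# No pretrained image encoder: one linear projection maps raw patches
# directly into the LLM's embedding space and is learned during training.
patch_embed = nn.Linear(3 * patch * patch, emb_dim)
img_tokens = patch_embed(patches)                              # (1, 16, emb_dim)
```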
But yeah, it's still simpler to focus just on text for a from-scratch implementation. It will already be a pretty comprehensive book. Multimodal models are definitely interesting for future works!
This raises a deeper issue... Why can't we think of data from every modality as a sequence of tokens and of generation as predicting the next token? Tokenize speech? Tokenize videos? Tokenize CAT scans?
Does it make sense... Could/should we generalize the transformer architecture from 1D token sequences to nD ones?
That's a good question. This is actually already happening in a sense. I.e., if you take a word token, which is a 1D integer (via the vocabulary), you project it into, say, a 4096-dimensional vector.
You do the same for images, videos, audio, etc. I.e., in a multimodal LLM with image support, the image becomes a 4096-dimensional vector. If you have a sentence like "What does this image depict? [img]", where [img] is a placeholder for the image, then you have a 7x4096 input for the LLM. (For simplicity, I am assuming 1 word = 1 token, plus the ? symbol as its own token.)
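In code, that 7x4096 example would look roughly like this (toy token IDs, and I'm assuming the image has already been reduced to a single 4096-dimensional vector by some encoder or projection):

```python
import torch
from torch import nn

emb_dim = 4096     # embedding size used in the example above
vocab_size = 1000  # toy vocabulary for this sketch

tok_emb = nn.Embedding(vocab_size, emb_dim)

# "What does this image depict ?" -> 6 text tokens (made-up token IDs)
text_ids = torch.tensor([[11, 22, 33, 44, 55, 66]])
text_embeds = tok_emb(text_ids)               # (1, 6, 4096)

# The [img] placeholder is replaced by one 4096-dimensional vector
# coming from the image pathway of the multimodal model.
img_embed = torch.randn(1, 1, emb_dim)        # (1, 1, 4096)

llm_input = torch.cat([text_embeds, img_embed], dim=1)
print(llm_input.shape)                        # torch.Size([1, 7, 4096])
```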
Thanks for this; I especially appreciated the diagram for finetuning models on a domain-specific dataset. It would be great if you could expand on that a bit in your upcoming blogs. I see these models performing increasingly well on academic datasets, but I feel it's really limiting to use LLMs just through prompting to customize them for a domain-specific dataset. I am also reading your book (first 2 chapters) and enjoying it.
Thank you for all your generous contributions to my AI Learning Journey
Very insightful summary.
I was hoping that the Amber paper would make it, for the same reasons: releasing the weights, data, and methodology used to train the model.
Thanks, I was at NeurIPS when the LLM360 paper came out and only skimmed it, but it’s on my list for the Dec 2023 + Jan 2024 research papers selection!
FWIW, although Axis of Ordinary is the only daily Substack that I regularly read, Ahead of AI has been my favorite and only must-read Substack -- and I subscribe to over three dozen AI-related Substacks.
Keep up the great work in 2024!
Thanks a lot, I am very flattered that my Substack is one of your favorites :).
I’m curious what you think of the Mamba paper (https://arxiv.org/pdf/2312.00752.pdf) and how it stacks up. It is relatively new, but has shown potential for sub-quadratic scaling.
I find it super intriguing. I think previous state-space approaches like Hyena & HyenaDNA (https://arxiv.org/abs/2306.15794) were also hot topics earlier this year. It's hard to say what the adoption will be like in the next few years (vs transformers) but it's one of the exciting spaces to watch.
Hi Sebastian,
Thanks for a great post again! And Happy New Year 2024!
These are wonderful recommendations! I wonder where the LIMA paper ranks :)
Wish you a wonderful new year and thanks ever so much for all your work!
It was on my top 20 list, actually. If it wasn’t for Orca 2, I would have included it as an example of improving dataset quality.
Fair enough :) I'd actually love to see the 10 that didn't make the cut, considering that there was a lot of work on text-to-video, Gaussian splatting, and, on the SLM side, models like Phi-2.
If you are interested, here's the list of papers that originally came to mind but didn't make it into my top 10 😊
- The False Promise of Imitating Proprietary LLMs
- CodeLlama
- Tree of Thoughts: Deliberate Problem Solving with Large Language Models
- LIMA
- Simplifying Transformer Blocks
- Training Transformers with 4-bit Integers
- RLAIF: Scaling Reinforcement Learning from Human Feedback with AI Feedback
- RWKV: Reinventing RNNs for the Transformer Era
- Cramming: Training a Language Model on a Single GPU in One Day
- HyenaDNA: Long-Range Genomic Sequence Modeling at Single Nucleotide Resolution
- EfficientViT: Memory Efficient Vision Transformer with Cascaded Group Attention
- The Wisdom of Hindsight makes Language Models Better Instruction Followers
- Scaling Down to Scale Up: A Guide to Parameter-Efficient Fine-Tuning
- LLaMA-Adapter: Efficient Fine-tuning of LLaMA
- PaLM-E: An Embodied Multimodal Language Model
- StarCoder: may the source be with you!
- Hyena Hierarchy: Towards Larger Convolutional Language Models
- Getting ViT in Shape: Scaling Laws for Compute-Optimal Model Design
- Segment Everything Everywhere All at Once
- Consistency Models
- Scaling up GANs for Text-to-Image Synthesis
I didn't consider Phi-2 because it was more of a technical report, not a research paper. But Phi (Textbooks are All You Need) and Phi-1.5 (Textbooks are All You Need II) were actually interesting papers -- I discussed them in previous articles not too long ago.
I missed some of these papers, so this is a great list for me to go back to :)
Love your content, would you have any plans to post content on large multimodal models (LMMs) anytime in the near future?
Thanks! I may cover them in the future sometime, but I can't make any promises yet. (PS: I covered LMMs in the past as well, e.g., LLaMA-Adapter for retrofitting finetuned LLMs and Fuyu-8B.)
Are 7B language models in the middle of their own Moore's Law-esque curve with respect to performance? It seems like more and more, new foundation models are being trained up to 7B parameters - and outclassing 70B parameter models.
I'm guessing there are big resource limitations on how frequently you can train a 70B parameter model, which makes me think we'll see more efficiency gains applied at smaller sizes.
Yes, I think a lot of the innovation is happening via the 7B models. Besides computational resources, it could be that this size is simply closer to optimal with respect to scaling laws and the training data available to those who are training them. Another appealing fact about 7B models is that one can finetune them on a single GPU afterwards.
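As a rough back-of-the-envelope calculation (my own assumed numbers, ignoring activations and framework overhead): with AdamW, each trainable parameter costs about 2 bytes for bf16 weights, 2 for gradients, and 8 for the fp32 optimizer states, which is why full finetuning of a 7B model is painful on a single GPU while a parameter-efficient setup like LoRA is not:

```python
# Back-of-the-envelope memory estimate for finetuning a 7B-parameter model
# (assumed byte counts; ignores activations, KV caches, and framework overhead).
GB = 1024**3

n_params = 7e9
bf16_weights = n_params * 2                    # 2 bytes per parameter

# Full finetuning with AdamW: bf16 gradients + two fp32 optimizer states per parameter.
full_ft = bf16_weights + n_params * 2 + n_params * 8
print(f"full finetuning: ~{full_ft / GB:.0f} GB")   # ~78 GB

# LoRA-style finetuning: frozen bf16 weights + a small number of trainable parameters.
n_lora = 20e6                                  # assumed ~20M adapter parameters
lora_ft = bf16_weights + n_lora * (2 + 2 + 8)
print(f"LoRA finetuning: ~{lora_ft / GB:.0f} GB")    # ~13 GB
```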
I have developed a Kaggle notebook to learn TPU v3-8 + Kaggle + LLM red teaming for 20 hours/week for free. Running models on TPUs is super fast!
Try out the link & share: https://www.kaggle.com/code/jaycneo/gemma-tpu-llm-red-teaming-notebook-detoxio-ai/
Good points! When I started my career, we just used the term "machine learning" (which was, and is, the most popular way to implement what people coined "AI" in the ~1950s).
But yeah, using the term "artificial intelligence" when talking and writing about machine learning (and deep learning, i.e., machine learning with large neural networks) is still something that feels a bit weird to me.
As of last year, "AI" happens to be the term people use to refer to said methods. I usually don't think about the "intelligence" in "artificial intelligence" too hard. For me, AI is just a name or term. Kind of like Apple is the name of a company that has nothing to do with apples.