32 Comments
Dec 30, 2023 · Liked by Sebastian Raschka, PhD

I'd add the recent Medprompt paper that demonstrated how effective prompting strategies can enable a generalized model like GPT-4 to outperform a specialized fine-tuned model such as Google's Med-PaLM https://arxiv.org/abs/2311.16452

It shows the potential we have yet to explore with such LLMs that can be applied to smaller models as well, substantially boosting their performance at a fraction of the size, cost, and latency.

author

Thanks for sharing! I wish there were more details available on GPT-4 and Med-PaLM 2 training data, architecture, and model sizes. I think it's going to continue to be a back-and-forth between finetuning and using larger general models.

Dec 30, 2023 · Liked by Sebastian Raschka, PhD

On the Bloomberg piece... It was confusing to me why Option 3 is different from Option 5. I sense that I missed a key contrast, perhaps between full from-scratch training and fine-tuning. Good practical point about $100 versus millions of dollars. 👍

PS: SUPER!!! Another most-excellent textbook from SR. I got it! Minor note... Your 45% discount was not accepted since Manning already discounts the ebook by 50%.

PPS: You are missing an opportunity with this new textbook. What about a chapter on 'Beyond Language to Multi-Modal'? The term LLM is aging; it should be LxM for both pretraining inputs and generative outputs.

author

Thanks for the feedback, Richard! The difference is that Option 3 is a more sequential procedure: you basically take a pretrained model (like the Llama 2 base model) and train it further on the domain-specific dataset. In Option 5, which Bloomberg used, you train from scratch on a mixed dataset.

Multimodal is an interesting topic but I’d say this is out of scope for this book. I have it on my idea list for potential bonus material or future editions though. Thanks for suggesting!

Good point regarding the coupon code. I think they currently have a 50% off due to the holidays. That’s a nice thing though, so it’s even cheaper for interested readers.

Dec 30, 2023 · Liked by Sebastian Raschka, PhD

Re: multimodal... Agree that a full treatment is beyond scope, deserving a text of its own. But your existing/planned topics for this book must deal with LLM training on data other than language text. True? Recently I have been amazed by the combo ChatGPT-4 + DALL·E. Was GPT-4 trained with any image data? Or are these images all DALL·E (with good prompting)? Ugh... I just UPLOADed to ChatGPT-4 a clear MNIST image of '5'. Its response was: "It appears that you have uploaded an image of the number "5". This image shows the numeral in a handwritten style with a distinct cursive-like stroke, suggesting a personal touch or an individual style of writing. The number is in white against a black background, which provides a high contrast, making the number stand out clearly." How can this LLM do this if it was not trained on image data? Does UPLOAD do special processing?

author

Unfortunately, I don't know how GPT-4 was trained exactly, but image support in LLMs is an interesting topic. The two general approaches are

1) Taking a pretrained LLM and retrofitting it via finetuning to also support images (e.g., LLaMA-Adapter V2: Parameter-Efficient Visual Instruction Model, https://arxiv.org/abs/2304.15010)

2) Pretraining an LLM with text & image data. Examples include LLaVA (https://magazine.sebastianraschka.com/p/ai-and-open-source-in-2023) and MiniGPT-4 (https://github.com/Vision-CAIR/MiniGPT-4).

But yeah, image support is out of scope for this book. The reason is that this book is about building an LLM from scratch. There will be 8 chapters, which build on each other, to implement everything from the ground up without using external libraries (except general frameworks like PyTorch).

Pretraining an LLM with image support requires an additional encoder for the image data, which would have to be carried through all the chapters and would make everything more complicated. Hence, I think it's a better topic for a future book, e.g., "Build a Multimodal LLM from Scratch".

Dec 30, 2023 · Liked by Sebastian Raschka, PhD

WOW, this is getting complex! 😵 What happened to the simpler days of deep learning, when one balanced 5 levels of linear hidden layers?

author

Yes, I too miss those days!

author

Oh, and I forgot to mention Fuyu-8B (https://www.adept.ai/blog/fuyu-8b), which was released 2 months ago. Fuyu passes the input image patches directly through a linear projection (or embedding layer) to learn its own image patch embeddings, rather than relying on an additional pretrained image encoder like other models and methods do.
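In case it helps to picture it, here is a minimal PyTorch sketch of that idea. The patch size, image size, and embedding dimension below are made-up values for illustration, not the actual Fuyu-8B configuration:

```python
import torch
import torch.nn as nn

# Fuyu-style idea (sketch): cut the image into patches and map each flattened
# patch to the LLM's embedding dimension with a single linear layer, with no
# separate pretrained vision encoder involved.

patch_size, embed_dim = 30, 4096                       # assumed values
patch_proj = nn.Linear(3 * patch_size * patch_size, embed_dim)

image = torch.rand(1, 3, 300, 300)                     # dummy RGB image

# Split into non-overlapping patches and flatten each one
patches = (
    image.unfold(2, patch_size, patch_size)            # (1, 3, 10, 300, 30)
         .unfold(3, patch_size, patch_size)            # (1, 3, 10, 10, 30, 30)
         .permute(0, 2, 3, 1, 4, 5)                    # (1, 10, 10, 3, 30, 30)
         .reshape(1, -1, 3 * patch_size * patch_size)  # (1, 100, 2700)
)

image_embeddings = patch_proj(patches)                 # (1, 100, 4096)
print(image_embeddings.shape)
```

The point of the sketch is that the only image-specific component is one trainable linear layer; the resulting patch embeddings can then be fed into the regular LLM alongside text token embeddings.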

But yeah, it's still simpler to focus just on text for a from-scratch implementation. It will already be a pretty comprehensive book. Multimodal models are definitely interesting for future works!

This raises a deeper issue... Why can't you think of all modalities of data as sequences of tokens, and of generation as predicting the next token? Tokenize speech? Tokenize videos? Tokenize CAT scans?

Does it make sense... Could/should we generalize transformer architecture from 1D token sequences to nD ones?

author

That's a good question. This is actually already happening in a sense. I.e., if you take a word token, a single integer ID (via the vocabulary), you project it into, say, a 4096-dimensional vector.

You do the same for images, videos, audio, etc. I.e., in a multimodal LLM with image support, the image also becomes a 4096-dimensional vector. If you have a sentence like "What does this image depict? [img]", where [img] is a placeholder for the image, then you have a 7x4096 input for the LLM. (For simplicity, I am assuming that 1 word = 1 token, plus the ? symbol as its own token.)
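To make that concrete, here is a minimal PyTorch sketch; the vocabulary size, the image-feature size, and the token IDs are made-up placeholders:

```python
import torch
import torch.nn as nn

embed_dim = 4096
token_embedding = nn.Embedding(32000, embed_dim)  # assumed vocabulary size
image_projection = nn.Linear(512, embed_dim)      # assumed image-feature size

# "What does this image depict ?" -> 6 token IDs (dummy IDs for illustration)
text_ids = torch.tensor([[312, 845, 229, 1077, 5123, 29973]])
text_embeds = token_embedding(text_ids)           # (1, 6, 4096)

image_feature = torch.rand(1, 1, 512)             # placeholder image feature
image_embed = image_projection(image_feature)     # (1, 1, 4096)

# The LLM then receives a 7x4096 sequence: 6 text "tokens" + 1 image "token"
llm_input = torch.cat([text_embeds, image_embed], dim=1)
print(llm_input.shape)  # torch.Size([1, 7, 4096])
```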

Dec 30, 2023 · Liked by Sebastian Raschka, PhD

Thanks for this; I especially appreciated the diagram for fine-tuning models on a domain-specific dataset. It would be great if you could expand on that a bit in your upcoming posts. I see these models performing increasingly well on academic datasets, but I feel it's really limiting to customize for a domain-specific dataset by using LLMs just through prompting. I am also reading your book (first 2 chapters) and enjoying it.

Dec 30, 2023 · Liked by Sebastian Raschka, PhD

Thank you for all your generous contributions to my AI Learning Journey

Dec 30, 2023 · Liked by Sebastian Raschka, PhD

Very insightful summary.

I was hoping that the Amber paper would make it, for the same reasons: releasing the weights, data, and methodology used to train the model.

author

Thanks, I was at NeurIPS when the LLM360 paper came out and only skimmed it, but it's on my list for the Dec 2023 + Jan 2024 research paper selection!

Dec 31, 2023 · Liked by Sebastian Raschka, PhD

FWIW, although Axis of Ordinary is the only daily Substack that I regularly read, Ahead of AI has been my favorite and only must-read Substack -- and I subscribe to over three dozen AI-related Substacks.

Keep up the great work in 2024!

author

Thanks a lot, I am very flattered that my Substack is one of your favorites :).

Dec 31, 2023 · Liked by Sebastian Raschka, PhD

I’m curious what you think of the Mamba paper (https://arxiv.org/pdf/2312.00752.pdf) and how it stacks up. It is relatively new but has shown potential for sub-quadratic scaling.

author

I find it super intriguing. I think previous state-space approaches like Hyena & HyenaDNA (https://arxiv.org/abs/2306.15794) were also hot topics earlier this year. It's hard to say what the adoption will be like in the next few years (vs. transformers), but it's one of the exciting spaces to watch.

Dec 31, 2023 · Liked by Sebastian Raschka, PhD

Hi Sebastian,

Thanks for a great post again! And Happy new year 2024!

Dec 30, 2023 · Liked by Sebastian Raschka, PhD

These are wonderful recommendations! I wonder where the LIMA paper ranks :)

Wish you a wonderful new year and thanks ever so much for all your work!

author

It was on my top-20 list, actually. If it weren't for Orca 2, I would have included it as an example of improving dataset quality.

Dec 30, 2023 · Liked by Sebastian Raschka, PhD

Fair enough :) I'd actually love to see the 10 that didn't make the cut, considering that there was a lot of work on text-to-video, Gaussian splatting, and, on the SLM side, models like Phi-2.

author

If you are interested, here's the list of 20 papers that originally came to mind (those that didn't make it into my top 10) 😊

- The False Promise of Imitating Proprietary LLMs

- CodeLlama

- Tree of Thoughts: Deliberate Problem Solving with Large Language Models

- LIMA

- Simplifying Transformer Blocks

- Training Transformers with 4-bit Integers

- RLAIF: Scaling Reinforcement Learning from Human Feedback with AI Feedback

- RWKV: Reinventing RNNs for the Transformer Era

- Cramming: Training a Language Model on a Single GPU in One Day

- HyenaDNA: Long-Range Genomic Sequence Modeling at Single Nucleotide Resolution

- EfficientViT: Memory Efficient Vision Transformer with Cascaded Group Attention

- The Wisdom of Hindsight makes Language Models Better Instruction Followers

- Scaling Down to Scale Up: A Guide to Parameter-Efficient Fine-Tuning

- LLaMA-Adapter: Efficient Fine-tuning of LLaMA

- PaLM-E: An Embodied Multimodal Language Model

- StarCoder: may the source be with you!

- Hyena Hierarchy: Towards Larger Convolutional Language Models

- Getting ViT in Shape: Scaling Laws for Compute-Optimal Model Design

- Segment Everything Everywhere All at Once

- Consistency Models

- Scaling up GANs for Text-to-Image Synthesis

I didn't consider Phi-2 because it was more of a technical report than a research paper. But Phi-1 (Textbooks Are All You Need) and Phi-1.5 (Textbooks Are All You Need II) were actually interesting papers -- I discussed them in previous articles not too long ago.

I missed some of these papers, so this is a great list for me to go back to :)

Dec 31, 2023 · Liked by Sebastian Raschka, PhD

Dr. Raschka,

Have you ever had this discussion? Back in the early 1990s, I worked with a few thousand other engineers and scientists within the DoD and DOE laboratory systems. We tried to change people’s minds within our community about one little thing, but we could not overcome the great weight of the uneducated but very greedy entrepreneurs who were in love with the term “Artificial Intelligence” (AI). AI sold programs. AI sucked in the investors. But there is no such thing as AI and never will be. The best that our sciences and engineering will ever do is to Mimic Intelligence (MI). The associative engine that is the brain spews out thoughts that can only be mimicked by the best of our code writers. The papers that you provided are wonderful only insofar as their authors were able to capture and articulate the intellectual products of their own minds, i.e., real intelligence.

In all of my years of work, I never met anyone who wanted to use the term MI instead of AI even though they knew that AI was a myth. Are all of us in the scientific world so greedy that we are willing to put belief systems first even when the facts are glaringly obvious? We do all the brilliant people who have the brilliant thoughts an injustice when we lead our users into believing that the codes are intelligent, even artificially.

All the best,

David

author

Good points! When I started my career, we just used the term "machine learning" (which was, and is, the most popular way to implement what people coined "AI" in the ~1950s).

But yeah, using the term "artificial intelligence" when talking and writing about machine learning (and deep learning, i.e., machine learning with large neural networks) is still something that feels a bit weird to me.

As of last year, "AI" happens to be the term people use to refer to said methods. I usually don't think about the "intelligence" in "artificial intelligence" too hard. For me, AI is just a name or term. Kind of like Apple is the name of a company that has nothing to do with apples.

Jan 2 · Liked by Sebastian Raschka, PhD

Love your content! Do you have any plans to post content on large multimodal models (LMMs) anytime in the near future?

author

Thanks! I may cover them sometime in the future but can't make any promises yet. (PS: I covered LMMs in the past as well, e.g., LLaMA-Adapter for retrofitting LLMs via finetuning, and Fuyu-8B.)

Jan 1 · Liked by Sebastian Raschka, PhD

Are 7B language models in the middle of their own Moore's Law-esque curve with respect to performance? It seems like more and more, new foundation models are being trained up to 7B parameters - and outclassing 70B parameter models.

I'm guessing there are big resource limitations on how frequently you can train a 70B parameter model, which makes me think we'll see more efficiency gains applied at smaller sizes.

author

Yes, I think a lot of the innovation is happening via the 7B models. Besides computational resources, it could be that this size is just more optimal with respect to scaling laws and the training data available to those who are training them. Another appealing aspect of 7B models is that one can finetune them on a single GPU afterwards.

I have developed a Kaggle notebook to learn TPU v3-8 + Kaggle + LLM red teaming for 20 hours/week for free. Running models on TPUs is super fast!!!

Try out the link & share - https://www.kaggle.com/code/jaycneo/gemma-tpu-llm-red-teaming-notebook-detoxio-ai/
