15 Comments
Feb 19 · Liked by Sebastian Raschka, PhD

Thanks for the great overview and summary of recent research papers, this is very useful.

While reading your summary on the Mixtral 8x7B Model, I noticed what I believe to be a mistake in your model size computations though.

You write:

"In total, Mixtral 8x7B comprises 47B parameters. This means that a Mistral 7B model has 9B non-feed-forward parameters" - clearly this cannot be true, right?

It seems you took the 56B parameters one might expect from the 8x7B model and subtracted the actual 47B parameters to arrive at the 9B number.

Unless I'm mistaken, the correct math should be as follows:

non-FF + FF = 7B

non-FF + 8*FF = 47B

Solving this, I arrive at ~1.3B non-feed-forward parameters and ~5.7B feed-forward parameters in a Mistral 7B model, which is consistent with the mentioned 47B total parameters for Mixtral 8x7B and the ~13B active parameters with two active experts.

author
Feb 19 · edited Feb 19

Ah yes, that's a good point. It should be 40/7 ≈ 5.71B parameters in the feed-forward layers of a non-MoE model. Consequently, the MoE model has 8 × 5.71B ≈ 45.7B parameters in the feed-forward layers and roughly 1.3B non-feed-forward parameters.
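For anyone who wants to double-check the arithmetic, here is the same back-of-the-envelope calculation as a short Python snippet. It is a simplified sketch that treats each model as non-FF plus per-expert FF parameters and ignores the small number of router/gating parameters:

```python
# Simplified parameter model:
#   Mistral 7B:    non_ff + 1 * ff = 7B
#   Mixtral 8x7B:  non_ff + 8 * ff = 47B
dense_total = 7.0   # Mistral 7B total parameters (billions)
moe_total = 47.0    # Mixtral 8x7B total parameters (billions)
n_experts = 8
n_active = 2        # experts routed per token

ff = (moe_total - dense_total) / (n_experts - 1)  # 40 / 7 ≈ 5.71B per expert
non_ff = dense_total - ff                         # ≈ 1.29B
active = non_ff + n_active * ff                   # ≈ 12.7B active parameters

print(f"FF per expert: {ff:.2f}B, non-FF: {non_ff:.2f}B, active: {active:.2f}B")
```

The resulting ~12.7B active parameters lines up with the ~13B figure usually quoted for two active experts per token.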

Apr 18 · Liked by Sebastian Raschka, PhD

In Section 2, Tuning Language Models by Proxy, you make the following comment under the subsection "Practical Considerations": "b) It's useful when the large base model (1) is a "black box", and its internal weights are inaccessible.

However, there's a catch: the smaller models must share the same vocabulary as the larger target model. (In theory, if someone knows the vocabulary of GPT-4 and can access its logit outputs, they could create specialized GPT-4 models using this method.)"

This comment reminded me of a result in March 2024, where a couple of teams were able to get the logits and the vocabulary of ChatGPT 3.5 via API calls (Please see: https://www.youtube.com/watch?v=O_eUzrFU6eQ or https://arxiv.org/abs/2403.09539)

I'm not sure if the ChatGPT 3.5 API has been changed since this result to prevent this access, but I wanted to share it because I thought it was relevant.

----

Anyways, thanks for the awesome newsletter! I always enjoy reading your insights into the sprawling field of AI research. Please keep up the good work!

author

You are right, knowing the tokenizer & vocab is the catch here. Whoa, super interesting paper btw, thanks for sharing. As a follow-up, I also recently saw this: "Stealing Part of a Production Language Model" https://arxiv.org/abs/2403.06634
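For readers wondering why the shared vocabulary is the hard requirement: proxy tuning operates purely on the output logits, roughly as in the sketch below (placeholder tensors rather than the paper's actual implementation; the point is that all three logit vectors must be indexed by the same vocabulary for the element-wise arithmetic to make sense):

```python
import torch

vocab_size = 32000  # assumed shared vocabulary size
logits_base = torch.randn(vocab_size)         # large "black box" base model
logits_expert = torch.randn(vocab_size)       # small tuned model
logits_anti_expert = torch.randn(vocab_size)  # small untuned model

# Proxy tuning: shift the base model's logits by the difference between
# the tuned and untuned small models, then decode as usual.
proxy_logits = logits_base + (logits_expert - logits_anti_expert)
probs = torch.softmax(proxy_logits, dim=-1)
next_token = torch.argmax(probs)
```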

Mar 31 · Liked by Sebastian Raschka, PhD

It's so wonderful!

author

Thanks!!

Feb 11 · Liked by Sebastian Raschka, PhD

Thank you a lot for what you do! Your blog posts are really amazing. You cover the most recent and significant papers in a way that's impactful. We need more blogs like this that focus on model merging, knowledge editing, and synthetic data generation 🙌 🙌

author

Glad to hear that this is informative! Thanks for the kind words!

Feb 12 · Liked by Sebastian Raschka, PhD

I'm interested in knowing whether we can apply model merging techniques to adapt LLMs to new low-resource languages. I plan to experiment with CALM for this purpose and will share the outcomes. However, I'd appreciate your insights. Are there any particular studies that focus on adapting large language models to new languages with limited resources?

author

I must admit that I really don't have any experience working with low-resource languages and don't want to give any bad advice here. But yeah, one bottleneck is that you still need a base LLM that has been trained on such a low-resource language. I was recently tinkering with Proxy-Tuning (https://lightning.ai/lightning-ai/studios/improve-llms-with-proxy-tuning), and it worked surprisingly well in terms of, e.g., transferring code capabilities. In any case, regarding low-resource languages, I recently saw this thesis on arXiv that could have some interesting pointers: https://arxiv.org/abs/2401.16582

Feb 10 · Liked by Sebastian Raschka, PhD

`For instance, an intriguing educational takeaway from the authors' training runs is that training the model for 3 epochs (instead of 1 epoch) on 1 trillion tokens is actually beneficial, despite contradicting the Chinchilla scaling laws`

Correct me if I'm wrong, but I don't think this contradicts the Chinchilla paper. It doesn't say that, for a larger compute budget, you get no improvement out of more tokens for a given model size; it just says that this isn't necessarily the optimal choice. Therefore, in the TinyLlama case, we would expect that for the same compute there is a better combination of model size and number of tokens (typically a larger model here, since they use over 1T tokens) that would lead to even better performance.

author

Actually, that's a good point, and I agree with you here. The scaling laws are more of a "best bang for the buck" guideline, and like you said, you should still expect (some) improvement. I will update that wording.
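To put rough numbers on that, here is a back-of-the-envelope comparison (assuming TinyLlama's ~1.1B parameters and the often-quoted ~20 tokens-per-parameter Chinchilla heuristic, neither of which comes from the discussion above):

```python
params = 1.1e9                   # TinyLlama parameter count (assumption)
chinchilla_tokens = 20 * params  # heuristic compute-optimal budget, ~22B tokens
tokens_seen = 3 * 1e12           # 3 epochs over a ~1T-token dataset

print(f"Heuristic compute-optimal: {chinchilla_tokens / 1e9:.0f}B tokens")
print(f"Actually trained on:       {tokens_seen / 1e9:.0f}B tokens")
# Training far past the compute-optimal point can still improve the model;
# the scaling laws only suggest the same compute might go further in a larger model.
```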

Feb 4 · Liked by Sebastian Raschka, PhD

Does the Model Ratatouille method touch on the concept of capsules discussed by Geoff Hinton, where there are capsules inside a NN, each capsule being an expert at a different task, which could improve generalization?

author

That's an interesting point. As far as I remember, they didn't mention Hinton's capsule idea. Personally, I think that capsule networks are also quite different, as they focus on spatial (geometric) relationships and feature hierarchies in vision models. It is possible, though, that Hinton's capsule networks (2017) were inspired by the early mixture-of-experts models that emerged in the 1990s.


Got it! They are slightly different. I’m always amazed by how the base theory was laid out decades ago. We are tweaking and building upon that.😃
