Thanks for the great overview and summary of recent research papers, this is very useful.
While reading your summary on the Mixtral 8x7B Model, I noticed what I believe to be a mistake in your model size computations though.
You write:
"In total, Mixtral 8x7B comprises 47B parameters. This means that a Mistral 7B model has 9B non-feed-forward parameters" - clearly this cannot be true, right?
It seems you took the 56B parameters one might expect from the 8x7B model and subtracted the actual 47B parameters to arrive at the 9B figure.
Unless I'm mistaken, the correct math should be as follows:
non-FF + FF = 7B
non-FF + 8*FF = 47B
Solving this, I arrive at ~1.3B non-feed-forward parameters and ~5.7B feed-forward parameters in a Mistral 7B model, which comes out to the mentioned 47B total parameters for Mixtral 8x7B and 13B active parameters for two active experts.
Ah yes, that's a good point. It should be 40/7 ≈ 5.71B parameters in the feed-forward layers of a non-MoE model. Consequently, the MoE model has 8 × 5.71B ≈ 45.7B parameters in the feed-forward layers and ~1.3B non-feed-forward parameters.
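In case it's useful for other readers, here is the same arithmetic as a small Python sanity check (using the simplified two-equation model from the comment above, i.e., assuming only the feed-forward blocks are replicated across experts):

```python
# Simplified parameter accounting (all numbers in billions):
#   non_ff + ff     = 7    (Mistral 7B: one feed-forward block)
#   non_ff + 8 * ff = 47   (Mixtral 8x7B: eight expert feed-forward blocks)
ff = (47 - 7) / 7       # 40/7 ≈ 5.71B feed-forward parameters per expert copy
non_ff = 7 - ff         # ≈ 1.29B non-feed-forward parameters (attention, embeddings, etc.)

total_moe = non_ff + 8 * ff   # ≈ 47B total parameters for Mixtral 8x7B
active = non_ff + 2 * ff      # ≈ 12.7B active parameters with 2 experts routed per token

print(f"FF: {ff:.2f}B, non-FF: {non_ff:.2f}B")
print(f"Mixtral total: {total_moe:.2f}B, active (2 experts): {active:.2f}B")
```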
In section "2. Tuning Language Models by Proxy", you make the following comment under the subsection "Practical Considerations": "b) It's useful when the large base model (1) is a "black box", and its internal weights are inaccessible.
However, there's a catch: the smaller models must share the same vocabulary as the larger target model. (In theory, if someone knows the vocabulary of GPT-4 and can access its logit outputs, they could create specialized GPT-4 models using this method.)"
This comment reminded me of a result from March 2024, where a couple of teams were able to get the logits and the vocabulary of ChatGPT 3.5 via API calls (see https://www.youtube.com/watch?v=O_eUzrFU6eQ or https://arxiv.org/abs/2403.09539).
I'm not sure if ChatGPT 3.5's API has been changed since this result to prevent this access, but I wanted to share it because I thought it was relevant.
----
Anyways, thanks for the awesome newsletter! I always enjoy reading your insights into the sprawling field of AI research. Please keep up the good work!
You are right, knowing the tokenizer & vocab is the catch here. Whoa, super interesting paper btw, thanks for sharing. As a follow-up, I also recently saw this: "Stealing Part of a Production Language Model" https://arxiv.org/abs/2403.06634
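In case it's helpful, the shared-vocabulary catch becomes clear when you write out the proxy-tuning logit arithmetic. Here's a rough sketch using random tensors as stand-ins for real model outputs:

```python
import torch

vocab_size = 32000  # all three models must share the same tokenizer/vocabulary

# Stand-ins for next-token logits from the three models involved:
logits_large_base  = torch.randn(vocab_size)  # large "black box" base model
logits_small_tuned = torch.randn(vocab_size)  # small fine-tuned model
logits_small_base  = torch.randn(vocab_size)  # small untuned base model

# Proxy-tuning: shift the large model's logits by the small models' tuning delta.
proxy_logits = logits_large_base + (logits_small_tuned - logits_small_base)
next_token_probs = torch.softmax(proxy_logits, dim=-1)

# This element-wise addition only makes sense if index i corresponds to the same
# token in all three vocabularies -- which is exactly the catch mentioned above.
```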
It's so wonderful!
Thanks!!
Thank you so much for what you do! Your blog posts are really amazing. You cover the most recent and significant papers in a way that's impactful. We need more blogs like this that focus on model merging, knowledge editing, and synthetic data generation 🙌 🙌
Glad to hear that this is informative! Thanks for the kind words!
I'm interested in knowing whether we can apply model merging techniques to adapt LLMs to new low-resource languages. I plan to experiment with CALM for this purpose and will share the outcomes. However, I'd appreciate your insights. Are there any particular studies that focus on adapting large language models to new languages with limited resources?
I must admit that I really don't have any experience working with low-resource languages and don't want to give any bad advice here. But yeah, one bottleneck is that you still need a base LLM that has been trained on such a low-resource language. I was recently tinkering with Proxy-Tuning (https://lightning.ai/lightning-ai/studios/improve-llms-with-proxy-tuning), and it worked surprisingly well in terms of, e.g., transferring code capabilities. In any case, regarding low-resource languages, I recently saw this thesis on arXiv that could have some interesting pointers: https://arxiv.org/abs/2401.16582
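To make the "model merging" part a bit more concrete, here is a minimal sketch of the simplest possible variant, plain weight averaging of two fine-tuned checkpoints that share an architecture (note that this is not CALM itself, which composes two models rather than merging their weights; it's just the most basic baseline):

```python
import torch
import torch.nn as nn

def average_merge(model_a: nn.Module, model_b: nn.Module, alpha: float = 0.5) -> dict:
    """Linearly interpolate the weights of two models with identical architectures."""
    state_a, state_b = model_a.state_dict(), model_b.state_dict()
    return {name: alpha * state_a[name] + (1 - alpha) * state_b[name] for name in state_a}

# Toy demo with two small modules standing in for fine-tuned LLM checkpoints:
model_a, model_b = nn.Linear(16, 16), nn.Linear(16, 16)
merged = nn.Linear(16, 16)
merged.load_state_dict(average_merge(model_a, model_b, alpha=0.5))
```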
`For instance, an intriguing educational takeaway from the authors' training runs is that training the model for 3 epochs (instead of 1 epoch) on 1 trillion tokens is actually beneficial, despite contradicting the Chinchilla scaling laws`
Correct me if I'm wrong, but I don't think this contradicts the Chinchilla paper. It doesn't say that for a larger compute budget you don't get improvement out of more tokens for a given model size; it just says that it's not necessarily the optimal choice. Therefore, in the TinyLlama case, we would expect that for a given compute budget there is a better combination of model size and number of tokens (typically a larger model here, since they use over 1T tokens) that would lead to even better performance.
Actually, that's a good point, and I agree with you here. The scaling laws are more of a "best bang for the buck" guideline, and like you said, you should still expect (some) improvement. I will update that wording.
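As a rough back-of-the-envelope illustration (assuming the common ~20 tokens-per-parameter Chinchilla heuristic and TinyLlama's ~1.1B parameter count, neither of which is from the article itself):

```python
# Back-of-the-envelope comparison (assumed numbers):
params = 1.1e9                    # TinyLlama has roughly 1.1B parameters
chinchilla_tokens = 20 * params   # ~20 tokens/parameter heuristic -> ~22B tokens
tokens_seen = 3 * 1.0e12          # 3 epochs over ~1T tokens -> ~3T tokens

print(f"Chinchilla-optimal tokens: ~{chinchilla_tokens / 1e9:.0f}B")
print(f"Tokens actually seen:      ~{tokens_seen / 1e12:.0f}T "
      f"({tokens_seen / chinchilla_tokens:.0f}x the 'optimal' amount)")
# Extra tokens still improve the fixed 1.1B model; Chinchilla just implies that, for
# the same compute, a larger model trained on fewer tokens would likely do better.
```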
Does the Model Ratatouille method touch on the concept of capsules discussed by Geoff Hinton, where there are capsules inside a NN, each capsule being an expert at a different task? Could this improve generalization?
That's an interesting point. As far as I remember, they didn't mention Hinton's capsule idea. Personally, I think that capsule networks are also quite different, as they focus on spatial (geometric) relationships and feature hierarchies in vision models. It could be possible, though, that Hinton's capsule networks (2017) were inspired by the early mixture-of-experts models that emerged in the 1990s.
Got it! They are slightly different. I’m always amazed by how the base theory was laid out decades ago. We are tweaking and building upon that.😃