15 Comments
Feb 19 · Liked by Sebastian Raschka, PhD

Thanks for the great overview and summary of recent research papers, this is very useful.

While reading your summary on the Mixtral 8x7B Model, I noticed what I believe to be a mistake in your model size computations though.

You write:

"In total, Mixtral 8x7B comprises 47B parameters. This means that a Mistral 7B model has 9B non-feed-forward parameters" - clearly this cannot be true, right?

It seems you took the 56B parameters one might expect from the 8x7B model and deducted the actual 47B parameters to arrive at the 9B number.

Unless I'm mistaken, the correct math should be as follows:

non-FF + FF = 7B

non-FF + 8*FF = 47B

Solving this, I arrive at ~1.3B non-feed-forward parameters and ~5.7B feed-forward parameters in a Mistral 7B model, which is consistent with the mentioned 47B total parameters for Mixtral 8x7B and the ~13B active parameters with two active experts.
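A quick numerical check of this system (a rough sketch; the headline 7B and 47B figures are rounded, so the per-component counts are approximate):

```python
# Solve: non_ff + ff = 7      (Mistral 7B, in billions)
#        non_ff + 8 * ff = 47 (Mixtral 8x7B, in billions)
ff = (47 - 7) / 7                      # ~5.7B feed-forward (expert) parameters
non_ff = 7 - ff                        # ~1.3B non-feed-forward parameters

total_mixtral = non_ff + 8 * ff        # ~47B total parameters
active_two_experts = non_ff + 2 * ff   # ~12.7B parameters active per token

print(f"ff={ff:.2f}B, non_ff={non_ff:.2f}B, "
      f"total={total_mixtral:.1f}B, active={active_two_experts:.1f}B")
```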

Apr 18 · Liked by Sebastian Raschka, PhD

In section 2. Tuning Language Models by Proxy, you make the comment under subsection "Practical Considerations": "b) It's useful when the large base model (1) is a "black box", and its internal weights are inaccessible.

However, there's a catch: the smaller models must share the same vocabulary as the larger target model. (In theory, if someone knows the vocabulary of GPT-4 and can access its logit outputs, they could create specialized GPT-4 models using this method.)"
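For readers less familiar with the method, the catch follows from how proxy tuning combines models at decoding time: the large model's next-token logits are shifted by the difference between a small tuned "expert" and the same small model before tuning, which only makes sense if all three models share one vocabulary. Below is a minimal sketch of that logit arithmetic, with toy arrays standing in for real model outputs (names and sizes are illustrative):

```python
import numpy as np

def proxy_tuned_logits(base, expert, anti_expert):
    """Proxy tuning: shift the big model's next-token logits by the
    difference between a tuned small model and its untuned counterpart."""
    return base + (expert - anti_expert)

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

# Toy stand-ins for one decoding step over a shared 8-token vocabulary;
# in practice these come from three forward passes on the same prefix.
rng = np.random.default_rng(0)
base = rng.normal(size=8)         # large "black box" model (logits only)
expert = rng.normal(size=8)       # small model after fine-tuning
anti_expert = rng.normal(size=8)  # same small model before fine-tuning

probs = softmax(proxy_tuned_logits(base, expert, anti_expert))
next_token_id = int(np.argmax(probs))
```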

This comment reminded me of a result in March 2024, where a couple of teams were able to get the logits and the vocabulary of ChatGPT 3.5 via API calls (Please see: https://www.youtube.com/watch?v=O_eUzrFU6eQ or https://arxiv.org/abs/2403.09539)

I'm not sure whether the ChatGPT 3.5 API has since been changed to prevent this kind of access, but I wanted to share it because I thought it was relevant.

----

Anyways, thanks for the awesome newsletter! I always enjoy reading your insights into the sprawling field of AI research. Please keep up the good work!

Mar 31 · Liked by Sebastian Raschka, PhD

It's so wonderful!

Feb 11 · Liked by Sebastian Raschka, PhD

Thank you a lot for what you do! Your blog posts are really amazing. You cover the most recent and significant papers in a way that's impactful. We need more blogs like this that focus on model merging, knowledge editing, and synthetic data generation 🙌 🙌

Feb 10 · Liked by Sebastian Raschka, PhD

`For instance, an intriguing educational takeaway from the authors' training runs is that training the model for 3 epochs (instead of 1 epoch) on 1 trillion tokens is actually beneficial, despite contradicting the Chinchilla scaling laws`

Correct me if I'm wrong, but I don't think this contradicts the Chinchilla paper. It doesn't say that, for a given model size, a larger compute budget and more tokens stop yielding improvements; it just says that this isn't necessarily the compute-optimal choice. So in the TinyLlama case, we would expect that for the same compute there is a better combination of model size and token count (typically a larger model here, since they use well over 1T tokens) that would lead to even better performance.
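To make that concrete, here is a rough back-of-the-envelope sketch. It relies on two common approximations rather than the paper's fitted law (training compute C ≈ 6·N·D and the ~20-tokens-per-parameter rule of thumb) and assumes TinyLlama's roughly 1.1B parameters with ~3T tokens seen across the 3 epochs:

```python
# Back-of-the-envelope, not the Chinchilla paper's exact fitted law.
N = 1.1e9             # assumed TinyLlama parameter count
D = 3e12              # assumed tokens seen (3 epochs x 1T)
C = 6 * N * D         # approximate training FLOPs

# For the same compute budget, the heuristic compute-optimal allocation:
N_opt = (C / (6 * 20)) ** 0.5   # solve C = 6 * N * (20 * N) for N
D_opt = 20 * N_opt

print(f"compute budget:    {C:.2e} FLOPs")
print(f"heuristic optimum: {N_opt/1e9:.1f}B params on {D_opt/1e12:.2f}T tokens")
# -> roughly a ~13B-parameter model on ~0.26T tokens, i.e. the heuristic says
#    a bigger model would use this budget better, not that more tokens hurt.
```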

Feb 4 · Liked by Sebastian Raschka, PhD

Does the model ratatouille method touch on the concept of capsules discussed by Geoff Hinton, where there are capsules inside a NN, each capsule being an expert at a different task, which could improve generalization?
