4 Comments
May 13 · Liked by Sebastian Raschka, PhD

Wow! Amazing article - clear that you put a lot of work into this. Subscribed!

author

Thanks for the kind words!

May 16 · Liked by Sebastian Raschka, PhD

That's a nice graph you put together. It's impressive what Mistral is doing, although it's probably a bit unfortunate for them that Llama 3 70B beats out their 8x22B (though the Mistral model remains faster).

Have you played with the OpenELM models? I spent three days ORPO fine-tuning them to make a video, and they were so bad I had to resort to just putting a few notes in my newsletter. It's pretty disappointing how poorly documented they are (no chat template, no GGUF support, no FlashAttention, no vLLM support).

author

Ouch, that sounds frustrating. I haven't really had a chance to do anything with the OpenELM models; I am mostly just using Llama 3. But it sounds like I am not missing too much 😅. On the other hand, I thought the OpenELM paper was great!
