That's a nice graph you put together. Impressive what Mistral are doing - although it's probably a bit unfortunate for them that Llama 3 70B beats out their 8x22B (though the 8x22B remains faster).
Have you played with the OpenELM models? I spent three days ORPO fine-tuning them to make a video, and they were so bad I had to resort to just putting a few notes in my newsletter. Pretty disappointing how poorly supported they are (no chat template, no GGUF support, no flash attention, no vLLM).
Ouch, that sounds frustrating. I haven't really had a chance to do anything with the OpenELM models ... I'm mostly just using Llama 3. But it sounds like I'm not missing too much 😅. That said, I thought the OpenELM paper was great!
Wow! Amazing article - clear that you put a lot of work into this. Subscribed!
Thanks for the kind words!