This year has felt distinctly different. I've been working in, on, and with machine learning and AI for over a decade, yet I can't recall a time when these fields were as popular and rapidly evolving as they have been this year. To conclude an eventful 2023 in machine learning and AI research, I'm excited to share 10 noteworthy papers I've read this year. My personal focus has been more on large language models, so you'll find a heavier emphasis on large language model (LLM) papers than on computer vision papers this year.
I'd add the recent Medprompt paper, which demonstrated how effective prompting strategies can enable a generalist model like GPT-4 to outperform a specialized fine-tuned model such as Google's Med-PaLM (https://arxiv.org/abs/2311.16452).
It shows the potential we have yet to explore with such LLMs; these prompting strategies can be applied to smaller models as well, substantially boosting their performance at a fraction of the size, cost, and latency.
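One of the Medprompt ingredients the comment alludes to is choice-shuffle ensembling: ask the model the same multiple-choice question several times with the answer options shuffled, map each reply back to the original ordering, and take a majority vote. A minimal sketch of that idea, where `query_model` is a hypothetical stand-in for a real LLM API call:

```python
import random
from collections import Counter

def query_model(question: str, options: list[str]) -> int:
    # Placeholder: a real implementation would prompt an LLM and parse
    # the index of its chosen option. To keep the sketch runnable, this
    # stub deterministically "prefers" the alphabetically first option.
    return options.index(min(options))

def choice_shuffle_ensemble(question, options, n_rounds=5, seed=0):
    rng = random.Random(seed)
    votes = Counter()
    for _ in range(n_rounds):
        order = list(range(len(options)))
        rng.shuffle(order)
        shuffled = [options[i] for i in order]
        picked = query_model(question, shuffled)
        votes[order[picked]] += 1  # map the pick back to the original index
    # Return the original option index that won the majority vote
    return votes.most_common(1)[0][0]

answer = choice_shuffle_ensemble(
    "Which drug class lowers LDL cholesterol?",
    ["aspirin", "statins", "opioids"],
)
```

The shuffling counters positional bias (models favoring, say, option "A"), and the vote aggregates away some of the per-sample noise; the full Medprompt recipe also adds kNN-retrieved few-shot examples and chain-of-thought prompting.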
On the Bloomberg piece... It was confusing to me why Option 3 was different from Option 5. I sense that I missed a key contrast, perhaps between full from-scratch training and fine-tuning. Good practical point about $100 versus $millions. 👍
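The "$100 versus $millions" contrast follows from back-of-envelope arithmetic: pretraining from scratch pushes trillions of tokens through all parameters, while fine-tuning touches orders of magnitude fewer tokens. A rough sketch using the common ~6 × params × tokens FLOPs rule of thumb; the token counts and throughput below are illustrative assumptions, not figures from the BloombergGPT paper:

```python
def training_flops(n_params: float, n_tokens: float) -> float:
    # Common rule of thumb: ~6 FLOPs per parameter per token
    # (forward pass + backward pass combined)
    return 6.0 * n_params * n_tokens

def gpu_hours(flops: float, flops_per_sec: float = 3e14) -> float:
    # Assumes ~300 TFLOP/s of effective per-GPU throughput (illustrative)
    return flops / flops_per_sec / 3600

pretrain = training_flops(7e9, 2e12)  # 7B model, 2T tokens, from scratch
finetune = training_flops(7e9, 5e7)   # same model, 50M fine-tuning tokens

ratio = pretrain / finetune  # fine-tuning here is ~40,000x cheaper
```

At any fixed price per GPU-hour, that compute ratio is exactly the gap between a hobbyist-scale fine-tuning bill and a multi-million-dollar pretraining run.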
PS: SUPER!!! Another most-excellent textbook from SR. I got it! Minor note... Your 45% discount code was not accepted, since Manning already discounts the ebook by 50%.
PPS: You are missing an opportunity with this new textbook. What about a chapter on 'Beyond Language To Multi-Modal'? The term LLM is aging; it should be LxM, covering both pretraining inputs and generative outputs.
Thanks for this; I especially appreciated the diagram for fine-tuning models on a domain-specific dataset. It would be great if you could expand on that a bit in your upcoming blogs. I see these models performing increasingly well on academic datasets, but I feel it's really limiting to customize LLMs for a domain-specific dataset through prompting alone. I am also reading your book (first 2 chapters) and enjoying it.
Thank you for all your generous contributions to my AI Learning Journey
Very insightful summary.
I was hoping that the Amber paper would make it for the same reasons: releasing the weights, data, and methodology used to train the model.
FWIW, although Axis of Ordinary is the only daily Substack that I regularly read, Ahead of AI has been my favorite and only must-read Substack -- and I subscribe to over three dozen AI-related Substacks.
Keep up the great work in 2024!
I'm curious what you think of the Mamba paper (https://arxiv.org/pdf/2312.00752.pdf) and how it stacks up. It is relatively new but has shown potential for sub-quadratic scaling.
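The sub-quadratic point can be made concrete with a schematic cost comparison: self-attention compares every token with every other token, so its cost grows with the square of sequence length, while a recurrent scan (as in state-space models like Mamba) does constant work per token. The counts below are illustrative operation tallies, not the papers' exact FLOP formulas:

```python
def attention_cost(seq_len: int) -> int:
    # Pairwise token interactions: every query attends to every key
    return seq_len * seq_len

def scan_cost(seq_len: int) -> int:
    # One fixed-size state update per token
    return seq_len

# The advantage of the linear-time scan grows with context length
for L in (1_000, 8_000, 64_000):
    print(L, attention_cost(L) // scan_cost(L))  # ratio grows linearly with L
```

At a 64k-token context, the quadratic term is already tens of thousands of times larger, which is why long-context efficiency is the headline claim for these architectures.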
Hi Sebastian,
Thanks for a great post again! And Happy New Year 2024!
These are wonderful recommendations! I wonder where the LIMA paper ranks :)
Wish you a wonderful new year and thanks ever so much for all your work!
Dr. Raschka,
Have you ever had this discussion? Back in the early 1990s I worked with a few thousand other engineers and scientists within the DoD and DOE laboratory systems. We tried to change people's minds within our community about one little thing, but we could not overcome the great weight of the uneducated but very greedy entrepreneurs who were in love with the term "Artificial Intelligence" (AI). AI sold programs. AI sucked in the investors. But there is no such thing as AI and never will be. The best that our sciences and engineering will ever do is to Mimic Intelligence (MI). The associative engine which is the brain spews out thoughts that can only be mimicked by the best of our code writers. The papers that you provided are wonderful only insofar as their authors were able to capture and articulate the intellectual products of their own minds, i.e., real intelligence.
In all of my years of work, I never met anyone who wanted to use the term MI instead of AI, even though they knew that AI was a myth. Are all of us in the scientific world so greedy that we are willing to put belief systems first even when the facts are glaringly obvious? We do an injustice to all the brilliant people who have the brilliant thoughts when we lead our users into believing that the codes are intelligent, even artificially.
All the best,
David
Love your content! Do you have any plans to post content on large multimodal models (LMMs) anytime in the near future?
Are 7B language models in the middle of their own Moore's Law-esque curve with respect to performance? It seems like, more and more, new foundation models are being trained at up to 7B parameters and outclassing 70B-parameter models.
I'm guessing there are big resource limitations on how frequently you can train a 70B parameter model, which makes me think we'll see more efficiency gains applied at smaller sizes.
I have developed a Kaggle notebook for learning TPU v3-8 + Kaggle + LLM red teaming, free for 20 hours per week. Running models on TPUs is super fast!
Try out the link & share: https://www.kaggle.com/code/jaycneo/gemma-tpu-llm-red-teaming-notebook-detoxio-ai/