13 Comments
Dec 9, 2023 · Liked by Sebastian Raschka, PhD

Hi Sebastian. Eddie Siman here. Nice job.

author

Thanks, Eddie!

Dec 9, 2023 · Liked by Sebastian Raschka, PhD

Thanks - a very nice overview of recent developments and insights! More reading for me to do... :)

author

Glad to hear it's useful! Happy reading!

Dec 11, 2023 · Liked by Sebastian Raschka, PhD

Hi Sebastian, regarding your hallucination review: when you say "the good news is that their approach is fully automated and, therefore, could easily be scaled to larger datasets,"

I'm not so sure.

It was simple because they used Wikipedia as the source of truth. To answer questions beyond "bios," they would have to add more and more sources of truth (SOTs) and then develop a priority/voting mechanism to arbitrate between these SOTs. I don't think that scales.

Also, I wonder whether there wouldn't be more benefit in starting the LLM's training with a richer dataset, i.e., going for quality in addition to quantity, or even using metadata within the model to track sources/authors.

Thanks for your work. It's really helpful.

author

That's a good point. The reliance on Wikipedia is kind of the Achilles' heel here. The reference-free truthfulness score (Method 2) would be easier to scale, but yeah, it's generally much weaker than the Wikipedia-based Method 1.
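
(In case a concrete example helps: below is a rough sketch of what a reference-free truthfulness proxy could look like, scoring a statement by the model's own average token log-probability. This is just an illustrative stand-in, not the exact method from the paper, and the model name is a placeholder.)

```python
# Rough sketch: score a statement by the model's average token log-probability.
# This is a simple reference-free proxy, not the exact method from the paper.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder; any causal LM would work
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

def avg_logprob(statement: str) -> float:
    inputs = tokenizer(statement, return_tensors="pt")
    with torch.no_grad():
        # Passing the input ids as labels makes the model return the mean
        # cross-entropy loss, i.e., the negative average token log-likelihood
        outputs = model(**inputs, labels=inputs["input_ids"])
    return -outputs.loss.item()

# Higher scores mean the model is more "confident" in the statement;
# this is much weaker than checking individual claims against Wikipedia.
print(avg_logprob("The Eiffel Tower is located in Paris."))
print(avg_logprob("The Eiffel Tower is located in Berlin."))
```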

Regarding your second point, I absolutely agree regarding quality. Papers like LIMA (https://arxiv.org/abs/2305.11206) and Orca 2 (discussed in this article) show that quality really pays off. (It's ye goode olde classic machine learning saying "garbage in, garbage out")

Dec 11, 2023 · Liked by Sebastian Raschka, PhD

Another key revelation from the Orca 2 paper is how tailored system instructions during training significantly enhance the response quality and accuracy of LLMs like GPT-4, highlighting the need for smaller models to adopt task-specific strategies rather than merely mimicking larger models.

author

Yes, I would say the two key takeaways here are

1. Data quality really really matters

2. Most people also don't need a one-size-fits-all LLM; specific, tailored LLMs are useful, too
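
(To make the "tailored system instructions" idea a bit more tangible, here's a loose illustration, not the paper's exact recipe, of how training examples might pair a task-specific system message with each instruction/response before fine-tuning. The system prompts and examples below are made up for demonstration.)

```python
# Loose illustration: pairing task-specific system instructions with
# training examples before formatting them for instruction fine-tuning.
# The prompts and examples are invented for demonstration purposes.

task_system_prompts = {
    "math": "Solve the problem step by step and show your reasoning.",
    "summarization": "Summarize the text concisely and factually.",
    "direct_qa": "Answer the question directly without explanation.",
}

training_examples = [
    {"task": "math", "instruction": "What is 17 * 23?",
     "response": "17 * 23 = 17 * 20 + 17 * 3 = 340 + 51 = 391."},
    {"task": "direct_qa", "instruction": "Capital of France?",
     "response": "Paris."},
]

def format_example(example):
    system = task_system_prompts[example["task"]]
    # Simple prompt template; real projects would use the model's chat template
    return (
        f"<|system|>\n{system}\n"
        f"<|user|>\n{example['instruction']}\n"
        f"<|assistant|>\n{example['response']}"
    )

for ex in training_examples:
    print(format_example(ex))
    print("---")
```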

Dec 13, 2023 · Liked by Sebastian Raschka, PhD

Do smaller models lose generalisation capabilities due to their size?

Like, if we train a 1B-parameter model, it would memorize the training set.

But how would it behave with out-of-distribution data?

LoRA is interesting as well. We don't have to fine-tune the base model per se.
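
(For context, a minimal sketch of the LoRA idea: the base weight stays frozen and only two small low-rank matrices are trained, so the effective weight becomes W + B A. Dimensions and hyperparameters below are just illustrative placeholders.)

```python
# Minimal LoRA sketch: freeze the base linear layer and train only the
# low-rank matrices A and B, scaled by alpha / rank.
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, in_dim, out_dim, rank=8, alpha=16):
        super().__init__()
        self.base = nn.Linear(in_dim, out_dim, bias=False)
        self.base.weight.requires_grad = False  # base model stays frozen
        self.A = nn.Parameter(torch.randn(rank, in_dim) * 0.01)
        self.B = nn.Parameter(torch.zeros(out_dim, rank))  # zero init: no change at start
        self.scaling = alpha / rank

    def forward(self, x):
        # Frozen base projection plus the trainable low-rank update
        return self.base(x) + self.scaling * (x @ self.A.T @ self.B.T)

layer = LoRALinear(768, 768)
out = layer(torch.randn(1, 768))
print(out.shape)  # torch.Size([1, 768])
```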

author

That's an interesting question, and I unfortunately don't have an answer for that at the moment. I think the current problem is that for most publicly available models, we don't know the training data (Pythia is a nice counterexample), and according to the common perception, some of these datasets may have contained the evaluation benchmarks they are being evaluated on. For that, we would need a controlled study with models of different sizes (again referencing the Pythia paper here because they did a great job with that, https://arxiv.org/abs/2304.01373).

Dec 10, 2023 · Liked by Sebastian Raschka, PhD

Awesome!

I wonder whether training smaller models on reasoning steps injects some notion of planning into the model.

author

It's interesting to speculate what's happening under the hood -- I think it was a nice (and already long) paper, but yeah, they barely scratched the surface in terms of the mechanisms at play. I think this opens up a lot of directions (and work) for future follow-up studies.

hmmm, "More concretely, changing "invented" to "created" would be acceptable. However, changing "Apple" to "Orange" or "Samsung" (or changing the date) would obviously be a significant error."

I don't think you intended this to be deliberately incendiary. Changing "invented" to "created" is actually accurate. They created products specced to my vision, scrupulously accurately, to this day. I was in direct contact with Jobs for years, until shortly before he died, and my "inventions" - the fruits of my brow - powered the "second coming" of Apple. You see my design language writ large, and my nomenclature used by millions (billions?). My middle name is "Ian," and it is my preferred name; hence the "i" prefixing devices and apps "I" "incepted." This is a thing with me; I like to use "easter eggs" in things I incept: WWW, Wii, lots more ...

Now, here is the point of my diatribe: can you use an AI to deduce whether what I am saying is true? I have tried, but things I know to be true can be, and actively are, being cancelled. Hijacked, even. This can be done in real time, and it is getting more and more realistic all the time. A scary example of such false provenance is "The Last President," and the prototype "Search and Destroy" by Skunk Anansie supposedly (but NOT) by way of Iggy Pop and the Stooges. The shenanigans were strong with engineering the false provenance of these "tulpa" (and other creations). Can you see the problem? It is existential ...
