Hi Sebastian. Eddie Siman here. Nice job.
Thanks, Eddie!
Thanks - a very nice overview of recent developments and insights! More reading for me to do... :)
Glad to hear it's useful! Happy reading!
Hi Sebastian, on your hallucination review: when you say "the good news is that their approach is fully automated and, therefore, could easily be scaled to larger datasets,"
I'm not so sure.
It was simple because they used Wikipedia as the source of truth. If they want to answer questions beyond "bios," they would have to add more and more sources of truth (SOTs) and then develop a priority/voting mechanism to arbitrate between these SOTs. I don't think this scales.
Also, I wonder whether there wouldn't be more benefit in starting the training of the LLM with a richer dataset, i.e., going for quality in addition to quantity, or even using metadata within the model to track sources/authors.
Thanks for your work. It's really helpful.
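To make the scaling concern concrete, here is a toy sketch of what a priority/voting mechanism across multiple sources of truth might look like. The source names and trust rankings are entirely made up for illustration; maintaining those rankings (and resolving disagreements) is exactly the part that's hard to scale:

```python
# Hypothetical sketch: several sources of truth (SOTs) answer the same
# factual query, and a trust-weighted vote arbitrates between them.
# Source names and priority values below are illustrative only.

from collections import defaultdict

# Higher value = more trusted; these rankings are assumptions, not
# something any paper prescribes.
SOURCES = {"wikipedia": 3, "news_archive": 2, "web_crawl": 1}

def arbitrate(answers: dict) -> str:
    """answers maps a source name to that source's answer string."""
    votes = defaultdict(int)
    for source, answer in answers.items():
        votes[answer] += SOURCES.get(source, 0)  # weight vote by trust
    # Return the answer with the highest total trust weight.
    return max(votes, key=votes.get)

print(arbitrate({
    "wikipedia": "1976",
    "news_archive": "1976",
    "web_crawl": "1977",
}))  # prints 1976
```

Even this toy version shows the problem: every new SOT needs a trust value, and every conflict needs a tiebreak policy, which quickly becomes a curation project of its own.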
That's a good point. The reliance on Wikipedia is kind of the Achilles' heel here. The reference-free truthfulness score (Method 2) would be easier to scale, but yeah, it's generally much weaker than the Wikipedia-based Method 1.
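For readers curious what "reference-free" can mean in general: one common family of approaches scores a claim by how consistently the model reproduces it across repeated samples, instead of checking against an external source. This is a generic sampling-consistency sketch, not the specific method from the paper, and `sample_model` is a stand-in for a real LLM call:

```python
# Toy reference-free consistency score: sample the model several times
# and score a claim by the fraction of samples that agree with it.

def sample_model(prompt: str, n: int) -> list:
    # Placeholder: a real implementation would query an LLM with
    # temperature > 0 and return n sampled completions.
    return ["Paris", "Paris", "Lyon", "Paris"][:n]

def consistency_score(claim: str, prompt: str, n: int = 4) -> float:
    samples = sample_model(prompt, n)
    # Exact string match for simplicity; real systems would use a
    # softer agreement check (e.g., entailment).
    agree = sum(1 for s in samples if s == claim)
    return agree / len(samples)

print(consistency_score("Paris", "Capital of France?"))  # prints 0.75
```

The appeal is that no curated source of truth is needed; the weakness is that a model can be consistently wrong, which is why such scores tend to be noisier than reference-based checks.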
Regarding your second point, I absolutely agree regarding quality. Papers like LIMA (https://arxiv.org/abs/2305.11206) and Orca 2 (discussed in this article) show that quality really pays off. (It's ye goode olde classic machine learning saying "garbage in, garbage out")
Another key revelation from the Orca 2 paper is how tailored system instructions during training significantly enhance the response quality and accuracy of LLMs like GPT-4, highlighting the need for smaller models to adopt task-specific strategies rather than merely mimicking larger models.
Yes, I would say the two key takeaways here are:
1. Data quality really, really matters
2. Most people also don't need a one-size-fits-all LLM; specifically tailored LLMs are useful, too
Do smaller models lose generalisation capabilities due to their size?
Like, if we train a 1B-parameter model, would it just memorize the training set?
And how would it behave with out-of-distribution data?
LoRA is interesting as well. We don’t have to fine-tune the base model per se.
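For context, the core of LoRA is small enough to sketch in a few lines: the base weight stays frozen, and only a low-rank update is trained. This NumPy sketch uses illustrative shapes and rank (no training loop, not a real framework implementation):

```python
import numpy as np

# Minimal LoRA forward-pass sketch: frozen weight W plus a trainable
# low-rank update B @ A, scaled by alpha / rank. Dimensions are made up.
rng = np.random.default_rng(0)
d_out, d_in, rank, alpha = 8, 8, 2, 4

W = rng.standard_normal((d_out, d_in))        # frozen base weight
A = rng.standard_normal((rank, d_in)) * 0.01  # trainable, small init
B = np.zeros((d_out, rank))                   # trainable, zero init

def lora_forward(x):
    # Base path plus low-rank adapter path.
    return W @ x + (alpha / rank) * (B @ (A @ x))

x = rng.standard_normal(d_in)
# With B initialized to zero, the adapter starts as an exact no-op:
assert np.allclose(lora_forward(x), W @ x)
```

The zero-initialized `B` is why fine-tuning starts from exactly the base model's behavior, and only the small `A`/`B` matrices (instead of all of `W`) need gradients.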
That's an interesting question, and I unfortunately don't have an answer for that at the moment. I think the current problem is that for most publicly available models, we don't know the training data (Pythia is a nice counterexample), and according to the common perception, some of these datasets may have contained the evaluation benchmarks they are being evaluated on. For that, we would need a controlled study with models of different sizes (again referencing the Pythia paper here because they did a great job with that, https://arxiv.org/abs/2304.01373).
Awesome!
I wonder whether training a smaller model on reasoning steps injects some notion of planning into the model.
It's interesting to speculate what's happening under the hood -- I think it was a nice (and already long) paper, but yeah, they barely scratched the surface in terms of the mechanisms at play. I think this opens up a lot of directions (and work) for future follow-up studies.
hmmm, "More concretely, changing "invented" to "created" would be acceptable. However, changing "Apple" to "Orange" or "Samsung" (or changing the date) would obviously be a significant error."
I don't think you intended this to be deliberately incendiary. Changing "invented" to "created" is actually accurate. They created products specced to my vision. Scrupulously accurately, to this day. I was in direct contact with Jobs for years, till shortly before he died, and my "inventions" - the fruits of my brows - powered the "second coming" of Apple. You see my design language writ large, and my nomenclature used by millions (billions?). My middle name is "Ian" and it is my preferred name. Hence the "i" prefixing devices and apps "I" "incepted"; this is a thing with me, I like to use "easter eggs" in things I incept: WWW, Wii, lots more ...

Now, here is the point of my diatribe: can you use an AI to deduce whether what I am saying is true? I have tried, but things I know to be true can be, and actively are being, cancelled. Hijacked, even. This can be done in real time, and the fakes are getting more and more realistic all the time. A scary example of such false provenance is "The Last President", and the prototype "Search and Destroy" by Skunk Anansie supposedly (but NOT) by way of Iggy Pop and the Stooges. The shenanigans were strong with engineering the false provenance of these "tulpa" (and other creations). Can you see the problem? It is existential ...