Hi Sebastian. Eddie Siman here. Nice job.
Thanks, Eddie!
Thanks - a very nice overview of recent developments and insights! More reading for me to do... :)
Glad to hear it's useful! Happy reading!
Hi Sebastian, on your hallucination review: when you say "the good news is that their approach is fully automated and, therefore, could easily be scaled to larger datasets,"
I'm not so sure.
It was simple because they used Wikipedia as the source of truth. If they want to answer questions beyond "bios," they would have to add more and more sources of truth (SOTs) and then develop a priority/voting mechanism to arbitrate between these SOTs. I don't think this scales.
Also, I wonder whether there wouldn't be more benefit in starting the training of the LLM with a richer dataset, i.e., going for quality in addition to quantity, or even using metadata within the model to track sources/authors.
Thanks for your work. It's really helpful.
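To make the scaling concern concrete, here is a toy sketch of what a priority/voting mechanism across multiple sources of truth might look like. The source names and trust rankings are entirely made up for illustration; maintaining those rankings (and resolving disagreements) is exactly the part that's hard to scale:

```python
# Hypothetical sketch: several sources of truth (SOTs) answer the same
# factual query, and a trust-weighted vote arbitrates between them.
# Source names and priority values below are illustrative only.

from collections import defaultdict

# Higher value = more trusted; these rankings are assumptions, not
# something any paper prescribes.
SOURCES = {"wikipedia": 3, "news_archive": 2, "web_crawl": 1}

def arbitrate(answers: dict) -> str:
    """answers maps a source name to that source's answer string."""
    votes = defaultdict(int)
    for source, answer in answers.items():
        votes[answer] += SOURCES.get(source, 0)  # weight vote by trust
    # Return the answer with the highest total trust weight.
    return max(votes, key=votes.get)

print(arbitrate({
    "wikipedia": "1976",
    "news_archive": "1976",
    "web_crawl": "1977",
}))  # prints 1976
```

Even this toy version shows the problem: every new SOT needs a trust value, and every conflict needs a tiebreak policy, which quickly becomes a curation project of its own.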
That's a good point. The reliance on Wikipedia is kind of the Achilles' heel here. The reference-free truthfulness score (Method 2) would be easier to scale, but yeah, it's generally much weaker than the Wikipedia-based Method 1.
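For readers curious what "reference-free" can mean in general: one common family of approaches scores a claim by how consistently the model reproduces it across repeated samples, instead of checking against an external source. This is a generic sampling-consistency sketch, not the specific method from the paper, and `sample_model` is a stand-in for a real LLM call:

```python
# Toy reference-free consistency score: sample the model several times
# and score a claim by the fraction of samples that agree with it.

def sample_model(prompt: str, n: int) -> list:
    # Placeholder: a real implementation would query an LLM with
    # temperature > 0 and return n sampled completions.
    return ["Paris", "Paris", "Lyon", "Paris"][:n]

def consistency_score(claim: str, prompt: str, n: int = 4) -> float:
    samples = sample_model(prompt, n)
    # Exact string match for simplicity; real systems would use a
    # softer agreement check (e.g., entailment).
    agree = sum(1 for s in samples if s == claim)
    return agree / len(samples)

print(consistency_score("Paris", "Capital of France?"))  # prints 0.75
```

The appeal is that no curated source of truth is needed; the weakness is that a model can be consistently wrong, which is why such scores tend to be noisier than reference-based checks.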
Regarding your second point, I absolutely agree regarding quality. Papers like LIMA (https://arxiv.org/abs/2305.11206) and Orca 2 (discussed in this article) show that quality really pays off. (It's ye goode olde classic machine learning saying "garbage in, garbage out")
Another key revelation from the Orca 2 paper is how tailored system instructions during training significantly enhance the response quality and accuracy of LLMs like GPT-4, highlighting the need for smaller models to adopt task-specific strategies rather than merely mimicking larger models.
Yes, I would say the two key takeaways here are:
1. Data quality really, really matters
2. Most people also don't need a one-size-fits-all LLM; specifically tailored LLMs are useful, too
Do smaller models lose generalisation capabilities due to their size?
Like, if we train a 1B-parameter model, would it just memorize the training set?
And how would it behave with out-of-distribution data?
LoRA is interesting as well. We don’t have to fine-tune the base model per se.
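For context, the core of LoRA is small enough to sketch in a few lines: the base weight stays frozen, and only a low-rank update is trained. This NumPy sketch uses illustrative shapes and rank (no training loop, not a real framework implementation):

```python
import numpy as np

# Minimal LoRA forward-pass sketch: frozen weight W plus a trainable
# low-rank update B @ A, scaled by alpha / rank. Dimensions are made up.
rng = np.random.default_rng(0)
d_out, d_in, rank, alpha = 8, 8, 2, 4

W = rng.standard_normal((d_out, d_in))        # frozen base weight
A = rng.standard_normal((rank, d_in)) * 0.01  # trainable, small init
B = np.zeros((d_out, rank))                   # trainable, zero init

def lora_forward(x):
    # Base path plus low-rank adapter path.
    return W @ x + (alpha / rank) * (B @ (A @ x))

x = rng.standard_normal(d_in)
# With B initialized to zero, the adapter starts as an exact no-op:
assert np.allclose(lora_forward(x), W @ x)
```

The zero-initialized `B` is why fine-tuning starts from exactly the base model's behavior, and only the small `A`/`B` matrices (instead of all of `W`) need gradients.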
That's an interesting question, and I unfortunately don't have an answer for that at the moment. I think the current problem is that for most publicly available models, we don't know the training data (Pythia is a nice counterexample), and according to the common perception, some of these datasets may have contained the evaluation benchmarks they are being evaluated on. For that, we would need a controlled study with models of different sizes (again referencing the Pythia paper here because they did a great job with that, https://arxiv.org/abs/2304.01373).
Awesome!
I wonder whether training a smaller model on reasoning steps injects some notion of planning into the model.
It's interesting to speculate what's happening under the hood -- I think it was a nice (and already long) paper, but yeah, they barely scratched the surface in terms of the mechanisms at play. I think this opens up a lot of directions (and work) for future follow-up studies.
hmmm, "More concretely, changing "invented" to "created" would be acceptable. However, changing "Apple" to "Orange" or "Samsung" (or changing the date) would obviously be a significant error."
I don't think you intended this to be deliberately incendiary. Changing "invented" to "created" is actually accurate. They created products specced to my vision. Scrupulously accurately, to this day. I was in direct contact with Jobs for years, till shortly before he died, and my "inventions" - the fruits of my brows - powered the "second coming" of Apple. You see my design language writ large, and my nomenclature used by millions (billions?). My middle name is "Ian" and it is my preferred name. Hence the "i" prefixing devices and apps "I" "incepted"; this is a thing with me, I like to use "easter eggs" in things I incept: WWW, Wii, lots more ...

Now, here is the point of my diatribe: can you use an AI to deduce whether what I am saying is true? I have tried, but things I know to be true can be, and actively are being, cancelled. Hijacked, even. This can be done in real time, and the fakes are getting more and more realistic all the time. A scary example of such false provenance is "The Last President", and the prototype "Search and Destroy" by Skunk Anansie supposedly (but NOT) by way of Iggy Pop and the Stooges. The shenanigans were strong with engineering the false provenance of these "tulpa" (and other creations). Can you see the problem? It is existential ...