25 Comments
Nikolaos Evangelou:

Thanks for the excellent overview of LLM evaluation methods. One thing I’ve been thinking about recently is that for deep research tasks (like searching for scientific papers or searching public databases), the four methods you describe feel necessary. Do you think they are sufficient, or do we need additional evaluation layers to better capture performance in real research contexts?

Sebastian Raschka, PhD:

I'd maybe describe them as necessary but not sufficient baselines. For something like paper search, you probably want a stronger emphasis on hallucination checks, maybe a RAG-style, precision-and-recall-focused setup. You could then also involve the LLM-as-a-judge approach for that.
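(A minimal sketch of what such a precision/recall check could look like for paper search, assuming a small gold set of relevant paper IDs exists per query; all IDs and names below are made up.)

```python
# Hypothetical sketch: precision/recall for a paper-search eval,
# assuming a gold set of relevant paper IDs exists for each query.

def precision_recall(retrieved_ids, relevant_ids):
    retrieved, relevant = set(retrieved_ids), set(relevant_ids)
    true_positives = len(retrieved & relevant)
    precision = true_positives / len(retrieved) if retrieved else 0.0
    recall = true_positives / len(relevant) if relevant else 0.0
    return precision, recall

# The model cited 4 papers; 3 of them are in the gold set of 5.
p, r = precision_recall(
    retrieved_ids=["paper-001", "paper-002", "paper-003", "paper-999"],
    relevant_ids=["paper-001", "paper-002", "paper-003", "paper-004", "paper-005"],
)
print(f"precision={p:.2f}, recall={r:.2f}")  # precision=0.75, recall=0.60
```

Retrieved IDs that are not in the gold set (like "paper-999" above) are candidates for hallucinated citations and could be handed to an LLM judge for a second pass.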

Abhishek Shivkumar:

Thanks, Sebastian. Another superb article. Sorry, just wanted to clarify: shouldn't the numbered heading "3." near the end actually be "4."?

Sebastian Raschka, PhD:

Thanks for the feedback. Do you mean the subheader "3.1 Implementing a LLM-as-a-judge..." -> "4.1 Implementing a LLM-as-a-judge..."? Yes, it should have been 4. Fixed it!

Anshuman Kumar:

This is a great article! I have a question that I haven't seen any evals research paper address. Coming from a statistics perspective, why don't preference alignment and human evaluation methods account for person heterogeneity? This has always tripped me up. Shouldn't what a good response looks like depend on an individual's characteristics, and isn't that confounding our evaluation? Different populations would prefer different types of responses to the same question, but we don't account for this in any of our methodologies.

Sebastian Raschka, PhD:

Yes, that's a good point. I think em-dashes are a good example of stylistic preferences. Personally, I was an avid em-dash user (my professor always complained about it back then), and publishers I worked with (since 2015) straight-up didn't allow me to use em-dashes. And now ChatGPT's excessive use of em-dashes is frowned upon in tech circles. On the other hand, human preference labelers must have loved (or at least preferred) them during the RLHF stage, otherwise ChatGPT wouldn't be using them so excessively. Long story short, preference is very subjective.

Anshuman Kumar:

I wonder if, as part of "LLM as a judge", one could ask multiple LLMs to adopt different roles/personalities, then use this pseudo-population of role-based LLMs to judge or provide preference-based ratings, and use the pseudo-population's characteristics to adjust for confounding bias in reward models. Sorry for the ramble, not sure if any of this makes sense lol

Sebastian Raschka, PhD:

Am I understanding correctly that you are suggesting "LLM-as-a-judge ensemble methods"? Yes, that makes sense. I.e., one could use LLMs from different model families for this, and perhaps also vary the rubric.
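(For illustration, a minimal sketch of such a judge ensemble; the model names, rubrics, and the `call_llm` placeholder below are hypothetical and would need to be wired up to a real chat API.)

```python
# Illustrative LLM-as-a-judge ensemble: several judge models, each paired
# with several rubrics, score the same answer, and the scores are aggregated.
# `call_llm` is a stub standing in for a real chat-API call.
from statistics import median

JUDGE_MODELS = ["judge-model-a", "judge-model-b", "judge-model-c"]  # hypothetical names
RUBRICS = [
    "Rate the answer from 1-5 for factual correctness.",
    "Rate the answer from 1-5 for helpfulness and completeness.",
    "Rate the answer from 1-5 for clarity and conciseness.",
]

def call_llm(model: str, prompt: str) -> str:
    # Placeholder: replace with an actual API call; should return an integer as text.
    return "4"

def ensemble_judge(question: str, answer: str) -> float:
    scores = []
    for model in JUDGE_MODELS:
        for rubric in RUBRICS:
            prompt = (
                f"{rubric}\n\nQuestion: {question}\nAnswer: {answer}\n"
                "Reply with a single integer."
            )
            scores.append(int(call_llm(model, prompt)))
    # The median is more robust to a single eccentric judge than the mean.
    return median(scores)

print(ensemble_judge("What is the capital of France?", "Paris"))
```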

Anshuman Kumar:

Sorry for the late reply, I've been juggling a lot in life. I'm not only suggesting ensembling different LLMs, but also ensembling personality types generated via LLMs. A close analogy is the WIRED YouTube video series titled "Expert Explains One Concept in 5 Levels of Difficulty".

So, here is what I was thinking more concretely:

Step 1:

Use LLMs to adopt different roles/personalities, let's say:

1. Personality Type A (use a specific prompt to generate Type A with certain attributes)

2. Personality Type B (use a specific prompt to generate Type B with certain attributes)

...

n. Personality Type N (use a specific prompt to generate Type N with certain attributes)

Step 2:

Use a separate LLM from a different model family to generate a population (pseudo-persons) of personality types A to N, say 100 people each. That would be 100 × (number of personality types) LLM instances (which may not be computationally feasible to generate in parallel).

Step 3:

Now we have a population of pseudo persons with known attributes (from step 1). During RL instruction finetuning, use each of these pseudo persons to perform pairwise ranking.

Step 4:

One can now train a reward model that also uses the attributes of the pseudo-persons to train the LLM to account for person heterogeneity. Maybe one could even modify the DPO and GRPO objective functions to account for these attributes directly.

I love the idea of having different rubrics. Maybe one could randomize the rubric system that each personality-type pseudo-person uses, to minimize sensitivity on the same task (not sure if it would improve inter-annotator agreement).
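(A minimal sketch of Steps 1-3 as a data-collection loop, with made-up personas and a stubbed-out `call_llm` placeholder; the resulting records carry the persona attributes and could then feed an attribute-aware reward model as in Step 4.)

```python
# Hypothetical sketch of Steps 1-3: persona-conditioned pairwise preferences.
# Personas, prompts, and the `call_llm` stub are illustrative placeholders only.
import random

PERSONAS = [  # Step 1: personality types with known attributes
    {"type": "A", "expertise": "novice", "style": "prefers plain language"},
    {"type": "B", "expertise": "expert", "style": "prefers technical detail"},
]

def call_llm(prompt: str) -> str:
    # Placeholder for a real chat-API call; should return "1" or "2".
    return random.choice(["1", "2"])

def sample_population(personas, n_per_type=100):
    # Step 2: replicate each personality type into a pseudo-population.
    return [dict(p) for p in personas for _ in range(n_per_type)]

def collect_preferences(question, answer_1, answer_2, population):
    # Step 3: each pseudo-person performs a pairwise ranking.
    records = []
    for person in population:
        prompt = (
            f"You are a {person['expertise']} reader who {person['style']}.\n"
            f"Question: {question}\nAnswer 1: {answer_1}\nAnswer 2: {answer_2}\n"
            "Which answer do you prefer? Reply with 1 or 2."
        )
        choice = call_llm(prompt)
        # Storing the persona attributes next to the label lets a reward
        # model condition on them later (Step 4).
        records.append({"type": person["type"], "expertise": person["expertise"],
                        "style": person["style"], "preferred": choice})
    return records

population = sample_population(PERSONAS, n_per_type=100)
prefs = collect_preferences("Explain overfitting.", "Answer one ...", "Answer two ...", population)
print(len(prefs), prefs[0])
```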

Tim Dingman:

Demographic evals like this are coming. I think it's mostly been a matter of prioritization so far, e.g. if the model is still bad at instruction following (objective) then you don't worry so much about style (subjective). But now that a lot of the core capabilities are more than good enough for most users, it's time to turn to more varied preferences.

Sebastian Raschka, PhD:

Yes, right now most LLMs are built as "one-size-fits-all" models, which is maybe due to compute limitations. But I'm sure there will be more prioritization like you suggest.

Rushabh Doshi:

Excellent article!

Sebastian Raschka, PhD:

Thanks!

tc:

Great article! Thanks for sharing.

sian cao:

Great read, thanks for the excellent intro. By the way, this may be a typo?

> "Vice versa, if the current model loses against a lowly ranked model, it increases the rating only a little"

I think you mean wins against?
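(For context, the rating behavior the quoted sentence describes follows from a generic Elo-style update, sketched below; the exact variant the leaderboard uses may differ.)

```python
# Generic Elo update: beating a much lower-rated opponent yields only a
# small rating gain, while losing to one costs a lot (and vice versa).

def expected_score(rating_a: float, rating_b: float) -> float:
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))

def elo_update(rating_a: float, rating_b: float, score_a: float, k: float = 32) -> float:
    # score_a is 1.0 for a win, 0.0 for a loss, 0.5 for a tie
    return rating_a + k * (score_a - expected_score(rating_a, rating_b))

print(elo_update(1400, 1000, 1.0))  # win vs. a weaker model: ~1402.9 (+2.9)
print(elo_update(1400, 1000, 0.0))  # loss vs. a weaker model: ~1370.9 (-29.1)
```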

Sebastian Raschka, PhD:

Good catch, thanks! Should be fixed now.

Katya Artemova:

Great article, thank you, Sebastian! Where would you place human evaluation in your taxonomy? Do you think LLM judges should correlate with human judgement?

Sebastian Raschka, PhD:

Good point. I'd say the human part is overseeing and evaluating the evaluations. But the leaderboard is also usually based on human preferences. Plus, your intuition about LLM judges is spot on: I would say LLM judges evolved from human judges. If I recall correctly, the Llama 2 paper was one of the first to show both LLM and human evals, and then people moved toward more LLM-based evals since human evals are harder to collect. Also, one might argue that LLM-based evals are more reproducible.

Felix:

Can you please teach us how to build agentic AI from scratch? Without langchain, crewAI etc.?

Sebastian Raschka, PhD:

One day! Currently still working on the engine for that (reasoning models from scratch).

Tanner Davis:

Awesome article! One minor typo I noticed is the heading for "LLMs as judges" should be Method 4.

Sebastian Raschka, PhD:

Oh yes, good catch. Should be "Method 4: " not "Method 3: ". Fixed it!

Martin Miceli:

Enlightening article, Sebastian! Why not also include this article as an appendix in your latest book? I believe it would be a great complementary addition for a more holistic understanding. I've already read your first book (which was amazing, by the way) and am now reading the MEAP version of your reasoning models book - impatiently waiting for your 3rd chapter to be released.

Sebastian Raschka, PhD:

Thanks! Glad you liked the book. And another yes: the appendix would be a good place for that.

PS: I submitted chapter 3 last week and hope it goes live soon. It usually takes the Manning team ~1 week for the preliminary formatting and uploading.

suman suhag:

Does the proof of Lamport's bakery algorithm assume that values written to elements of "choosing" or "number" instantly become visible to threads on other cores that may read these memory locations, or are writes allowed to have latency?
