25 Comments
Nikolaos Evangelou:

Thanks for the excellent overview of LLM evaluation methods. One thing I’ve been thinking about recently is that for deep research tasks (like searching for scientific papers or searching public databases), the four methods you describe feel necessary. Do you think they are sufficient, or do we need additional evaluation layers to better capture performance in real research contexts?

Sebastian Raschka, PhD:

I'd maybe describe them as necessary but not sufficient baselines. For something like paper search, you probably want a stronger emphasis on hallucination checks, maybe a RAG-style, precision-and-recall-focused setup. You could then also involve the LLM-as-a-judge approach for that.
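(A minimal sketch of what such a precision/recall check could look like for paper search, assuming a small gold set of relevant paper IDs exists per query; all IDs and names below are made up.)

```python
# Hypothetical sketch: precision/recall for a paper-search eval,
# assuming a gold set of relevant paper IDs exists for each query.

def precision_recall(retrieved_ids, relevant_ids):
    retrieved, relevant = set(retrieved_ids), set(relevant_ids)
    true_positives = len(retrieved & relevant)
    precision = true_positives / len(retrieved) if retrieved else 0.0
    recall = true_positives / len(relevant) if relevant else 0.0
    return precision, recall

# The model cited 4 papers; 3 of them are in the gold set of 5.
p, r = precision_recall(
    retrieved_ids=["paper-001", "paper-002", "paper-003", "paper-999"],
    relevant_ids=["paper-001", "paper-002", "paper-003", "paper-004", "paper-005"],
)
print(f"precision={p:.2f}, recall={r:.2f}")  # precision=0.75, recall=0.60
```

Retrieved IDs that are not in the gold set (like "paper-999" above) are candidates for hallucinated citations and could be handed to an LLM judge for a second pass.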

Abhishek Shivkumar:

Thanks, Sebastian. Another superb article. Sorry, just wanted to clarify: shouldn't the numbered heading "3." near the end actually be "4."?

Sebastian Raschka, PhD:

Thanks for the feedback. Do you mean the subheader "3.1 Implementing a LLM-as-a-judge..." -> "4.1 Implementing a LLM-as-a-judge..."? Yes, it should have been 4. Fixed it!

Anshuman Kumar:

This is a great article! I have a question that I haven't seen any evals research paper address. Coming from a statistics perspective, why don't preference alignment and human evaluation methods account for person heterogeneity? This has always tripped me up. Shouldn't what a good response looks like depend on an individual's characteristics, and isn't that confounding our evaluation? Different populations would prefer different types of responses to the same question, but we don't account for this in any of our methodologies.

Sebastian Raschka, PhD:

Yes, that's a good point. I think em-dashes are a good example of stylistic preferences. Personally, I was an avid em-dash user (my professor always complained about it back then), and publishers I worked with (since 2015) straight-up didn't allow me to use em-dashes. And now ChatGPT's excessive use of em-dashes is frowned upon in tech circles. On the other hand, human preference labelers must have loved (or at least preferred) them during the RLHF stage, otherwise ChatGPT wouldn't be using them so excessively. Long story short, preference is very subjective.

Anshuman Kumar:

I wonder if, as part of "LLM as a judge", one could ask multiple LLMs to adopt different roles/personalities, then use this pseudo-population of role-based LLMs to judge or provide preference-based ratings, and use the pseudo-population's characteristics to adjust for confounding bias in reward models. Sorry for the ramble, not sure if any of this makes sense lol

Sebastian Raschka, PhD:

Am I understanding correctly that you are suggesting "LLM-as-a-judge ensemble methods"? Yes, that makes sense. I.e., one could use LLMs from different model families for this, and perhaps also vary the rubric.
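(For illustration, a minimal sketch of such a judge ensemble; the model names, rubrics, and the `call_llm` placeholder below are hypothetical and would need to be wired up to a real chat API.)

```python
# Illustrative LLM-as-a-judge ensemble: several judge models, each paired
# with several rubrics, score the same answer, and the scores are aggregated.
# `call_llm` is a stub standing in for a real chat-API call.
from statistics import median

JUDGE_MODELS = ["judge-model-a", "judge-model-b", "judge-model-c"]  # hypothetical names
RUBRICS = [
    "Rate the answer from 1-5 for factual correctness.",
    "Rate the answer from 1-5 for helpfulness and completeness.",
    "Rate the answer from 1-5 for clarity and conciseness.",
]

def call_llm(model: str, prompt: str) -> str:
    # Placeholder: replace with an actual API call; should return an integer as text.
    return "4"

def ensemble_judge(question: str, answer: str) -> float:
    scores = []
    for model in JUDGE_MODELS:
        for rubric in RUBRICS:
            prompt = (
                f"{rubric}\n\nQuestion: {question}\nAnswer: {answer}\n"
                "Reply with a single integer."
            )
            scores.append(int(call_llm(model, prompt)))
    # The median is more robust to a single eccentric judge than the mean.
    return median(scores)

print(ensemble_judge("What is the capital of France?", "Paris"))
```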

Anshuman Kumar:

Sorry for the late reply, I've been juggling a lot in life. I'm not only suggesting ensembling different LLMs, but also ensembling personality types generated via LLMs. A close analogy is the WIRED YouTube video series titled "Expert Explains One Concept in 5 Levels of Difficulty".

So, here is what I was thinking more concretely:

Step 1:

Use LLMs to adopt different roles/personalities, let's say:

1. Personality Type A (use a specific prompt to generate Type A with certain attributes)

2. Personality Type B (use a specific prompt to generate Type B with certain attributes)

...

n. Personality Type N (use a specific prompt to generate Type N with certain attributes)

Step 2:

Use a separate LLM from a different model family to generate a population (pseudo-persons) of personality types A to N, say 100 people each. That would be 100 × (number of personality types) LLM instances (which may not be computationally feasible to generate in parallel).

Step 3:

Now we have a population of pseudo persons with known attributes (from step 1). During RL instruction finetuning, use each of these pseudo persons to perform pairwise ranking.

Step 4:

One can now train a reward model that also uses the attributes of the pseudo-persons to train the LLM to account for person heterogeneity. Maybe one could even modify the DPO and GRPO objective functions to account for these attributes directly.

I love the idea of having different rubrics. Maybe one could randomize the rubric system that each personality-type pseudo-person uses, to minimize sensitivity on the same task (not sure if it would improve inter-annotator agreement).
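(A minimal sketch of Steps 1-3 as a data-collection loop, with made-up personas and a stubbed-out `call_llm` placeholder; the resulting records carry the persona attributes and could then feed an attribute-aware reward model as in Step 4.)

```python
# Hypothetical sketch of Steps 1-3: persona-conditioned pairwise preferences.
# Personas, prompts, and the `call_llm` stub are illustrative placeholders only.
import random

PERSONAS = [  # Step 1: personality types with known attributes
    {"type": "A", "expertise": "novice", "style": "prefers plain language"},
    {"type": "B", "expertise": "expert", "style": "prefers technical detail"},
]

def call_llm(prompt: str) -> str:
    # Placeholder for a real chat-API call; should return "1" or "2".
    return random.choice(["1", "2"])

def sample_population(personas, n_per_type=100):
    # Step 2: replicate each personality type into a pseudo-population.
    return [dict(p) for p in personas for _ in range(n_per_type)]

def collect_preferences(question, answer_1, answer_2, population):
    # Step 3: each pseudo-person performs a pairwise ranking.
    records = []
    for person in population:
        prompt = (
            f"You are a {person['expertise']} reader who {person['style']}.\n"
            f"Question: {question}\nAnswer 1: {answer_1}\nAnswer 2: {answer_2}\n"
            "Which answer do you prefer? Reply with 1 or 2."
        )
        choice = call_llm(prompt)
        # Storing the persona attributes next to the label lets a reward
        # model condition on them later (Step 4).
        records.append({"type": person["type"], "expertise": person["expertise"],
                        "style": person["style"], "preferred": choice})
    return records

population = sample_population(PERSONAS, n_per_type=100)
prefs = collect_preferences("Explain overfitting.", "Answer one ...", "Answer two ...", population)
print(len(prefs), prefs[0])
```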

Tim Dingman:

Demographic evals like this are coming. I think it's mostly been a matter of prioritization so far, e.g. if the model is still bad at instruction following (objective) then you don't worry so much about style (subjective). But now that a lot of the core capabilities are more than good enough for most users, it's time to turn to more varied preferences.

Sebastian Raschka, PhD:

Yes, right now most LLMs are built as "one-size-fits-all" models, which is maybe due to compute limitations. But I'm sure there will be more prioritization like you suggest.

Rushabh Doshi:

Excellent article!

Sebastian Raschka, PhD:

Thanks!

tc:

Great article! Thanks for sharing.

sian cao:

Great read, thanks for the excellent intro. By the way, this may be a typo?

> "Vice versa, if the current model loses against a lowly ranked model, it increases the rating only a little"

I think you mean wins against?
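(For context, the rating behavior the quoted sentence describes follows from a generic Elo-style update, sketched below; the exact variant the leaderboard uses may differ.)

```python
# Generic Elo update: beating a much lower-rated opponent yields only a
# small rating gain, while losing to one costs a lot (and vice versa).

def expected_score(rating_a: float, rating_b: float) -> float:
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))

def elo_update(rating_a: float, rating_b: float, score_a: float, k: float = 32) -> float:
    # score_a is 1.0 for a win, 0.0 for a loss, 0.5 for a tie
    return rating_a + k * (score_a - expected_score(rating_a, rating_b))

print(elo_update(1400, 1000, 1.0))  # win vs. a weaker model: ~1402.9 (+2.9)
print(elo_update(1400, 1000, 0.0))  # loss vs. a weaker model: ~1370.9 (-29.1)
```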

Sebastian Raschka, PhD:

Good catch, thanks! Should be fixed now.

Katya Artemova:

Great article, thank you, Sebastian! Where would you place human evaluation in your taxonomy? Do you think LLM judges should correlate with human judgement?

Sebastian Raschka, PhD:

Good point. I'd say the human part is overseeing and evaluating the evaluations. But the leaderboard is also usually based on human preferences. Plus, your intuition about LLM judges is spot on: I would say LLM judges evolved from human judges. If I recall correctly, the Llama 2 paper was one of the first to show both LLM and human evals, and then people moved toward more LLM-based evals since human evals are harder to collect. Also, one might argue that LLM-based evals are more reproducible.

Felix:

Can you please teach us how to build agentic AI from scratch? Without langchain, crewAI etc.?

Sebastian Raschka, PhD:

One day! Currently still working on the engine for that (reasoning models from scratch).

Tanner Davis:

Awesome article! One minor typo I noticed is the heading for "LLMs as judges" should be Method 4.

Sebastian Raschka, PhD:

Oh yes, good catch. Should be "Method 4: " not "Method 3: ". Fixed it!

Martin Miceli:

Enlightening article, Sebastian! Why not also include this article as an appendix in your latest book? I believe it would be a great complementary addition for a more holistic understanding. I've already read your first book (which was amazing, by the way) and am now reading the MEAP version of your reasoning models book - impatiently waiting for your 3rd chapter to be released.

Sebastian Raschka, PhD:

Thanks! Glad you liked the book. And another yes: the appendix would be a good place for that.

PS: I submitted chapter 3 last week and hope it goes live soon. It usually takes the Manning team ~1 week for the preliminary formatting and uploading.

suman suhag:

Does the proof of Lamport's bakery algorithm assume that values written to elements of "choosing" or "number" instantly become visible to threads on other cores that may read these memory locations, or are writes allowed to have latency?
