Why Most LLM Eval Is Theater
Most teams evaluating LLMs are doing it wrong. Not because they don't care about quality, but because the tools and practices haven't caught up to the problem.
Here's what I see over and over: someone writes a prompt, runs it against 10 examples, eyeballs the output, and says "looks good." That's not evaluation. That's a gut check.
The vibes problem
When you evaluate by reading outputs, you're doing two things at once. You're judging whether the output is good, and you're judging whether your judgment is consistent. Neither of those is reliable at small scale.
Human evaluators disagree with themselves roughly 20-30% of the time on subjective tasks. So when you read 10 outputs and call 8 of them good, you don't actually know that 8 are good. You know that you thought 8 were good at that particular moment.
This is the foundation that every production LLM decision is built on. It should concern you.
LLM-as-judge makes it worse
The natural response to "human eval doesn't scale" is to use another LLM as the judge. This is fine in theory. In practice, nobody validates whether the judge is any good.
You now have two black boxes. One generates outputs. The other scores them. And you're trusting the second one because... it's an LLM and LLMs are smart?
The judge has its own biases. It might prefer longer outputs. It might anchor on the first sentence. It might score differently depending on where you put the reference answer in the prompt. If you haven't measured these things, you don't have an evaluation pipeline. You have a random number generator with good vibes.
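Position bias, at least, is cheap to measure: show the judge the same pair of answers in both orders and count how often the verdict flips. A minimal sketch, where `judge` is a stand-in for whatever pairwise judge you run (any callable taking two candidates and returning "A" or "B"), not a real API:

```python
def position_bias_rate(judge, pairs):
    """Fraction of pairs where the verdict flips when order is swapped.

    A position-insensitive judge picks the same *answer* (not the same
    slot) regardless of presentation order, so flips indicate bias.
    """
    flips = 0
    for ans_1, ans_2 in pairs:
        first = judge(ans_1, ans_2)    # ans_1 in slot A
        swapped = judge(ans_2, ans_1)  # ans_1 now in slot B
        # Consistent verdicts are "A" then "B" (both pick ans_1) or
        # "B" then "A" (both pick ans_2). Anything else is a flip.
        if {first, swapped} != {"A", "B"}:
            flips += 1
    return flips / len(pairs)

# Demo with a deliberately broken judge that always prefers slot A:
always_first = lambda a, b: "A"
print(position_bias_rate(always_first, [("x", "y"), ("p", "q")]))  # 1.0
```

The same swap-and-compare structure works for length bias (pair short and long answers of equal quality) and self-preference (pair outputs from the judge's own model family against another's).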
What statistical rigor actually looks like
Real eval needs three things:
1. A validated judge. Before you trust an LLM to score outputs, you need to measure its agreement with human raters. Cohen's kappa, not just accuracy. You need to know if the judge has position bias, length bias, or self-preference. This is what JudgeBench does. It gives you a statistical report card for your judge before you use it in production.
2. Structured scoring with confidence intervals. A score of 4.2 out of 5 means nothing without a confidence interval. G-Eval gives you token probabilities that you can turn into proper distributions. Now you can say "this prompt scores 4.2 plus or minus 0.3" and that actually means something. EvalKit handles this.
3. Regression detection that isn't just "compare two numbers." When you update a prompt and the score drops from 4.2 to 4.0, is that a real regression or noise? Threshold each output to pass/fail and McNemar's test tells you. It's a paired statistical test designed for exactly this situation: same inputs, two different treatments, binary outcomes. DriftWatch automates this.
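Cohen's kappa is worth spelling out, because it corrects raw agreement for the agreement you'd get by chance, which matters whenever one label dominates. A self-contained sketch with illustrative labels (not from any real dataset):

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Agreement between two raters, corrected for chance agreement.

    Undefined (division by zero) in the degenerate case where chance
    agreement is 1, i.e. both raters always emit the same single label.
    """
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    # Chance agreement: probability both raters pick the same label
    # if each sampled independently from their own label frequencies.
    expected = sum(freq_a[c] * freq_b.get(c, 0) for c in freq_a) / n**2
    return (observed - expected) / (1 - expected)

human = ["good", "bad", "good", "good"]
judge = ["good", "bad", "bad", "good"]
print(cohens_kappa(human, judge))  # 0.5
```

Here raw accuracy is 75%, but kappa is only 0.5: half of that agreement is what chance alone would buy you given how often each rater says "good."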
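The G-Eval weighted-score trick is simple enough to sketch: instead of taking the single score token the model emits, read the probabilities of each candidate score token and compute a mean and spread. Assuming you can get those per-token probabilities from your provider's logprobs (the input dict here is made up for illustration):

```python
import math

def score_distribution(token_probs):
    """token_probs maps each candidate score (e.g. 1-5) to the model's
    probability of emitting that score token. Returns (mean, std)."""
    # Renormalize: the score tokens rarely carry all the probability
    # mass, since some goes to whitespace, words, etc.
    total = sum(token_probs.values())
    probs = {s: p / total for s, p in token_probs.items()}
    mean = sum(s * p for s, p in probs.items())
    var = sum(p * (s - mean) ** 2 for s, p in probs.items())
    return mean, math.sqrt(var)

mean, std = score_distribution({3: 0.1, 4: 0.6, 5: 0.3})
print(round(mean, 2), round(std, 2))  # 4.2 0.6
```

That spread is per-output; the confidence interval on a prompt's overall score then comes from aggregating across your eval set (e.g. a bootstrap over examples).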
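McNemar's test only needs the discordant counts: the items where exactly one of the two prompt versions passed. Concordant items cancel out. An exact version via the binomial distribution, pure stdlib:

```python
from math import comb

def mcnemar_exact(b, c):
    """Exact two-sided McNemar p-value for paired binary outcomes.

    b: items the old prompt passed but the new one failed.
    c: items the old prompt failed but the new one passed.
    Under the null, discordant items split 50/50 between b and c.
    """
    n = b + c
    if n == 0:
        return 1.0  # no discordant pairs, nothing to test
    k = min(b, c)
    # Two-sided exact binomial tail: P(X <= k) * 2 for X ~ Bin(n, 0.5).
    p = sum(comb(n, i) for i in range(k + 1)) * 0.5 ** n * 2
    return min(p, 1.0)

# 9 regressions vs 1 improvement among 10 discordant items:
print(mcnemar_exact(9, 1))  # 0.021484375
```

A small p-value means the drop from 4.2 to 4.0 is unlikely to be noise on those inputs; a large one means you haven't shown anything yet.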
The eval stack
These three tools chain together:
JudgeBench validates your judge. EvalKit scores your outputs with that judge. DriftWatch catches regressions when you change things.
Each one is independent and useful on its own, but together they give you something that almost nobody has right now: LLM quality assurance with real statistical power.
Why this matters
If you're shipping LLM features to users, you're making quality decisions every day. Every prompt change, every model swap, every system prompt tweak is a decision that affects output quality. And if your eval process is "run it a few times and see if it looks right," you're flying blind.
The bar for eval in production AI is going to go up fast. Teams that build rigorous eval infrastructure now will be able to move faster later, because they'll actually know when things break.
The tools exist. The statistics aren't new. The gap is just in applying them to LLM workflows. That's what I'm building.