Position bias in LLM-as-judge flipped 18% of our verdicts

TL;DR: Position bias in LLM-as-judge means the verdict depends on where an answer sits in the prompt, most often favoring the one the model reads first. We measured an 18% verdict flip rate from swapping order alone, and dual-pass scoring brought it under 4%.

Our pairwise evaluation harness at Nexus Labs scored answer A over answer B in 18% of cases purely because A appeared first in the judge prompt. We caught it when a regression in our agent-automation model showed a 6-point win on the leaderboard that vanished the moment a teammate reran the same comparisons with the candidate listed second. Position bias in LLM-as-judge is well documented, but most teams never measure it on their own data, so they ship on numbers that move when you shuffle the prompt. The judge model here was gpt-4o-2024-08-06, scoring 1,200 pairwise comparisons of customer-support agent responses.

This is the part of evaluation that gets skipped because the harness looks like it works. It returns scores. The scores have decimals. They go in a dashboard. Nobody checks whether the decimals mean anything.

What position bias in LLM-as-judge actually is

Position bias in LLM-as-judge is the tendency of a model to prefer a response based on where it sits in the prompt rather than its quality. When you ask a model to pick the better of two answers, listing the same answer first versus second changes the verdict at a measurable rate. The effect was named in "Large Language Models are not Fair Evaluators" and confirmed across judge models in the MT-Bench paper.

It is not random noise. The bias has a direction. In our runs gpt-4o preferred the first position about 11 percentage points more often than chance would predict, which is consistent with the first-position skew reported in both papers above.
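
A minimal sketch of one way to quantify that direction, assuming every pair is judged in both orders so that each response occupies each slot exactly once. Under that design an unbiased judge should pick the first slot in half of its decisive calls. The function name and the "first"/"second"/"tie" labels are illustrative, not part of any library.

```python
from collections import Counter


def first_slot_skew(slot_verdicts):
    """Return how far the first-slot pick rate sits above 50%, in percentage points.

    `slot_verdicts` holds one label per judge call: "first", "second", or "tie",
    naming the slot the judge picked. Ties are ignored.
    """
    counts = Counter(slot_verdicts)
    decisive = counts["first"] + counts["second"]
    if decisive == 0:
        return 0.0  # no decisive calls, so no measurable skew
    return 100.0 * (counts["first"] / decisive - 0.5)
```

A positive result means the judge leans toward the first slot, a negative one toward the second.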

How we measured the flip rate

The measurement is cheap. For every pair, run the judge twice: once with the candidate in slot A, once in slot B. If the verdict changes when only the order changed, that pair is order-sensitive. The flip rate is the fraction of pairs where this happens.

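def judge_pair(judge, question, resp_x, resp_y):
    # returns "x" if the first-listed response wins, "y" if the second wins, or "tie"
    return judge.compare(question, first=resp_x, second=resp_y)

# pairs: list of (question, candidate, baseline) tuples
flips = 0
for q, cand, base in pairs:
    v1 = judge_pair(judge, q, cand, base)   # candidate first
    v2 = judge_pair(judge, q, base, cand)   # candidate second
    # normalize v2 back to the candidate-first framing ("x" = candidate wins)
    v2_norm = {"x": "y", "y": "x", "tie": "tie"}[v2]
    # count only hard flips (win becomes loss); a win that turns into a tie is skipped
    if v1 != v2_norm and "tie" not in (v1, v2_norm):
        flips += 1

# guard against an empty comparison set
flip_rate = flips / len(pairs) if pairs else 0.0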

We ran this across three judges. gpt-4o flipped on 18% of pairs, claude-3-5-sonnet on 12%, and a smaller gpt-4o-mini judge flipped on 29%. In our sample the smallest judge flipped most, which fits the intuition that weaker models lean harder on surface cues like ordering.

To run the same comparison set against multiple providers without writing a client per vendor, we put the judge calls behind Bifrost and pointed the harness at one OpenAI-compatible endpoint. That is the only infrastructure note here; the method works with any client you already have.

Dual-pass scoring and other fixes

The fix that worked was the boring one: judge every pair in both orders and only count a win when both passes agree. Disagreements become ties. This is the conservative swap-both-orders approach described in the MT-Bench paper, and it dropped our flip-driven verdicts to under 4% of pairs, because a true difference in quality survives the swap while an order artifact does not.
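
The gate itself is only a few lines. Here is a sketch, assuming the same slot-labelled verdicts as the measurement code ("x" for the first-listed response, "y" for the second, "tie" otherwise). The names gated_verdict and tally are illustrative.

```python
from collections import Counter

# Maps a verdict from the swapped pass back to the candidate-first framing.
SWAP = {"x": "y", "y": "x", "tie": "tie"}


def gated_verdict(v_cand_first, v_cand_second):
    """Combine two passes into "candidate", "baseline", or "tie".

    A win counts only when both orders name the same response as the winner.
    """
    v2_norm = SWAP[v_cand_second]
    if v_cand_first == v2_norm and v_cand_first != "tie":
        return "candidate" if v_cand_first == "x" else "baseline"
    return "tie"  # disagreement, or a tie in either pass


def tally(verdict_pairs):
    """Count gated outcomes over an iterable of (pass1, pass2) verdicts."""
    return Counter(gated_verdict(v1, v2) for v1, v2 in verdict_pairs)
```

For example, gated_verdict("x", "x") returns "tie": the first slot won in both orders, which is exactly the pattern an order artifact produces.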

Three approaches we compared:

Dual-pass with agreement gate. Run both orders, count a win only on agreement. Doubles judge cost, removes most order artifacts. This is what we shipped.
Score averaging. Average a numeric score across both orders instead of gating. Cheaper to reason about, but a confident wrong score in one order can still drag the mean.
Reference-anchored scoring. Score each answer independently against a rubric instead of head-to-head, as in G-Eval. Removes pairwise ordering entirely, but rubric scores are noisier and harder to calibrate across raters.
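
To make the weakness of score averaging concrete, here is a small sketch with made-up numbers. It assumes the judge emits a 1-10 score per response in each pass; the function name and the margin parameter are hypothetical.

```python
def averaged_verdict(scores_cand_first, scores_cand_second, margin=0.5):
    """Average each response's score over both orders and compare the means.

    Each argument is a (candidate_score, baseline_score) tuple from one pass.
    `margin` is a hypothetical dead zone: smaller mean gaps count as ties.
    """
    cand_mean = (scores_cand_first[0] + scores_cand_second[0]) / 2
    base_mean = (scores_cand_first[1] + scores_cand_second[1]) / 2
    gap = cand_mean - base_mean
    if abs(gap) < margin:
        return "tie"
    return "candidate" if gap > 0 else "baseline"


# Made-up scores: the judge strongly favors the candidate when it is listed
# first (9 vs 5) and mildly favors the baseline when it is listed second (6 vs 7).
print(averaged_verdict((9, 5), (6, 7)))  # means are 7.5 vs 6.0 -> "candidate"
```

The two orders disagree about the winner, so an agreement gate would call this pair a tie, while averaging lets the one confident pass decide it.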

We also report Cohen's kappa between the two passes as a standing health metric. When kappa drops below 0.6 on a new judge or prompt template, we treat the judge as unreliable for that task and stop trusting its leaderboard until we debug the template.
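
For reference, Cohen's kappa is (p_o - p_e) / (1 - p_e), where p_o is the observed agreement between the two passes and p_e is the agreement expected by chance from each pass's label frequencies. A dependency-free sketch follows, assuming both passes have already been normalized to the same candidate-vs-baseline labels. The function name is illustrative; scikit-learn's cohen_kappa_score computes the same statistic if you already depend on it.

```python
from collections import Counter


def cohens_kappa(pass_a, pass_b):
    """Cohen's kappa between two equal-length lists of categorical verdicts."""
    if len(pass_a) != len(pass_b):
        raise ValueError("both passes must cover the same pairs")
    n = len(pass_a)
    if n == 0:
        raise ValueError("need at least one pair")

    # Observed agreement: fraction of pairs where both passes gave the same label.
    observed = sum(a == b for a, b in zip(pass_a, pass_b)) / n

    # Chance agreement: product of each pass's marginal frequency, summed over labels.
    freq_a, freq_b = Counter(pass_a), Counter(pass_b)
    labels = set(freq_a) | set(freq_b)
    expected = sum(freq_a[k] * freq_b[k] for k in labels) / (n * n)

    if expected == 1.0:
        return 1.0  # both passes used one identical label throughout
    return (observed - expected) / (1.0 - expected)
```

A value of 1.0 means the passes always agree, and 0.0 means they agree no more often than chance would produce.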

Trade-offs and limitations

Dual-pass doubles judge token cost, which on 1,200 pairs at our prompt sizes added a few dollars per eval run. That is fine for release gates and unacceptable for per-request online scoring, so we only run it offline.

Gating on agreement inflates the tie count. Roughly a fifth of our previously decisive verdicts became ties, which makes small model improvements harder to detect. That is the correct outcome, not a bug: if a difference does not survive an order swap, calling it a win was the original mistake.

None of this addresses other judge biases. Length bias, self-preference when a model judges its own outputs, and sensitivity to formatting all persist. Position bias is the easiest one to measure, so it is the right place to start, not the place to stop.

Where to go next

If you run any LLM-as-judge pipeline, measure your flip rate before you touch anything else. It takes one extra pass over an existing comparison set and tells you whether your leaderboard reflects model quality or prompt ordering. I would run the swap test on your next eval, log Cohen's kappa between passes, and only then argue about which model won.