LLM-as-Judge
Using an LLM to evaluate the quality of another model's outputs against a rubric. Scales subjective evaluation, but must be calibrated against human labels to be trustworthy.
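One way to check calibration is to measure chance-corrected agreement between judge scores and human labels on the same items. The sketch below computes Cohen's kappa in plain Python. The function name `cohen_kappa` and the sample labels are illustrative assumptions, not part of any particular eval framework, and kappa is only one of several agreement measures that could be used here.

```python
from collections import Counter


def cohen_kappa(human, judge):
    """Chance-corrected agreement between two equal-length label lists."""
    if len(human) != len(judge) or not human:
        raise ValueError("label lists must be non-empty and the same length")
    n = len(human)

    # Observed agreement: fraction of items where both raters gave the same label.
    observed = sum(h == j for h, j in zip(human, judge)) / n

    # Expected agreement if the raters labelled independently,
    # each following their own label frequencies.
    human_counts = Counter(human)
    judge_counts = Counter(judge)
    expected = sum(human_counts[c] * judge_counts[c] for c in human_counts) / (n * n)

    if expected == 1.0:
        # Both raters used one identical label throughout; agreement is trivially perfect.
        return 1.0
    return (observed - expected) / (1 - expected)


# Hypothetical 1-5 scores for six items, made up for illustration.
human_labels = [5, 4, 2, 1, 3, 4]
judge_labels = [5, 4, 2, 2, 3, 3]
print(f"kappa = {cohen_kappa(human_labels, judge_labels):.2f}")
```

A kappa near 1 means the judge tracks the human labels closely, while a value near 0 means agreement is no better than chance. What counts as "good enough" depends on the task and is a judgment call.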
Why use it
Human evaluation is the gold standard but does not scale. LLM-as-judge approximates human judgment for subjective qualities (helpfulness, coherence, faithfulness) cheaply and at volume, which is why it underpins much of modern LLM evaluation.
Good practice for the rubric
- Be specific: “rate 1-5 for factual grounding in the provided context” beats “rate quality”.
- Provide anchor examples for each score.
- Prefer pairwise comparison (A vs B) when absolute scoring is noisy.
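The three practices above can be sketched as prompt-building helpers. Everything here is a hypothetical illustration: the anchor wording, the function names (`build_absolute_prompt`, `build_pairwise_prompt`, `parse_score`), and the `Score: <n>` reply format are assumptions, and the actual call to a judge model is left out because it depends on the client in use.

```python
import re

# Hypothetical anchor descriptions for a 1-5 factual-grounding rubric.
GROUNDING_ANCHORS = {
    1: "Most claims contradict or are absent from the context.",
    2: "Several claims are unsupported by the context.",
    3: "Main claims are supported; some details are not.",
    4: "Nearly all claims are supported; minor unsupported detail.",
    5: "Every claim is directly supported by the context.",
}


def build_absolute_prompt(context, answer, criterion, anchors):
    """Build a scoring prompt with a specific criterion and one anchor per score."""
    lo, hi = min(anchors), max(anchors)
    lines = [f"Rate the answer from {lo} to {hi} for {criterion}.", "Score anchors:"]
    for score in sorted(anchors):
        lines.append(f"  {score}: {anchors[score]}")
    lines += ["", "Context:", context, "", "Answer:", answer, "",
              "Reply with 'Score: <n>' only."]
    return "\n".join(lines)


def build_pairwise_prompt(context, answer_a, answer_b, criterion):
    """Build an A-vs-B prompt for cases where absolute scores are too noisy."""
    return "\n".join([
        f"Which answer is better for {criterion}?",
        "", "Context:", context,
        "", "Answer A:", answer_a,
        "", "Answer B:", answer_b,
        "", "Reply with 'Winner: A' or 'Winner: B' only.",
    ])


def parse_score(reply, lo, hi):
    """Extract the integer score from a judge reply; None if missing or out of range."""
    match = re.search(r"Score:\s*(\d+)", reply)
    if match is None:
        return None
    score = int(match.group(1))
    return score if lo <= score <= hi else None


prompt = build_absolute_prompt(
    context="The Eiffel Tower is in Paris.",
    answer="The Eiffel Tower is located in Paris, France.",
    criterion="factual grounding in the provided context",
    anchors=GROUNDING_ANCHORS,
)
print(prompt)
```

Rejecting out-of-range or unparseable replies in `parse_score` keeps malformed judge output from silently entering the score distribution.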