
LLM-as-Judge

Using an LLM to evaluate the quality of another model's outputs against a rubric. Scales subjective evaluation, but must be calibrated against human labels to be trustworthy.

Why use it

Human evaluation is the gold standard but does not scale. An LLM judge approximates human judgment for subjective qualities (helpfulness, coherence, faithfulness) cheaply and at volume, which is why it is so widely used in LLM evaluation pipelines.
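How closely a judge tracks human judgment can be checked by scoring a small set of items both ways and measuring agreement. The sketch below is one possible way to do that, not a prescribed method: it computes raw agreement and Cohen's kappa (agreement corrected for chance) with the standard library only. The function names and the sample labels are illustrative assumptions, not part of any particular eval framework.

```python
import math
from collections import Counter


def agreement_rate(human, judge):
    """Fraction of items where the judge's label equals the human label."""
    if len(human) != len(judge) or not human:
        raise ValueError("need two non-empty label lists of equal length")
    return sum(h == j for h, j in zip(human, judge)) / len(human)


def cohens_kappa(human, judge):
    """Cohen's kappa: observed agreement corrected for chance agreement."""
    n = len(human)
    p_observed = agreement_rate(human, judge)
    human_counts = Counter(human)
    judge_counts = Counter(judge)
    # Chance agreement: probability both raters pick the same label
    # if each labelled independently at their own marginal rates.
    p_chance = sum(human_counts[c] * judge_counts[c] for c in human_counts) / (n * n)
    if p_chance == 1.0:
        # Both raters used one identical label throughout; kappa is undefined.
        return math.nan
    return (p_observed - p_chance) / (1.0 - p_chance)


# Hypothetical 1-5 scores for six outputs (made-up data for illustration).
human_scores = [1, 2, 3, 4, 5, 3]
judge_scores = [1, 2, 3, 4, 4, 3]

print(f"agreement: {agreement_rate(human_scores, judge_scores):.3f}")
print(f"kappa:     {cohens_kappa(human_scores, judge_scores):.3f}")
```

Kappa treats scores as unordered categories, so a 4-versus-5 disagreement counts the same as 1-versus-5. For ordinal scales, a weighted kappa or a rank correlation may be a better fit.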

Good practice for the rubric

  • Be specific: “rate 1-5 for factual grounding in the provided context” beats “rate quality”.
  • Provide anchor examples for each score.
  • Prefer pairwise comparison (A vs B) when absolute scoring is noisy.
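The three practices above can be sketched as prompt builders. This is a minimal illustration under stated assumptions: the `Rubric` class, the function names, the prompt wording, and the anchor texts are all hypothetical, and no model is called. The prompts are plain strings that could be sent to whichever judge model is in use.

```python
import re
from dataclasses import dataclass, field
from typing import Dict, Optional, Tuple


@dataclass
class Rubric:
    """Hypothetical container for a specific criterion plus score anchors."""
    criterion: str
    scale: Tuple[int, int] = (1, 5)
    anchors: Dict[int, str] = field(default_factory=dict)


def build_absolute_prompt(rubric: Rubric, context: str, answer: str) -> str:
    """Absolute scoring: one answer, a specific criterion, anchored scores."""
    low, high = rubric.scale
    anchor_lines = "\n".join(
        f"  {score}: {text}" for score, text in sorted(rubric.anchors.items())
    )
    return (
        f"Rate the answer from {low} to {high} for {rubric.criterion}.\n"
        f"Score anchors:\n{anchor_lines}\n\n"
        f"Context:\n{context}\n\nAnswer:\n{answer}\n\n"
        f"Reply with a single integer from {low} to {high}."
    )


def build_pairwise_prompt(rubric: Rubric, context: str, a: str, b: str) -> str:
    """Pairwise comparison: ask which of two answers is better on the criterion."""
    return (
        f"Compare the two answers for {rubric.criterion}.\n\n"
        f"Context:\n{context}\n\nAnswer A:\n{a}\n\nAnswer B:\n{b}\n\n"
        "Reply with exactly one letter: A or B."
    )


def parse_score(reply: str, rubric: Rubric) -> Optional[int]:
    """Return the first integer in the reply if it lies on the scale, else None."""
    match = re.search(r"\d+", reply)
    if match is None:
        return None
    score = int(match.group())
    low, high = rubric.scale
    return score if low <= score <= high else None


# Illustrative rubric; the anchor wording is invented for this example.
grounding = Rubric(
    criterion="factual grounding in the provided context",
    anchors={
        1: "Most claims are absent from or contradict the context.",
        3: "Main claims are supported; some details are not in the context.",
        5: "Every claim is directly supported by the context.",
    },
)

print(build_absolute_prompt(grounding, "The bridge opened in 1932.", "It opened in 1932."))
```

Rejecting out-of-range or missing scores in `parse_score`, rather than coercing them, keeps malformed judge replies visible instead of silently skewing averages.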
