llm-evaluation
Master comprehensive evaluation strategies for LLM applications, from automated metrics to human evaluation and A/B testing.
- risk: unknown
- source: community
- date added: 2026-02-27
LLM Evaluation
Master comprehensive evaluation strategies for LLM applications, from automated metrics to human evaluation and A/B testing.
Do not use this skill when
- The task is unrelated to LLM evaluation
- You need a different domain or tool outside this scope
Instructions
- Clarify goals, constraints, and required inputs.
- Apply relevant best practices and validate outcomes.
- Provide actionable steps and verification.
- If detailed examples are required, open resources/implementation-playbook.md.
Use this skill when
- Measuring LLM application performance systematically
- Comparing different models or prompts
- Detecting performance regressions before deployment
- Validating improvements from prompt changes
- Building confidence in production systems
- Establishing baselines and tracking progress over time
- Debugging unexpected model behavior
Core Evaluation Types
1. Automated Metrics
Fast, repeatable, scalable evaluation using computed scores.
Text Generation:
- BLEU: N-gram overlap (translation)
- ROUGE: Recall-oriented (summarization)
- METEOR: Unigram matching with stemming and synonym support
- BERTScore: Embedding-based similarity
- Perplexity: How well the model predicts the text (lower is better)
Classification:
- Accuracy: Percentage correct
- Precision/Recall/F1: Class-specific performance
- Confusion Matrix: Error patterns
- AUC-ROC: Ranking quality
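The classification metrics above can be computed with scikit-learn; a minimal sketch, assuming y_true and y_pred are lists of gold and predicted labels (illustrative names). AUC-ROC is omitted because it requires predicted probabilities rather than hard labels.

from sklearn.metrics import (
    accuracy_score,
    precision_recall_fscore_support,
    confusion_matrix,
)

def classification_metrics(y_true, y_pred):
    """Compute accuracy, macro precision/recall/F1, and the confusion matrix."""
    precision, recall, f1, _ = precision_recall_fscore_support(
        y_true, y_pred, average="macro", zero_division=0
    )
    return {
        "accuracy": accuracy_score(y_true, y_pred),
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "confusion_matrix": confusion_matrix(y_true, y_pred).tolist(),
    }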
Retrieval (RAG):
- MRR: Mean Reciprocal Rank
- NDCG: Normalized Discounted Cumulative Gain
- Precision@K: Fraction of the top K results that are relevant
- Recall@K: Fraction of all relevant items found in the top K
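A minimal sketch of the retrieval metrics for a single query, assuming retrieved is a ranked list of document IDs and relevant is the set of IDs judged relevant (illustrative names; MRR averages the reciprocal rank over queries, and the NDCG shown uses binary relevance, one common variant).

import math

def reciprocal_rank(retrieved, relevant):
    """Reciprocal rank of the first relevant result (0 if none is retrieved)."""
    for rank, doc_id in enumerate(retrieved, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0

def precision_at_k(retrieved, relevant, k):
    """Fraction of the top-k results that are relevant."""
    return sum(1 for doc_id in retrieved[:k] if doc_id in relevant) / k

def recall_at_k(retrieved, relevant, k):
    """Fraction of all relevant documents found in the top k."""
    return sum(1 for doc_id in retrieved[:k] if doc_id in relevant) / max(len(relevant), 1)

def ndcg_at_k(retrieved, relevant, k):
    """Binary-relevance NDCG: DCG of the ranking divided by the ideal DCG."""
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, doc_id in enumerate(retrieved[:k], start=1)
        if doc_id in relevant
    )
    ideal_dcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, min(len(relevant), k) + 1))
    return dcg / ideal_dcg if ideal_dcg > 0 else 0.0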
2. Human Evaluation
Manual assessment for quality aspects difficult to automate.
Dimensions:
- Accuracy: Factual correctness
- Coherence: Logical flow
- Relevance: Answers the question
- Fluency: Natural language quality
- Safety: No harmful content
- Helpfulness: Useful to the user
3. LLM-as-Judge
Use a stronger LLM to evaluate the outputs of another model.
Approaches:
- Pointwise: Score individual responses
- Pairwise: Compare two responses
- Reference-based: Compare to gold standard
- Reference-free: Judge without ground truth
Quick Start
from llm_eval import EvaluationSuite, Metric

# Define evaluation suite
suite = EvaluationSuite([
    Metric.accuracy(),
    Metric.bleu(),
    Metric.bertscore(),
    Metric.custom(name="groundedness", fn=check_groundedness)
])

# Prepare test cases
test_cases = [
    {
        "input": "What is the capital of France?",
        "expected": "Paris",
        "context": "France is a country in Europe. Paris is its capital."
    },
    # ... more test cases
]

# Run evaluation
results = suite.evaluate(
    model=your_model,
    test_cases=test_cases
)

print(f"Overall Accuracy: {results.metrics['accuracy']}")
print(f"BLEU Score: {results.metrics['bleu']}")
Automated Metrics Implementation
BLEU Score
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

def calculate_bleu(reference, hypothesis):
    """Calculate BLEU score between reference and hypothesis."""
    smoothie = SmoothingFunction().method4
    return sentence_bleu(
        [reference.split()],
        hypothesis.split(),
        smoothing_function=smoothie
    )

# Usage
bleu = calculate_bleu(
    reference="The cat sat on the mat",
    hypothesis="A cat is sitting on the mat"
)
ROUGE Score
from rouge_score import rouge_scorer

def calculate_rouge(reference, hypothesis):
    """Calculate ROUGE scores."""
    scorer = rouge_scorer.RougeScorer(['rouge1', 'rouge2', 'rougeL'], use_stemmer=True)
    scores = scorer.score(reference, hypothesis)
    return {
        'rouge1': scores['rouge1'].fmeasure,
        'rouge2': scores['rouge2'].fmeasure,
        'rougeL': scores['rougeL'].fmeasure
    }
BERTScore
from bert_score import score

def calculate_bertscore(references, hypotheses):
    """Calculate BERTScore using pre-trained transformer embeddings."""
    P, R, F1 = score(
        hypotheses,
        references,
        lang='en',
        model_type='microsoft/deberta-xlarge-mnli'
    )
    return {
        'precision': P.mean().item(),
        'recall': R.mean().item(),
        'f1': F1.mean().item()
    }
Custom Metrics
def calculate_groundedness(response, context):
    """Check if response is grounded in provided context."""
    # Use an NLI model to check entailment
    from transformers import pipeline
    nli = pipeline("text-classification", model="microsoft/deberta-large-mnli")
    result = nli(f"{context} [SEP] {response}")[0]
    # Return confidence that response is entailed by context
    return result['score'] if result['label'] == 'ENTAILMENT' else 0.0

def calculate_toxicity(text):
    """Measure toxicity in generated text."""
    from detoxify import Detoxify
    results = Detoxify('original').predict(text)
    return max(results.values())  # Return highest toxicity score

def calculate_factuality(claim, knowledge_base):
    """Verify factual claims against knowledge base."""
    # Implementation depends on your knowledge base
    # Could use retrieval + NLI, or fact-checking API
    pass
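One way to fill in the factuality stub above is retrieval plus NLI, mirroring the groundedness check; a minimal sketch, assuming the knowledge base exposes a search(query, k) method that returns passage strings (a hypothetical interface, not part of any specific library).

def calculate_factuality_nli(claim, knowledge_base, k=5):
    """Score a claim by whether any retrieved passage entails it (retrieval + NLI)."""
    from transformers import pipeline
    nli = pipeline("text-classification", model="microsoft/deberta-large-mnli")

    # knowledge_base.search is a hypothetical retrieval interface
    passages = knowledge_base.search(claim, k=k)

    best_score = 0.0
    for passage in passages:
        result = nli(f"{passage} [SEP] {claim}")[0]
        if result["label"] == "ENTAILMENT":
            best_score = max(best_score, result["score"])
    return best_score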
LLM-as-Judge Patterns
Single Output Evaluation
import json
from openai import OpenAI

client = OpenAI()

def llm_judge_quality(response, question):
    """Use GPT-5 to judge response quality."""
    prompt = f"""Rate the following response on a scale of 1-10 for:
1. Accuracy (factually correct)
2. Helpfulness (answers the question)
3. Clarity (well-written and understandable)

Question: {question}
Response: {response}

Provide ratings in JSON format:
{{
    "accuracy": <1-10>,
    "helpfulness": <1-10>,
    "clarity": <1-10>,
    "reasoning": "<brief explanation>"
}}
"""
    result = client.chat.completions.create(
        model="gpt-5",
        messages=[{"role": "user", "content": prompt}],
        temperature=0
    )
    return json.loads(result.choices[0].message.content)
Pairwise Comparison
def compare_responses(question, response_a, response_b):
    """Compare two responses using an LLM judge."""
    prompt = f"""Compare these two responses to the question and determine which is better.

Question: {question}

Response A: {response_a}

Response B: {response_b}

Which response is better and why? Consider accuracy, helpfulness, and clarity.

Answer with JSON:
{{
    "winner": "A" or "B" or "tie",
    "reasoning": "<explanation>",
    "confidence": <1-10>
}}
"""
    result = client.chat.completions.create(
        model="gpt-5",
        messages=[{"role": "user", "content": prompt}],
        temperature=0
    )
    return json.loads(result.choices[0].message.content)
Human Evaluation Frameworks
Annotation Guidelines
class AnnotationTask:
    """Structure for human annotation task."""

    def __init__(self, response, question, context=None):
        self.response = response
        self.question = question
        self.context = context

    def get_annotation_form(self):
        return {
            "question": self.question,
            "context": self.context,
            "response": self.response,
            "ratings": {
                "accuracy": {
                    "scale": "1-5",
                    "description": "Is the response factually correct?"
                },
                "relevance": {
                    "scale": "1-5",
                    "description": "Does it answer the question?"
                },
                "coherence": {
                    "scale": "1-5",
                    "description": "Is it logically consistent?"
                }
            },
            "issues": {
                "factual_error": False,
                "hallucination": False,
                "off_topic": False,
                "unsafe_content": False
            },
            "feedback": ""
        }
Inter-Rater Agreement
from sklearn.metrics import cohen_kappa_score

def calculate_agreement(rater1_scores, rater2_scores):
    """Calculate inter-rater agreement (Cohen's kappa)."""
    kappa = cohen_kappa_score(rater1_scores, rater2_scores)

    # Interpret kappa using the conventional Landis & Koch bands
    if kappa < 0:
        interpretation = "Poor"
    elif kappa < 0.2:
        interpretation = "Slight"
    elif kappa < 0.4:
        interpretation = "Fair"
    elif kappa < 0.6:
        interpretation = "Moderate"
    elif kappa < 0.8:
        interpretation = "Substantial"
    else:
        interpretation = "Almost Perfect"

    return {
        "kappa": kappa,
        "interpretation": interpretation
    }
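A usage sketch with illustrative 1-5 ratings from two annotators on the same seven responses:

# Illustrative ratings; in practice these come from your annotation tool
rater1 = [5, 4, 3, 4, 2, 5, 3]
rater2 = [5, 4, 2, 4, 3, 5, 3]

agreement = calculate_agreement(rater1, rater2)
print(agreement["kappa"], agreement["interpretation"])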
A/B Testing
Statistical Testing Framework
from scipy import stats
import numpy as np

class ABTest:
    def __init__(self, variant_a_name="A", variant_b_name="B"):
        self.variant_a = {"name": variant_a_name, "scores": []}
        self.variant_b = {"name": variant_b_name, "scores": []}

    def add_result(self, variant, score):
        """Add evaluation result for a variant."""
        if variant == "A":
            self.variant_a["scores"].append(score)
        else:
            self.variant_b["scores"].append(score)

    def analyze(self, alpha=0.05):
        """Perform statistical analysis."""
        a_scores = self.variant_a["scores"]
        b_scores = self.variant_b["scores"]

        # T-test
        t_stat, p_value = stats.ttest_ind(a_scores, b_scores)

        # Effect size (Cohen's d)
        pooled_std = np.sqrt((np.std(a_scores)**2 + np.std(b_scores)**2) / 2)
        cohens_d = (np.mean(b_scores) - np.mean(a_scores)) / pooled_std

        return {
            "variant_a_mean": np.mean(a_scores),
            "variant_b_mean": np.mean(b_scores),
            "difference": np.mean(b_scores) - np.mean(a_scores),
            "relative_improvement": (np.mean(b_scores) - np.mean(a_scores)) / np.mean(a_scores),
            "p_value": p_value,
            "statistically_significant": p_value < alpha,
            "cohens_d": cohens_d,
            "effect_size": self.interpret_cohens_d(cohens_d),
            "winner": "B" if np.mean(b_scores) > np.mean(a_scores) else "A"
        }

    @staticmethod
    def interpret_cohens_d(d):
        """Interpret Cohen's d effect size."""
        abs_d = abs(d)
        if abs_d < 0.2:
            return "negligible"
        elif abs_d < 0.5:
            return "small"
        elif abs_d < 0.8:
            return "medium"
        else:
            return "large"
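A usage sketch, assuming per-example quality scores (for instance, LLM-judge or human ratings) have already been collected for each prompt variant; scores_v1 and scores_v2 are illustrative names:

ab_test = ABTest(variant_a_name="prompt_v1", variant_b_name="prompt_v2")

# scores_v1 and scores_v2 are lists of per-example scores gathered elsewhere
for score in scores_v1:
    ab_test.add_result("A", score)
for score in scores_v2:
    ab_test.add_result("B", score)

report = ab_test.analyze(alpha=0.05)
print(report["winner"], report["p_value"], report["effect_size"])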
Regression Testing
Regression Detection
class RegressionDetector:
    def __init__(self, baseline_results, threshold=0.05):
        self.baseline = baseline_results
        self.threshold = threshold

    def check_for_regression(self, new_results):
        """Detect if new results show regression."""
        regressions = []

        for metric in self.baseline.keys():
            baseline_score = self.baseline[metric]
            new_score = new_results.get(metric)

            if new_score is None:
                continue

            # Calculate relative change
            relative_change = (new_score - baseline_score) / baseline_score

            # Flag if significant decrease
            if relative_change < -self.threshold:
                regressions.append({
                    "metric": metric,
                    "baseline": baseline_score,
                    "current": new_score,
                    "change": relative_change
                })

        return {
            "has_regression": len(regressions) > 0,
            "regressions": regressions
        }
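A usage sketch with illustrative baseline and candidate scores; with the 5% threshold, the accuracy drop below is flagged while the small BLEU gain is not:

baseline = {"accuracy": 0.90, "bleu": 0.32}   # illustrative baseline metrics
candidate = {"accuracy": 0.84, "bleu": 0.33}  # illustrative new run

detector = RegressionDetector(baseline_results=baseline, threshold=0.05)
report = detector.check_for_regression(candidate)

if report["has_regression"]:
    for regression in report["regressions"]:
        print(f"{regression['metric']}: {regression['baseline']} -> {regression['current']}")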
Benchmarking
Running Benchmarks
import numpy as np

class BenchmarkRunner:
    def __init__(self, benchmark_dataset):
        self.dataset = benchmark_dataset

    def run_benchmark(self, model, metrics):
        """Run model on benchmark and calculate metrics."""
        results = {metric.name: [] for metric in metrics}

        for example in self.dataset:
            # Generate prediction
            prediction = model.predict(example["input"])

            # Calculate each metric
            for metric in metrics:
                score = metric.calculate(
                    prediction=prediction,
                    reference=example["reference"],
                    context=example.get("context")
                )
                results[metric.name].append(score)

        # Aggregate results
        return {
            metric: {
                "mean": np.mean(scores),
                "std": np.std(scores),
                "min": min(scores),
                "max": max(scores)
            }
            for metric, scores in results.items()
        }
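BenchmarkRunner assumes metric objects that expose a name attribute and a calculate(prediction, reference, context) method; a minimal adapter sketch reusing calculate_bleu from earlier (SimpleMetric and your_model are illustrative names, not part of any library):

class SimpleMetric:
    """Minimal adapter matching the interface BenchmarkRunner expects."""

    def __init__(self, name, fn):
        self.name = name
        self.fn = fn

    def calculate(self, prediction, reference, context=None):
        return self.fn(reference, prediction)

metrics = [
    SimpleMetric("bleu", calculate_bleu),
    SimpleMetric("exact_match", lambda ref, pred: float(ref.strip() == pred.strip())),
]

runner = BenchmarkRunner(benchmark_dataset=[
    {"input": "What is the capital of France?", "reference": "Paris"},
])
summary = runner.run_benchmark(model=your_model, metrics=metrics)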
Resources
- references/metrics.md: Comprehensive metric guide
- references/human-evaluation.md: Annotation best practices
- references/benchmarking.md: Standard benchmarks
- references/a-b-testing.md: Statistical testing guide
- references/regression-testing.md: CI/CD integration
- assets/evaluation-framework.py: Complete evaluation harness
- assets/benchmark-dataset.jsonl: Example datasets
- scripts/evaluate-model.py: Automated evaluation runner
Best Practices
- Multiple Metrics: Use diverse metrics for comprehensive view
- Representative Data: Test on real-world, diverse examples
- Baselines: Always compare against baseline performance
- Statistical Rigor: Use proper statistical tests for comparisons
- Continuous Evaluation: Integrate into CI/CD pipeline (a minimal gate sketch follows this list)
- Human Validation: Combine automated metrics with human judgment
- Error Analysis: Investigate failures to understand weaknesses
- Version Control: Track evaluation results over time
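A minimal sketch of a CI gate built on the RegressionDetector above, assuming evaluation results are written to JSON files at the illustrative paths eval/baseline.json and eval/current.json and that the detector class is importable from your evaluation module:

import json
import sys

# RegressionDetector is assumed importable from the evaluation module defined above

def main():
    with open("eval/baseline.json") as f:   # illustrative paths
        baseline = json.load(f)
    with open("eval/current.json") as f:
        current = json.load(f)

    detector = RegressionDetector(baseline, threshold=0.05)
    report = detector.check_for_regression(current)

    if report["has_regression"]:
        print("Regressions detected:", report["regressions"])
        sys.exit(1)  # non-zero exit fails the pipeline
    print("No regressions detected.")

if __name__ == "__main__":
    main()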
Common Pitfalls
- Single Metric Obsession: Optimizing for one metric at the expense of others
- Small Sample Size: Drawing conclusions from too few examples
- Data Contamination: Testing on training data
- Ignoring Variance: Not accounting for statistical uncertainty
- Metric Mismatch: Using metrics not aligned with business goals