Google researchers show that the standard practice of 1-5 raters per benchmark item is insufficient for reproducible evaluation, and introduce a mathematical framework for optimizing the breadth-vs-depth trade-off in AI benchmark design.
Google researchers published 'Forest vs Tree: The (N,K) Trade-off in Reproducible ML Evaluation,' introducing a simulator-based framework for optimizing AI benchmark design. The research addresses a systemic blind spot: most AI evaluations use plurality voting across 1-5 raters, which collapses human disagreement into a single label and undermines reproducibility. The framework provides practitioners with mathematical tools to determine the optimal number of items (N) and raters per item (K) based on budget constraints. The work was conducted in collaboration with RIT PhD student Deepak Pandita and Prof. Christopher Homan.
If you're building eval pipelines with 3-5 raters per item and treating plurality as ground truth, your reproducibility is probably worse than you think. This research formalizes what the ML community has quietly known: collapsing disagreement into a single label destroys signal, especially on subjective tasks like toxicity, intent classification, or preference ranking. The (N,K) simulator gives you a principled way to decide whether to annotate 1,000 items with 2 raters or 200 items with 10 — and that choice meaningfully changes what your benchmark actually measures.
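To make the trade-off concrete, here is a minimal Monte Carlo sketch, not the paper's simulator: hold the annotation budget B = N*K fixed, simulate noisy raters on subjective items, and measure how much a plurality-vote benchmark score moves between two independent rater pools. The Beta(2,2) per-item agreement model and all function names are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def benchmark_score(p_item, k):
    """Plurality-vote positive rate for one rater pool: k raters per item.

    p_item[i] is the (assumed) probability that a random rater labels
    item i positive; ties under even k count as negative in this sketch.
    """
    votes = rng.random((len(p_item), k)) < p_item[:, None]  # rater labels
    return (votes.sum(axis=1) * 2 > k).mean()  # strict-majority positives

def reproducibility(n, k, trials=500):
    """Mean |score gap| between two independent replications of the benchmark."""
    gaps = []
    for _ in range(trials):
        # Fresh item sample each trial: Beta(2,2) models a subjective task
        # where many items sit near 50/50 rater agreement.
        p = rng.beta(2, 2, size=n)
        gaps.append(abs(benchmark_score(p, k) - benchmark_score(p, k)))
    return float(np.mean(gaps))

budget = 2000  # total annotations N * K held fixed
for k in (2, 5, 10):
    n = budget // k
    print(f"N={n:4d}, K={k:2d}: mean replication gap = {reproducibility(n, k):.4f}")
```

Sweeping K under a fixed budget like this makes the breadth-vs-depth tension visible: more raters per item stabilizes each label, but fewer items raises sampling noise in the aggregate score.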
If you're running RLHF preference data collection or any subjective annotation task this sprint, pressure-test your current rater count: compute inter-annotator agreement (Cohen's kappa or Krippendorff's alpha) on a 50-item sample at your existing K, then again with K doubled. If kappa shifts by more than 0.1, your benchmark is underpowered.
To follow along, install the dependencies in a Python environment: pip install nltk scikit-learn numpy
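Running the agreement check is a few lines with nltk's AnnotationTask, which takes (rater, item, label) triples and exposes both metrics. The triples here are hypothetical toy data, not from the paper.

```python
from nltk.metrics.agreement import AnnotationTask

# Hypothetical toy annotations: (rater_id, item_id, label) triples,
# two raters over four items of a toxicity task.
triples = [
    ("r1", "item1", "toxic"), ("r2", "item1", "toxic"),
    ("r1", "item2", "ok"),    ("r2", "item2", "toxic"),
    ("r1", "item3", "ok"),    ("r2", "item3", "ok"),
    ("r1", "item4", "toxic"), ("r2", "item4", "toxic"),
]

task = AnnotationTask(data=triples)
print(f"Cohen's kappa:        {task.kappa():.3f}")
print(f"Krippendorff's alpha: {task.alpha():.3f}")
```

Run this once on a sample annotated at your current K, and again on the same items with K doubled; compare the kappa values to apply the 0.1-shift check above.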