Leaderboard

The leaderboard is the public view of system performance from completed evaluations. It helps you see which models perform best on which datasets and domains.

What is the leaderboard?

  • A public page showing ranked results from human evaluations (and, when available, automatic metrics).
  • Columns: overall rank (competition ranking), model name, average rate (μ), rate standard deviation (σ), dataset name, number of evaluations (votes), and metadata (e.g. organization, license).
  • You can pin models to keep them visible for comparison when scrolling.

How scores are calculated and models are ranked

  • Model names may be anonymized in tasks (e.g. Model A, Model B); a mapping (task_models_shuffles) stores the real names for scoring.
  • For each model: aggregate rate and rank across all evaluated tasks; compute average rank (lower is better) and average rate (higher is better).
  • Sort: first by preference for better ranks (e.g. more 1st-place ranks), then by average rate to break ties.
  • Overall rank uses competition ranking: ties get the same rank; the next rank skips (e.g. 1, 2, 2, 4).
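The aggregation, sort order, and competition ranking described above can be sketched in Python. This is an illustrative sketch, not the platform's actual implementation: the model names, per-task `(rank, rate)` tuples, and the `results` structure are all hypothetical.

```python
from statistics import mean

# Hypothetical per-task results after de-anonymization via task_models_shuffles:
# model name -> list of (rank, rate) pairs, one per evaluated task.
results = {
    "Model A": [(1, 0.92), (2, 0.85), (1, 0.90)],
    "Model B": [(2, 0.88), (1, 0.91), (2, 0.80)],
    "Model C": [(3, 0.60), (3, 0.55), (3, 0.58)],
}

# Per-model aggregates: average rank (lower is better),
# average rate (higher is better), and count of 1st-place finishes.
stats = {
    m: {
        "first_places": sum(1 for r, _ in v if r == 1),
        "avg_rank": mean(r for r, _ in v),
        "avg_rate": mean(x for _, x in v),
    }
    for m, v in results.items()
}

# Sort: prefer more 1st-place ranks, then break ties by average rate.
sort_key = lambda m: (-stats[m]["first_places"], -stats[m]["avg_rate"])
ordered = sorted(stats, key=sort_key)

# Competition ranking: tied sort keys share a rank, and the next
# distinct key skips ahead by the tie count (e.g. 1, 2, 2, 4).
keys = [sort_key(m) for m in ordered]
ranks = [keys.index(k) + 1 for k in keys]
```

With the sample data above, `ordered` is `["Model A", "Model B", "Model C"]` and `ranks` is `[1, 2, 3]`; if two models shared the same key, they would share a rank and the following rank would be skipped.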

Automatic metrics tab

When available, a second tab shows automatic metrics (e.g. BLEU, chrF) per model and dataset. This allows comparison of human scores with reference-based metrics.
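To make the reference-based metrics concrete, here is a deliberately simplified single-reference, sentence-level BLEU sketch using only the standard library. It is not the metric implementation behind the tab (production leaderboards typically use corpus-level, properly tokenized implementations such as sacrebleu); the function names and smoothing-free behavior are assumptions for illustration.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def simple_bleu(candidate, reference, max_n=4):
    """Simplified sentence-level BLEU against a single reference,
    with no smoothing: any zero n-gram precision yields 0.0."""
    cand, ref = candidate.split(), reference.split()
    precisions = []
    for n in range(1, max_n + 1):
        cand_ng, ref_ng = ngrams(cand, n), ngrams(ref, n)
        total = sum(cand_ng.values())
        if total == 0:  # candidate too short for this n-gram order
            return 0.0
        overlap = sum(min(c, ref_ng[g]) for g, c in cand_ng.items())
        precisions.append(overlap / total)
    if min(precisions) == 0:
        return 0.0
    # Brevity penalty: punish candidates shorter than the reference.
    bp = 1.0 if len(cand) > len(ref) else math.exp(1 - len(ref) / len(cand))
    return bp * math.exp(sum(math.log(p) for p in precisions) / max_n)
```

A perfect match scores 1.0, and a candidate sharing no n-grams with the reference scores 0.0; the automatic-metrics tab lets you check whether such scores track the human ratings on the same dataset.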