Leaderboard
The leaderboard is the public view of system performance from completed evaluations. It helps you see which models perform best on which datasets and domains.
What is the leaderboard?
- A public page showing ranked results from human evaluations (and, when available, automatic metrics).
- Columns: overall rank (competition ranking), model name, average rate (μ), rate standard deviation (σ), dataset name, number of evaluations (votes), and metadata (e.g. organization, license).
- You can pin models to keep them visible for comparison when scrolling.
How leaderboard scores are calculated and ranked
- Model names may be anonymized in tasks (e.g. Model A, Model B); a mapping (task_models_shuffles) stores the real names for scoring.
- For each model: aggregate rate and rank across all evaluated tasks, then compute the average rank (lower is better) and the average rate (higher is better).
- Sort: models that achieve better ranks more often come first (e.g. more 1st-place finishes), with ties broken by higher average rate.
- Overall rank uses competition ranking: ties get the same rank; the next rank skips (e.g. 1, 2, 2, 4).
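The steps above can be sketched in a few lines of Python. This is an illustrative reimplementation, not the actual leaderboard code; the `results` structure (model name mapped to per-task `(rank, rate)` pairs) and all names in it are hypothetical.

```python
from collections import Counter

# Hypothetical per-task results after de-anonymizing model names:
# model -> list of (rank, rate) pairs, one per evaluated task.
results = {
    "Model A": [(1, 4.6), (2, 4.1), (1, 4.8)],
    "Model B": [(2, 4.2), (1, 4.3), (2, 4.0)],
    "Model C": [(3, 3.1), (3, 3.0), (3, 3.2)],
}

def leaderboard(results):
    max_rank = max(r for scores in results.values() for r, _ in scores)
    rows = []
    for model, task_scores in results.items():
        rank_counts = Counter(r for r, _ in task_scores)
        avg_rate = sum(s for _, s in task_scores) / len(task_scores)
        # Sort key: prefer more 1st-place ranks, then more 2nd-place ranks,
        # and so on; break remaining ties by higher average rate.
        key = tuple(-rank_counts.get(r, 0) for r in range(1, max_rank + 1))
        rows.append((key + (-avg_rate,), model, avg_rate))
    rows.sort()
    # Competition ranking: tied keys share a rank, the next rank skips
    # the tied positions (1, 2, 2, 4).
    board, prev_key, prev_rank = [], None, 0
    for position, (key, model, avg_rate) in enumerate(rows, start=1):
        rank = prev_rank if key == prev_key else position
        board.append((rank, model, round(avg_rate, 2)))
        prev_key, prev_rank = key, rank
    return board
```

With the sample data above, `leaderboard(results)` puts Model A first (two 1st-place ranks), Model B second, and Model C third.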
Automatic metrics tab
When available, a second tab shows automatic metrics (e.g. BLEU, chrF) per model and dataset. This allows comparison of human scores with reference-based metrics.
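To make the reference-based metrics concrete, here is a simplified chrF-style score: average character n-gram precision and recall between a hypothesis and a reference, combined into an F-beta score. This is a sketch for intuition only (whitespace handling and n-gram weighting are simplified); the leaderboard would use an established implementation such as sacreBLEU, not this function.

```python
from collections import Counter

def char_ngrams(text, n):
    """Multiset of character n-grams; whitespace ignored (simplification)."""
    text = text.replace(" ", "")
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

def simple_chrf(hypothesis, reference, max_n=6, beta=2.0):
    """Simplified chrF on a 0-100 scale: mean n-gram precision/recall, F-beta."""
    precisions, recalls = [], []
    for n in range(1, max_n + 1):
        hyp, ref = char_ngrams(hypothesis, n), char_ngrams(reference, n)
        if not hyp or not ref:
            continue
        overlap = sum((hyp & ref).values())  # clipped n-gram matches
        precisions.append(overlap / sum(hyp.values()))
        recalls.append(overlap / sum(ref.values()))
    if not precisions:
        return 0.0
    p = sum(precisions) / len(precisions)
    r = sum(recalls) / len(recalls)
    if p + r == 0:
        return 0.0
    return (1 + beta ** 2) * p * r / (beta ** 2 * p + r) * 100
```

An identical hypothesis and reference score 100; strings with no shared character n-grams score 0, so the metric can be read on the same "higher is better" axis as the human rates.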