Contributing
There are two contribution paths: proposing benchmark tasks and submitting benchmark results.
Tasks
Task specs live in the grafana/o11y-bench repository under tasks-spec/. Generated task output is derived from those specs and should not be edited directly.
- Add or update a task spec under tasks-spec/<category>/<task-id>.yaml.
- Regenerate the benchmark tasks from the task specs before opening a PR.
- Open a pull request against grafana/o11y-bench.
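Concretely, the path pattern above means a spec's location is fully determined by its category and task id. A minimal sketch (the helper name is illustrative, not part of the benchmark tooling):

```python
from pathlib import Path

def task_spec_path(category: str, task_id: str) -> Path:
    """Return the expected spec location: tasks-spec/<category>/<task-id>.yaml."""
    return Path("tasks-spec") / category / f"{task_id}.yaml"

# e.g. task_spec_path("logs", "example-task") -> tasks-spec/logs/example-task.yaml
```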
Task contributions are reviewed for scenario quality, grading correctness, and fit with the public benchmark.
Submissions
Leaderboard submissions are stored in the Hugging Face dataset repo grafanalabs/o11y-bench-leaderboard.
- Run the benchmark and keep the full Harbor job directory for each model variant you want scored.
- Add your submission under submissions/o11y-bench/1.0/<agent>__<model>/ with a metadata.yaml.
- Include the full job contents: config.json, the top-level result.json, all trial directories, and downloaded agent and verifier artifacts.
- Open a pull request against the leaderboard dataset repo.
Submissions should preserve the benchmark’s shipped evaluation settings and currently use three attempts per task.
Review Process
Task contributions are reviewed in the benchmark repo. Result submissions are reviewed in the leaderboard dataset repo.
In both cases, maintainers are checking that the contribution matches the public benchmark structure and can be compared fairly against other runs.