Contributing

There are two contribution paths: proposing benchmark tasks and submitting benchmark results.

Tasks

Task specs live in the grafana/o11y-bench repository under tasks-spec/. Generated task output is derived from those specs and should not be edited directly.

  1. Add or update a task spec under tasks-spec/<category>/<task-id>.yaml.
  2. Regenerate the benchmark tasks from the task specs before opening a PR.
  3. Open a pull request against grafana/o11y-bench.

Task contributions are reviewed for scenario quality, grading correctness, and fit with the public benchmark.
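As a rough illustration, a task spec might look like the sketch below. Every field name here is hypothetical; the real schema is defined by grafana/o11y-bench, so check an existing spec under tasks-spec/ before writing your own.

```yaml
# tasks-spec/<category>/<task-id>.yaml
# Hypothetical field names for illustration only -- copy the structure
# of an existing spec in the repo rather than this sketch.
id: <task-id>
category: <category>
description: >
  One-paragraph statement of the scenario the agent must handle.
grading:
  # How the generated task is scored; the actual grading schema
  # lives in the benchmark repo.
  verifier: <verifier-name>
```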

Submissions

Leaderboard submissions are stored in the Hugging Face dataset repo grafanalabs/o11y-bench-leaderboard.

  1. Run the benchmark and keep the full Harbor job directory for each model variant you want scored.
  2. Add your submission under submissions/o11y-bench/1.0/<agent>__<model>/ with a metadata.yaml.
  3. Include the full job contents: config.json, top-level result.json, all trial directories, and downloaded agent and verifier artifacts.
  4. Open a pull request against the leaderboard dataset repo.

Submissions should use the benchmark’s shipped evaluation settings unchanged; the benchmark currently runs three attempts per task.
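Before opening a pull request, it can help to sanity-check that a submission directory contains everything listed above. The following is an informal sketch, not an official tool: it only verifies that the files named in this guide (metadata.yaml, config.json, result.json) and at least one trial directory are present.

```python
"""Pre-submission sanity check for an o11y-bench leaderboard entry.

Informal sketch: checks only the files named in the contribution guide
under submissions/o11y-bench/1.0/<agent>__<model>/.
"""
from pathlib import Path

# Top-level files the guide says every submission must include.
REQUIRED_FILES = ["metadata.yaml", "config.json", "result.json"]


def check_submission(submission_dir: str) -> list[str]:
    """Return a list of problems found in the submission directory."""
    root = Path(submission_dir)
    if not root.is_dir():
        return [f"{submission_dir} is not a directory"]
    problems = [
        f"missing {name}" for name in REQUIRED_FILES
        if not (root / name).is_file()
    ]
    # Trial directories hold the per-attempt job output; the guide says
    # to include all of them, so finding none is a red flag.
    if not any(p.is_dir() for p in root.iterdir()):
        problems.append("no trial directories found")
    return problems
```

An empty returned list means the basic layout looks right; it does not confirm that the run used the shipped evaluation settings.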

Review Process

Task contributions are reviewed in the benchmark repo. Result submissions are reviewed in the leaderboard dataset repo.

In both cases, maintainers check that the contribution matches the public benchmark structure and can be compared fairly against other runs.