OmniGAIA: A Benchmark for Omni-Modal AI Agents

Xiaoxi Li¹˒², Wenxiang Jiao², Jiarui Jin², Shijian Wang³, Guanting Dong¹, Jiajie Jin¹, Hao Wang⁴,
Yinuo Wang⁵, Ji-Rong Wen¹, Yuan Lu², Zhicheng Dou¹†

¹Renmin University of China · ²Xiaohongshu Inc. · ³Southeast University · ⁴Zhejiang University · ⁵Tsinghua University

💡 What is OmniGAIA?

OmniGAIA is a challenging benchmark for native omni-modal agents, featuring video / image / audio inputs across 9 real-world domains with 360 tasks. It explicitly requires multi-hop reasoning and multi-turn external tool use (web search, page browsing, code execution) to produce verifiable open-form answers.

The benchmark covers two reasoning patterns: Intra-Event queries (locating a target event via contextual clues and filtering within it) and Inter-Event queries (scanning across multiple events to find recurring elements under constraints).

📖 How to Read the Leaderboard

The leaderboard shows results across 9 category domains and 3 difficulty levels.

  • Category-Wise Breakdown: Geography, Technology, History, Finance, Sports, Art, Movie, Science, Food
  • Difficulty Levels: Easy, Medium, Hard
  • Overall: Aggregate accuracy across all tasks

Bold values = best in group  ·  Underlined values = second-best in group. Shaded rows indicate our OmniAtlas system.

📊 Evaluation Metrics

Metrics are computed over the full task set:

  • Pass@1 (LLM-as-Judge): An LLM judge determines whether the predicted answer is semantically equivalent to the ground truth; a task passes if the model's first answer is judged correct. This is the primary metric.

Scores reported on the leaderboard represent Pass@1 accuracy (%).
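As a rough illustration (not the official scorer), Pass@1 can be aggregated from the per-task LLM-judge flags that appear as `llm_equal` in the submission format; the helper name `pass_at_1` is hypothetical:

```python
def pass_at_1(predictions):
    """Pass@1 (%): fraction of tasks whose LLM-judge flag is 1."""
    flags = [int(p["llm_equal"]) for p in predictions]
    return 100.0 * sum(flags) / len(flags)

# Toy example: 3 of 4 tasks judged correct -> 75.0
preds = [
    {"id": "1", "llm_equal": "1"},
    {"id": "2", "llm_equal": "0"},
    {"id": "3", "llm_equal": "1"},
    {"id": "4", "llm_equal": "1"},
]
print(round(pass_at_1(preds), 1))  # 75.0
```

The same per-task flags can be sliced by domain or difficulty to reproduce the category-wise columns of the leaderboard.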

📬 How to Submit

Prepare a .json file with two fields: meta (your method info) and predictions (your model outputs).

Scroll down to the Submit section below, upload the file, and the system will process your submission. Maintainers will review, run evaluation, and publish your results on the leaderboard.

Leaderboard

Model rankings on the official test split. Use search to filter by model name. Scores are percentages (higher is better).

| Rank | Model / System | Org | Date | Overall | Easy | Med. | Hard | Geo. | Tech. | Hist. | Fin. | Sport | Art | Movie | Sci. | Food |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | Gemini-3-Pro | Google | 2025.12 | 62.5 | 78.7 | 61.9 | 38.5 | 65.2 | 59.2 | 62.1 | 72.0 | 78.4 | 52.8 | 48.5 | 42.3 | 88.9 |
| 2 | Gemini-3-Flash | Google | 2025.12 | 51.7 | 67.2 | 46.9 | 37.2 | 50.7 | 57.1 | 44.8 | 48.0 | 59.5 | 55.6 | 54.6 | 38.5 | 61.1 |
| 3 | Gemini-2.5-Pro | Google | 2025.3 | 30.8 | 41.8 | 26.9 | 21.8 | 23.2 | 28.6 | 32.8 | 20.0 | 32.4 | 41.7 | 42.4 | 26.9 | 33.3 |
| 4 | OmniAtlas-Qwen-3-30B | RUC-NLPIR | 2026.2 | 20.8 | 31.1 | 18.8 | 9.0 | 10.1 | 30.6 | 29.9 | 32.0 | 18.9 | 16.7 | 12.1 | 11.5 | 27.8 |
| 5 | Qwen-3-Omni-30B | Alibaba | 2025.9 | 13.3 | 19.7 | 10.6 | 9.0 | 8.7 | 14.3 | 11.9 | 28.0 | 10.8 | 13.9 | 9.1 | 15.4 | 22.2 |
| 6 | OmniAtlas-Qwen-2.5-7B | RUC-NLPIR | 2026.2 | 13.3 | 22.1 | 11.3 | 3.9 | 8.7 | 18.4 | 16.4 | 4.0 | 16.2 | 22.2 | 3.0 | 7.7 | 22.2 |
| 7 | LongCat-Flash-Omni-560B | Inclusion AI | 2025.9 | 11.1 | 16.4 | 9.4 | 6.4 | 8.7 | 10.2 | 16.4 | 12.0 | 10.8 | 8.3 | 6.1 | 11.5 | 16.7 |
| 8 | OmniAtlas-Qwen-2.5-3B | RUC-NLPIR | 2026.2 | 10.3 | 13.9 | 10.0 | 5.1 | 4.4 | 12.2 | 16.7 | 4.0 | 16.2 | 11.1 | 3.0 | 11.5 | 11.1 |
| 9 | Gemini-2.5-Flash-Lite | Google | 2025.3 | 8.6 | 9.8 | 8.1 | 7.7 | 5.8 | 8.2 | 14.9 | 4.0 | 10.8 | 8.3 | 6.1 | 3.9 | 11.1 |
| 10 | Ming-Flash-Omni-100B | Inclusion AI | 2025.8 | 8.3 | 12.3 | 7.5 | 3.8 | 5.8 | 8.2 | 10.4 | 12.0 | 8.1 | 5.6 | 6.1 | 11.5 | 11.1 |
| 11 | Ming-Lite-Omni-1.5-20B | Inclusion AI | 2025.5 | 3.9 | 4.9 | 3.8 | 2.6 | 2.9 | 6.1 | 1.5 | 4.0 | 5.4 | 2.8 | 6.1 | 7.7 | 5.6 |
| 12 | Qwen-2.5-Omni-7B | Alibaba | 2025.4 | 3.6 | 8.2 | 1.3 | 1.3 | 1.5 | 4.1 | 7.5 | 4.0 | 0.0 | 2.8 | 0.0 | 7.7 | 5.6 |
| 13 | MiniCPM-O-2.6-8B | OpenBMB | 2025.1 | 3.1 | 3.3 | 2.5 | 3.8 | 2.9 | 2.0 | 1.5 | 0.0 | 2.7 | 8.3 | 3.0 | 3.8 | 5.6 |
| 14 | Baichuan-Omni-1.5-8B | Baichuan AI | 2025.3 | 2.8 | 4.9 | 2.5 | 0.0 | 2.9 | 4.1 | 3.0 | 4.0 | 2.7 | 0.0 | 3.0 | 3.8 | 0.0 |
| 15 | Qwen-2.5-Omni-3B | Alibaba | 2025.4 | 1.4 | 1.6 | 1.9 | 0.0 | 0.0 | 2.0 | 4.5 | 0.0 | 0.0 | 0.0 | 0.0 | 3.9 | 0.0 |

📬 Submit via Upload

Upload a .json file containing your method metadata and predictions. The leaderboard reads published rows from a tracked Hugging Face dataset repo, and each uploaded submission is also archived there for auditing.

  1. 📤 Upload: submit your JSON file below.
  2. 🔍 Review: maintainers review your submission.
  3. Publish: scores are evaluated and published to the leaderboard.

JSON Format

{
  "meta": {
    "method_name": "My-Agent",
    "organization": "My-Org",
    "project_url": "https://..."
  },
  "predictions": [
    {
      "id": "1",
      "messages": [
        {"role": "system", "content": "..."},
        {"role": "user", "content": "..."},
        {"role": "assistant", "content": "..."}
      ],
      "predicted_answer": "final answer",
      "llm_equal": "0/1"
    }
    ...
  ]
}

Field Descriptions

  • meta.method_name: display name shown on the leaderboard
  • meta.organization: your team or organization name
  • meta.project_url: link to paper, code, or project page
  • predictions: a non-empty list of task objects
  • predictions[i].id: task ID
  • predictions[i].messages: a non-empty message list containing the model's full execution process
  • predictions[i].predicted_answer: final answer for this task
  • predictions[i].llm_equal: LLM-as-judge result/flag for this task