OmniGAIA: A Benchmark for Omni-Modal AI Agents

Xiaoxi Li¹˒², Wenxiang Jiao², Jiarui Jin², Shijian Wang³, Guanting Dong¹, Jiajie Jin¹, Hao Wang⁴,
Yinuo Wang⁵, Ji-Rong Wen¹, Yuan Lu², Zhicheng Dou¹†

¹Renmin University of China · ²Xiaohongshu Inc. · ³Southeast University · ⁴Zhejiang University · ⁵Tsinghua University

🏆 OmniGAIA Leaderboard

Track and compare models on the omni-modal agent tasks

💡 What is OmniGAIA?

OmniGAIA is an advanced benchmark designed to evaluate native omni-modal AI agents capable of joint reasoning across vision, audio, and language. It features 360 open-form tasks across 9 real-world domains, tested under two primary settings: video with audio and image with audio.

Unlike standard perception-based Q&A, OmniGAIA tasks demand complex multi-hop reasoning and tool-integrated problem solving (e.g., executing web searches, browsing, and code execution).

📖 How to Read the Leaderboard

The leaderboard ranks models by their Overall score on the official test split, with all values reported as percentages (higher is better). Key columns include:

  • Overall: The average score across all 360 tasks.
  • Difficulty: Scores broken down by task complexity (Easy, Medium, Hard).
  • Category: Scores broken down by domain (Geography, Technology, History, Finance, Sports, Art, Movies, Science, and Food).

📊 Evaluation Metrics

The leaderboard uses Pass@1 accuracy (%) as the primary metric. Task-level correctness is determined by a two-stage evaluation:

  • Stage 1 (Exact Match): The system extracts the text between <answer> and </answer> tags in the predicted output. If this text exactly matches the labeled answer, the prediction is marked correct.
  • Stage 2 (LLM-as-a-Judge): If exact match fails, an LLM judge (recorded as llm_equal) assesses whether the extracted text is semantically equivalent to the labeled answer.

📬 How to Submit

To submit your model's results, upload a .json file containing both a meta block and a predictions list in the designated Submit section.

  • Required fields: The file must include meta.method_name, and every prediction entry must contain id, messages, predicted_answer, and llm_equal.
  • Prediction trace: Include the complete interaction trajectory in the messages field so submissions can be reviewed transparently.

We will manually review and publish valid submissions.

Leaderboard

Model rankings on the official test split. Use search to filter by model name. Scores are percentages (higher is better).

| Rank | Model / System | Organization | Date | Overall | Easy | Med. | Hard | Geo. | Tech. | Hist. | Fin. | Sport | Art | Movie | Sci. | Food |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | Gemini-3-Pro | Google | 2025.12 | 62.5 | 78.7 | 61.9 | 38.5 | 65.2 | 59.2 | 62.1 | 72.0 | 78.4 | 52.8 | 48.5 | 42.3 | 88.9 |
| 2 | Gemini-3-Flash | Google | 2025.12 | 51.7 | 67.2 | 46.9 | 37.2 | 50.7 | 57.1 | 44.8 | 48.0 | 59.5 | 55.6 | 54.6 | 38.5 | 61.1 |
| 3 | Gemini-2.5-Pro | Google | 2025.3 | 30.8 | 41.8 | 26.9 | 21.8 | 23.2 | 28.6 | 32.8 | 20.0 | 32.4 | 41.7 | 42.4 | 26.9 | 33.3 |
| 4 | OmniAtlas-Qwen-3-30B | RUC-NLPIR | 2026.2 | 20.8 | 31.1 | 18.8 | 9.0 | 10.1 | 30.6 | 29.9 | 32.0 | 18.9 | 16.7 | 12.1 | 11.5 | 27.8 |
| 5 | Qwen-3-Omni-30B | Alibaba | 2025.9 | 13.3 | 19.7 | 10.6 | 9.0 | 8.7 | 14.3 | 11.9 | 28.0 | 10.8 | 13.9 | 9.1 | 15.4 | 22.2 |
| 6 | OmniAtlas-Qwen-2.5-7B | RUC-NLPIR | 2026.2 | 13.3 | 22.1 | 11.3 | 3.9 | 8.7 | 18.4 | 16.4 | 4.0 | 16.2 | 22.2 | 3.0 | 7.7 | 22.2 |
| 7 | LongCat-Flash-Omni-560B | Inclusion AI | 2025.9 | 11.1 | 16.4 | 9.4 | 6.4 | 8.7 | 10.2 | 16.4 | 12.0 | 10.8 | 8.3 | 6.1 | 11.5 | 16.7 |
| 8 | OmniAtlas-Qwen-2.5-3B | RUC-NLPIR | 2026.2 | 10.3 | 13.9 | 10.0 | 5.1 | 4.4 | 12.2 | 16.7 | 4.0 | 16.2 | 11.1 | 3.0 | 11.5 | 11.1 |
| 9 | Gemini-2.5-Flash-Lite | Google | 2025.3 | 8.6 | 9.8 | 8.1 | 7.7 | 5.8 | 8.2 | 14.9 | 4.0 | 10.8 | 8.3 | 6.1 | 3.9 | 11.1 |
| 10 | Ming-Flash-Omni-100B | Inclusion AI | 2025.8 | 8.3 | 12.3 | 7.5 | 3.8 | 5.8 | 8.2 | 10.4 | 12.0 | 8.1 | 5.6 | 6.1 | 11.5 | 11.1 |
| 11 | Ming-Lite-Omni-1.5-20B | Inclusion AI | 2025.5 | 3.9 | 4.9 | 3.8 | 2.6 | 2.9 | 6.1 | 1.5 | 4.0 | 5.4 | 2.8 | 6.1 | 7.7 | 5.6 |
| 12 | Qwen-2.5-Omni-7B | Alibaba | 2025.4 | 3.6 | 8.2 | 1.3 | 1.3 | 1.5 | 4.1 | 7.5 | 4.0 | 0.0 | 2.8 | 0.0 | 7.7 | 5.6 |
| 13 | MiniCPM-O-2.6-8B | OpenBMB | 2025.1 | 3.1 | 3.3 | 2.5 | 3.8 | 2.9 | 2.0 | 1.5 | 0.0 | 2.7 | 8.3 | 3.0 | 3.8 | 5.6 |
| 14 | Baichuan-Omni-1.5-8B | Baichuan AI | 2025.3 | 2.8 | 4.9 | 2.5 | 0.0 | 2.9 | 4.1 | 3.0 | 4.0 | 2.7 | 0.0 | 3.0 | 3.8 | 0.0 |
| 15 | Qwen-2.5-Omni-3B | Alibaba | 2025.4 | 1.4 | 1.6 | 1.9 | 0.0 | 0.0 | 2.0 | 4.5 | 0.0 | 0.0 | 0.0 | 0.0 | 3.9 | 0.0 |

📬 Submit via Upload

Upload a .json file containing your method metadata and predictions. We will review your submission, run evaluation, and merge results into the leaderboard.

  1. Upload: Submit your JSON file below.
  2. Review: We review your submission.
  3. Publish: Scores are evaluated and published.

JSON Format

{
  "meta": {
    "method_name": "My-Agent",
    "organization": "My-Org",
    "project_url": "https://..."
  },
  "predictions": [
    {
      "id": "1",
      "messages": [
        {"role": "system", "content": "..."},
        {"role": "user", "content": "..."},
        {"role": "assistant", "content": "..."}
      ],
      "predicted_answer": "final answer",
      "llm_equal": "0/1"
    },
    ...
  ]
}

Field Descriptions

  • meta.method_name: Display name shown on the leaderboard.
  • meta.organization: Your team or organization name.
  • meta.project_url: Link to paper, code, or project page.
  • predictions: A non-empty list of task objects.
  • predictions[i].id: Task ID.
  • predictions[i].messages: A non-empty message list containing the model's full execution process.
  • predictions[i].predicted_answer: Final answer for this task.
  • predictions[i].llm_equal: LLM-as-judge result/flag for this task.
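Before uploading, it can help to sanity-check your file against these requirements. The sketch below is an unofficial client-side check (validate_submission is a hypothetical helper, not part of the OmniGAIA tooling) that verifies the required fields listed above:

```python
import json

# Keys every prediction entry must contain, per the submission rules.
REQUIRED_PREDICTION_KEYS = {"id", "messages", "predicted_answer", "llm_equal"}

def validate_submission(path: str) -> list[str]:
    """Return a list of problems found in a submission file (empty = looks OK)."""
    errors = []
    with open(path) as f:
        data = json.load(f)

    # meta.method_name is required for the leaderboard display name.
    if not data.get("meta", {}).get("method_name"):
        errors.append("meta.method_name is required")

    preds = data.get("predictions")
    if not preds:
        errors.append("predictions must be a non-empty list")
        return errors

    for i, p in enumerate(preds):
        missing = REQUIRED_PREDICTION_KEYS - set(p)
        if missing:
            errors.append(f"predictions[{i}] missing keys: {sorted(missing)}")
        # The full interaction trajectory must be present for review.
        if not p.get("messages"):
            errors.append(f"predictions[{i}].messages must be non-empty")
    return errors
```

Running it on your .json before uploading catches the structural problems that would otherwise fail manual review; it does not check whether the llm_equal flags are accurate.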