OmniGAIA: A Benchmark for Omni-Modal AI Agents
Xiaoxi Li¹˒², Wenxiang Jiao², Jiarui Jin², Shijian Wang³, Guanting Dong¹, Jiajie Jin¹, Hao Wang⁴,
Yinuo Wang⁵, Ji-Rong Wen¹, Yuan Lu², Zhicheng Dou¹†
¹Renmin University of China · ²Xiaohongshu Inc. · ³Southeast University · ⁴Zhejiang University · ⁵Tsinghua University
🏆 OmniGAIA Leaderboard
Track and compare models on the omni-modal agent tasks
💡 What is OmniGAIA?
OmniGAIA is an advanced benchmark designed to evaluate native omni-modal AI agents capable of joint reasoning across vision, audio, and language. It features 360 open-form tasks across 9 real-world domains, tested under two primary settings: video with audio and image with audio.
Unlike standard perception-based Q&A, OmniGAIA tasks demand complex multi-hop reasoning and tool-integrated problem solving (e.g., executing web searches, browsing, and code execution).
📖 How to Read the Leaderboard
The leaderboard ranks models based on their Overall score on the official test split, with all values reported as percentages (higher is better). Key metrics include:
- Overall: The average score across all 360 tasks.
- Difficulty: Results divided by task complexity (Easy, Medium, Hard).
- Category: Results divided by specific domains (Geography, Technology, History, Finance, Sports, Art, Movies, Science, and Food).
📊 Evaluation Metrics
The leaderboard uses Pass@1 Accuracy (%) as the primary metric. Task-level correctness is determined through a two-stage evaluation:
- Stage 1 (Exact Match): The system extracts the text between `<answer>` and `</answer>` tags in the predicted output. If this text matches the labeled answer exactly, the prediction is marked correct.
- Stage 2 (LLM-as-a-Judge): If the exact match fails, the system uses an LLM-as-a-Judge (`llm_equal`) to assess whether the extracted text is semantically equivalent to the labeled answer.
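The two-stage check above can be sketched as follows. This is an illustrative implementation, not the official scoring code; the `llm_judge` callable stands in for whatever LLM-as-a-Judge backend the benchmark actually uses.

```python
import re

def extract_answer(prediction: str):
    """Pull the text between <answer> and </answer> tags, or None if absent."""
    match = re.search(r"<answer>(.*?)</answer>", prediction, re.DOTALL)
    return match.group(1).strip() if match else None

def score_task(prediction: str, label: str, llm_judge) -> bool:
    """Two-stage correctness check: exact match first, LLM judge as fallback."""
    extracted = extract_answer(prediction)
    if extracted is None:
        return False
    # Stage 1: exact match against the labeled answer
    if extracted == label.strip():
        return True
    # Stage 2: semantic equivalence via an LLM judge (hypothetical callable)
    return llm_judge(extracted, label)
```

Pass@1 Accuracy is then simply the fraction of the 360 tasks for which `score_task` returns `True` on the model's first attempt.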
📬 How to Submit
To submit your model's results, upload a formatted `.json` file containing both `meta` and `predictions` data in the designated Submit section.
- Required fields: Your file must include `meta.method_name`. Additionally, every prediction entry must contain `id`, `messages`, `predicted_answer`, and `llm_equal`.
- Prediction trace: You must include the complete interaction trajectory in the `messages` field to ensure a transparent review process.
We will manually review and publish valid submissions.
Leaderboard
Model rankings on the official test split. Use search to filter by model name. Scores are percentages (higher is better).
| Rank | Model / System | Date | Overall | Easy | Med. | Hard | Geo. | Tech. | Hist. | Fin. | Sport | Art | Movie | Sci. | Food |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | Gemini-3-Pro (Google) | 2025.12 | 62.5 | 78.7 | 61.9 | 38.5 | 65.2 | 59.2 | 62.1 | 72.0 | 78.4 | 52.8 | 48.5 | 42.3 | 88.9 |
| 2 | Gemini-3-Flash (Google) | 2025.12 | 51.7 | 67.2 | 46.9 | 37.2 | 50.7 | 57.1 | 44.8 | 48.0 | 59.5 | 55.6 | 54.6 | 38.5 | 61.1 |
| 3 | Gemini-2.5-Pro (Google) | 2025.3 | 30.8 | 41.8 | 26.9 | 21.8 | 23.2 | 28.6 | 32.8 | 20.0 | 32.4 | 41.7 | 42.4 | 26.9 | 33.3 |
| 4 | OmniAtlas-Qwen-3-30B (RUC-NLPIR) | 2026.2 | 20.8 | 31.1 | 18.8 | 9.0 | 10.1 | 30.6 | 29.9 | 32.0 | 18.9 | 16.7 | 12.1 | 11.5 | 27.8 |
| 5 | Qwen-3-Omni-30B (Alibaba) | 2025.9 | 13.3 | 19.7 | 10.6 | 9.0 | 8.7 | 14.3 | 11.9 | 28.0 | 10.8 | 13.9 | 9.1 | 15.4 | 22.2 |
| 6 | OmniAtlas-Qwen-2.5-7B (RUC-NLPIR) | 2026.2 | 13.3 | 22.1 | 11.3 | 3.9 | 8.7 | 18.4 | 16.4 | 4.0 | 16.2 | 22.2 | 3.0 | 7.7 | 22.2 |
| 7 | LongCat-Flash-Omni-560B (Inclusion AI) | 2025.9 | 11.1 | 16.4 | 9.4 | 6.4 | 8.7 | 10.2 | 16.4 | 12.0 | 10.8 | 8.3 | 6.1 | 11.5 | 16.7 |
| 8 | OmniAtlas-Qwen-2.5-3B (RUC-NLPIR) | 2026.2 | 10.3 | 13.9 | 10.0 | 5.1 | 4.4 | 12.2 | 16.7 | 4.0 | 16.2 | 11.1 | 3.0 | 11.5 | 11.1 |
| 9 | Gemini-2.5-Flash-Lite (Google) | 2025.3 | 8.6 | 9.8 | 8.1 | 7.7 | 5.8 | 8.2 | 14.9 | 4.0 | 10.8 | 8.3 | 6.1 | 3.9 | 11.1 |
| 10 | Ming-Flash-Omni-100B (Inclusion AI) | 2025.8 | 8.3 | 12.3 | 7.5 | 3.8 | 5.8 | 8.2 | 10.4 | 12.0 | 8.1 | 5.6 | 6.1 | 11.5 | 11.1 |
| 11 | Ming-Lite-Omni-1.5-20B (Inclusion AI) | 2025.5 | 3.9 | 4.9 | 3.8 | 2.6 | 2.9 | 6.1 | 1.5 | 4.0 | 5.4 | 2.8 | 6.1 | 7.7 | 5.6 |
| 12 | Qwen-2.5-Omni-7B (Alibaba) | 2025.4 | 3.6 | 8.2 | 1.3 | 1.3 | 1.5 | 4.1 | 7.5 | 4.0 | 0.0 | 2.8 | 0.0 | 7.7 | 5.6 |
| 13 | MiniCPM-O-2.6-8B (OpenBMB) | 2025.1 | 3.1 | 3.3 | 2.5 | 3.8 | 2.9 | 2.0 | 1.5 | 0.0 | 2.7 | 8.3 | 3.0 | 3.8 | 5.6 |
| 14 | Baichuan-Omni-1.5-8B (Baichuan AI) | 2025.3 | 2.8 | 4.9 | 2.5 | 0.0 | 2.9 | 4.1 | 3.0 | 4.0 | 2.7 | 0.0 | 3.0 | 3.8 | 0.0 |
| 15 | Qwen-2.5-Omni-3B (Alibaba) | 2025.4 | 1.4 | 1.6 | 1.9 | 0.0 | 0.0 | 2.0 | 4.5 | 0.0 | 0.0 | 0.0 | 0.0 | 3.9 | 0.0 |
📬 Submit via Upload
Upload a .json file containing your method metadata and predictions. We will review your submission, run evaluation, and merge results into the leaderboard.
JSON Format
```json
{
  "meta": {
    "method_name": "My-Agent",
    "organization": "My-Org",
    "project_url": "https://..."
  },
  "predictions": [
    {
      "id": "1",
      "messages": [
        {"role": "system", "content": "..."},
        {"role": "user", "content": "..."},
        {"role": "assistant", "content": "..."}
      ],
      "predicted_answer": "final answer",
      "llm_equal": "0/1"
    },
    ...
  ]
}
```
Field Descriptions
- `meta.method_name`: Name of the submitted method (required).
- `meta.organization`: Submitting organization.
- `meta.project_url`: Link to the project page or repository.
- `predictions`: List of per-task prediction entries.
- `predictions[i].id`: Task identifier.
- `predictions[i].messages`: Complete interaction trajectory for the task (required for transparent review).
- `predictions[i].predicted_answer`: The model's final answer for the task.
- `predictions[i].llm_equal`: LLM-as-a-Judge equivalence verdict (0 or 1).
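Before uploading, it may help to sanity-check your file against the required fields. The sketch below is an unofficial pre-flight check based on the schema above; it validates an already-loaded submission dict rather than enforcing anything the review process does.

```python
# Unofficial pre-flight check for the submission schema described above.
REQUIRED_PRED_FIELDS = {"id", "messages", "predicted_answer", "llm_equal"}

def validate_submission(data: dict) -> list:
    """Return a list of problems found in a submission dict (empty = looks valid)."""
    problems = []
    # meta.method_name is the one required meta field
    if "method_name" not in data.get("meta", {}):
        problems.append("missing meta.method_name")
    for i, pred in enumerate(data.get("predictions", [])):
        missing = REQUIRED_PRED_FIELDS - pred.keys()
        if missing:
            problems.append(f"prediction {i}: missing {sorted(missing)}")
        elif not pred["messages"]:
            # the full interaction trajectory is required for review
            problems.append(f"prediction {i}: empty messages trace")
    return problems
```

Running this on your parsed `.json` (e.g. `validate_submission(json.load(open("submission.json")))`) before uploading can catch missing fields early.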