OmniGAIA: A Benchmark for Omni-Modal AI Agents
Xiaoxi Li¹˒², Wenxiang Jiao², Jiarui Jin², Shijian Wang³, Guanting Dong¹, Jiajie Jin¹, Hao Wang⁴,
Yinuo Wang⁵, Ji-Rong Wen¹, Yuan Lu², Zhicheng Dou¹†
¹Renmin University of China · ²Xiaohongshu Inc. · ³Southeast University · ⁴Zhejiang University · ⁵Tsinghua University
💡 What is OmniGAIA?
OmniGAIA is a challenging benchmark for native omni-modal agents, comprising 360 tasks with video, image, and audio inputs across 9 real-world domains. Tasks explicitly require multi-hop reasoning and multi-turn external tool use (web search, page browsing, code execution) to produce verifiable open-form answers.
The benchmark covers two reasoning patterns: Intra-Event queries (locating a target event via contextual clues and filtering within it) and Inter-Event queries (scanning across multiple events to find recurring elements under constraints).
📖 How to Read the Leaderboard
The leaderboard shows results across 9 category domains and 3 difficulty levels.
- Category-Wise Breakdown: Geography, Technology, History, Finance, Sports, Art, Movie, Science, Food
- Difficulty Levels: Easy, Medium, Hard
- Overall: Aggregate accuracy across all tasks
Bold values = best in group · Underlined values = second-best in group. Shaded rows indicate our OmniAtlas system.
📊 Evaluation Metrics
All metrics are computed over the full task set:
- Pass@1 (LLM-as-Judge): An LLM judge determines if the predicted answer is semantically equivalent to the ground truth. This is the primary metric.
Scores reported on the leaderboard represent Pass@1 accuracy (%).
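As a minimal sketch, the aggregation behind the reported score reduces to averaging the judge's binary verdicts (the function name and the example verdicts below are illustrative, not from the official evaluation code):

```python
def pass_at_1(judgments):
    """Aggregate binary LLM-judge verdicts (1 = semantically equivalent
    to ground truth, 0 = not) into Pass@1 accuracy as a percentage."""
    if not judgments:
        raise ValueError("no judgments to aggregate")
    return 100.0 * sum(judgments) / len(judgments)

# Hypothetical run where 3 of 8 tasks were judged correct.
print(pass_at_1([1, 0, 0, 1, 0, 1, 0, 0]))  # 37.5
```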
📬 How to Submit
Prepare a `.json` file with two fields: `meta` (your method info) and `predictions` (your model outputs).
Scroll down to the Submit section below, upload the file, and the system will process your submission. Maintainers will review, run evaluation, and publish your results on the leaderboard.
Leaderboard
Model rankings on the official test split. Use search to filter by model name. Scores are percentages (higher is better).
| Rank | Model / System | Date | Overall | Easy | Med. | Hard | Geo. | Tech. | Hist. | Fin. | Sport | Art | Movie | Sci. | Food |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | Gemini-3-Pro (Google) | 2025.12 | **62.5** | **78.7** | **61.9** | **38.5** | **65.2** | **59.2** | **62.1** | **72.0** | **78.4** | <u>52.8</u> | <u>48.5</u> | **42.3** | **88.9** |
| 2 | Gemini-3-Flash (Google) | 2025.12 | <u>51.7</u> | <u>67.2</u> | <u>46.9</u> | <u>37.2</u> | <u>50.7</u> | <u>57.1</u> | <u>44.8</u> | <u>48.0</u> | <u>59.5</u> | **55.6** | **54.6** | <u>38.5</u> | <u>61.1</u> |
| 3 | Gemini-2.5-Pro (Google) | 2025.3 | 30.8 | 41.8 | 26.9 | 21.8 | 23.2 | 28.6 | 32.8 | 20.0 | 32.4 | 41.7 | 42.4 | 26.9 | 33.3 |
| 4 | OmniAtlas-Qwen-3-30B (RUC-NLPIR) | 2026.2 | 20.8 | 31.1 | 18.8 | 9.0 | 10.1 | 30.6 | 29.9 | 32.0 | 18.9 | 16.7 | 12.1 | 11.5 | 27.8 |
| 5 | Qwen-3-Omni-30B (Alibaba) | 2025.9 | 13.3 | 19.7 | 10.6 | 9.0 | 8.7 | 14.3 | 11.9 | 28.0 | 10.8 | 13.9 | 9.1 | 15.4 | 22.2 |
| 6 | OmniAtlas-Qwen-2.5-7B (RUC-NLPIR) | 2026.2 | 13.3 | 22.1 | 11.3 | 3.9 | 8.7 | 18.4 | 16.4 | 4.0 | 16.2 | 22.2 | 3.0 | 7.7 | 22.2 |
| 7 | LongCat-Flash-Omni-560B (Inclusion AI) | 2025.9 | 11.1 | 16.4 | 9.4 | 6.4 | 8.7 | 10.2 | 16.4 | 12.0 | 10.8 | 8.3 | 6.1 | 11.5 | 16.7 |
| 8 | OmniAtlas-Qwen-2.5-3B (RUC-NLPIR) | 2026.2 | 10.3 | 13.9 | 10.0 | 5.1 | 4.4 | 12.2 | 16.7 | 4.0 | 16.2 | 11.1 | 3.0 | 11.5 | 11.1 |
| 9 | Gemini-2.5-Flash-Lite (Google) | 2025.3 | 8.6 | 9.8 | 8.1 | 7.7 | 5.8 | 8.2 | 14.9 | 4.0 | 10.8 | 8.3 | 6.1 | 3.9 | 11.1 |
| 10 | Ming-Flash-Omni-100B (Inclusion AI) | 2025.8 | 8.3 | 12.3 | 7.5 | 3.8 | 5.8 | 8.2 | 10.4 | 12.0 | 8.1 | 5.6 | 6.1 | 11.5 | 11.1 |
| 11 | Ming-Lite-Omni-1.5-20B (Inclusion AI) | 2025.5 | 3.9 | 4.9 | 3.8 | 2.6 | 2.9 | 6.1 | 1.5 | 4.0 | 5.4 | 2.8 | 6.1 | 7.7 | 5.6 |
| 12 | Qwen-2.5-Omni-7B (Alibaba) | 2025.4 | 3.6 | 8.2 | 1.3 | 1.3 | 1.5 | 4.1 | 7.5 | 4.0 | 0.0 | 2.8 | 0.0 | 7.7 | 5.6 |
| 13 | MiniCPM-O-2.6-8B (OpenBMB) | 2025.1 | 3.1 | 3.3 | 2.5 | 3.8 | 2.9 | 2.0 | 1.5 | 0.0 | 2.7 | 8.3 | 3.0 | 3.8 | 5.6 |
| 14 | Baichuan-Omni-1.5-8B (Baichuan AI) | 2025.3 | 2.8 | 4.9 | 2.5 | 0.0 | 2.9 | 4.1 | 3.0 | 4.0 | 2.7 | 0.0 | 3.0 | 3.8 | 0.0 |
| 15 | Qwen-2.5-Omni-3B (Alibaba) | 2025.4 | 1.4 | 1.6 | 1.9 | 0.0 | 0.0 | 2.0 | 4.5 | 0.0 | 0.0 | 0.0 | 0.0 | 3.9 | 0.0 |
📬 Submit via Upload
Upload a .json file containing your method metadata and predictions. The leaderboard reads published rows from a tracked Hugging Face dataset repo, and each uploaded submission is also archived there for auditing.
JSON Format
"meta": {
"method_name": "My-Agent",
"organization": "My-Org",
"project_url": "https://..."
},
"predictions": [
{
"id": "1",
"messages": [
{"role": "system", "content": "..."},
{"role": "user", "content": "..."},
{"role": "assistant", "content": "..."}
],
"predicted_answer": "final answer",
"llm_equal": "0/1"
}
...
]
}
Field Descriptions
- `meta.method_name`: Display name of your method or system.
- `meta.organization`: Your team or organization.
- `meta.project_url`: Link to your project page or repository.
- `predictions`: List of per-task outputs.
- `predictions[i].id`: Task ID in the official test split.
- `predictions[i].messages`: Full interaction trace (system / user / assistant turns).
- `predictions[i].predicted_answer`: The model's final answer string.
- `predictions[i].llm_equal`: "1" if the LLM judge deemed the answer semantically equivalent to the ground truth, "0" otherwise.
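Before uploading, it can help to sanity-check your file against the fields above. The sketch below is an assumption-laden example (the helper name, the `https://example.org` URL, and the exact checks are ours, not part of the official submission tooling):

```python
import json

REQUIRED_META = {"method_name", "organization", "project_url"}
REQUIRED_PRED = {"id", "messages", "predicted_answer"}

def validate_submission(sub):
    """Raise ValueError if the submission dict is missing required fields."""
    missing_meta = REQUIRED_META - set(sub.get("meta", {}))
    if missing_meta:
        raise ValueError(f"meta is missing fields: {sorted(missing_meta)}")
    preds = sub.get("predictions")
    if not isinstance(preds, list) or not preds:
        raise ValueError("predictions must be a non-empty list")
    for p in preds:
        missing = REQUIRED_PRED - set(p)
        if missing:
            raise ValueError(f"prediction {p.get('id')} missing: {sorted(missing)}")
    return True

sub = {
    "meta": {
        "method_name": "My-Agent",
        "organization": "My-Org",
        "project_url": "https://example.org",  # placeholder URL
    },
    "predictions": [{
        "id": "1",
        "messages": [
            {"role": "user", "content": "..."},
            {"role": "assistant", "content": "..."},
        ],
        "predicted_answer": "final answer",
    }],
}
validate_submission(sub)
with open("submission.json", "w") as f:
    json.dump(sub, f, indent=2)
```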