OmniGAIA: A Benchmark for Omni-Modal AI Agents
Xiaoxi Li¹˒², Wenxiang Jiao², Jiarui Jin², Shijian Wang³, Guanting Dong¹, Jiajie Jin¹, Hao Wang⁴,
Yinuo Wang⁵, Ji-Rong Wen¹, Yuan Lu², Zhicheng Dou¹†
¹Renmin University of China · ²Xiaohongshu Inc. · ³Southeast University · ⁴Zhejiang University · ⁵Tsinghua University
🏆 OmniGAIA Leaderboard
Track and compare models on the omni-modal agent tasks
💡 What is OmniGAIA?
OmniGAIA is an advanced benchmark designed to evaluate native omni-modal AI agents capable of joint reasoning across vision, audio, and language. It features 360 open-form tasks across 9 real-world domains, tested under two primary settings: video with audio and image with audio.
Unlike standard perception-based Q&A, OmniGAIA tasks demand complex multi-hop reasoning and tool-integrated problem solving (e.g., executing web searches, browsing, and code execution).
📖 How to Read the Leaderboard
The leaderboard ranks models based on their Overall score on the official test split, with all values reported as percentages (higher is better). Key metrics include:
- Overall: The average score across all 360 tasks.
- Difficulty: Results divided by task complexity (Easy, Medium, Hard).
- Category: Results divided by specific domains (Geography, Technology, History, Finance, Sports, Art, Movies, Science, and Food).
📊 Evaluation Metrics
The leaderboard uses Pass@1 Accuracy (%) as the primary metric. Task-level correctness is determined through a two-stage evaluation:
- Stage 1 (Exact Match): The system extracts the text between `<answer>` and `</answer>` tags in the predicted output. If this text matches the labeled answer exactly, the prediction is marked correct.
- Stage 2 (LLM-as-a-Judge): If the exact match fails, the system uses an LLM-as-a-Judge (`llm_equal`) to assess whether the extracted text is semantically equivalent to the labeled answer.
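The two-stage check above can be sketched as follows. This is an illustrative implementation, not the official scoring code; the `llm_judge` callable stands in for whatever LLM-as-a-Judge backend the benchmark actually uses.

```python
import re

def extract_answer(prediction: str):
    """Pull the text between <answer> and </answer> tags, or None if absent."""
    match = re.search(r"<answer>(.*?)</answer>", prediction, re.DOTALL)
    return match.group(1).strip() if match else None

def score_task(prediction: str, label: str, llm_judge) -> bool:
    """Two-stage correctness check: exact match first, LLM judge as fallback."""
    extracted = extract_answer(prediction)
    if extracted is None:
        return False
    # Stage 1: exact match against the labeled answer
    if extracted == label.strip():
        return True
    # Stage 2: semantic equivalence via an LLM judge (hypothetical callable)
    return llm_judge(extracted, label)
```

Pass@1 Accuracy is then simply the fraction of the 360 tasks for which `score_task` returns `True` on the model's first attempt.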
📬 How to Submit
To submit your model's results, upload a formatted `.json` file containing both `meta` and `predictions` data in the designated Submit section.
- Required fields: Your file must include `meta.method_name`. Additionally, every prediction entry must contain `id`, `messages`, `predicted_answer`, and `llm_equal`.
- Prediction trace: You must include the complete interaction trajectory in the `messages` field to ensure a transparent review process.
We will manually review and publish valid submissions.
Leaderboard
Model rankings on the official test split. Use search to filter by model name. Scores are percentages (higher is better).
| Rank | Model / System | Date | Overall | Easy | Med. | Hard | Geo. | Tech. | Hist. | Fin. | Sport | Art | Movie | Sci. | Food |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | Gemini-3-Pro (Google) | 2025.12 | 62.5 | 78.7 | 61.9 | 38.5 | 65.2 | 59.2 | 62.1 | 72.0 | 78.4 | 52.8 | 48.5 | 42.3 | 88.9 |
| 2 | Gemini-3-Flash (Google) | 2025.12 | 51.7 | 67.2 | 46.9 | 37.2 | 50.7 | 57.1 | 44.8 | 48.0 | 59.5 | 55.6 | 54.6 | 38.5 | 61.1 |
| 3 | Gemini-2.5-Pro (Google) | 2025.3 | 30.8 | 41.8 | 26.9 | 21.8 | 23.2 | 28.6 | 32.8 | 20.0 | 32.4 | 41.7 | 42.4 | 26.9 | 33.3 |
| 4 | OmniAtlas-Qwen-3-30B (RUC-NLPIR) | 2026.2 | 20.8 | 31.1 | 18.8 | 9.0 | 10.1 | 30.6 | 29.9 | 32.0 | 18.9 | 16.7 | 12.1 | 11.5 | 27.8 |
| 5 | Qwen-3-Omni-30B (Alibaba) | 2025.9 | 13.3 | 19.7 | 10.6 | 9.0 | 8.7 | 14.3 | 11.9 | 28.0 | 10.8 | 13.9 | 9.1 | 15.4 | 22.2 |
| 6 | OmniAtlas-Qwen-2.5-7B (RUC-NLPIR) | 2026.2 | 13.3 | 22.1 | 11.3 | 3.9 | 8.7 | 18.4 | 16.4 | 4.0 | 16.2 | 22.2 | 3.0 | 7.7 | 22.2 |
| 7 | LongCat-Flash-Omni-560B (Inclusion AI) | 2025.9 | 11.1 | 16.4 | 9.4 | 6.4 | 8.7 | 10.2 | 16.4 | 12.0 | 10.8 | 8.3 | 6.1 | 11.5 | 16.7 |
| 8 | OmniAtlas-Qwen-2.5-3B (RUC-NLPIR) | 2026.2 | 10.3 | 13.9 | 10.0 | 5.1 | 4.4 | 12.2 | 16.7 | 4.0 | 16.2 | 11.1 | 3.0 | 11.5 | 11.1 |
| 9 | Gemini-2.5-Flash-Lite (Google) | 2025.3 | 8.6 | 9.8 | 8.1 | 7.7 | 5.8 | 8.2 | 14.9 | 4.0 | 10.8 | 8.3 | 6.1 | 3.9 | 11.1 |
| 10 | Ming-Flash-Omni-100B (Inclusion AI) | 2025.8 | 8.3 | 12.3 | 7.5 | 3.8 | 5.8 | 8.2 | 10.4 | 12.0 | 8.1 | 5.6 | 6.1 | 11.5 | 11.1 |
| 11 | Ming-Lite-Omni-1.5-20B (Inclusion AI) | 2025.5 | 3.9 | 4.9 | 3.8 | 2.6 | 2.9 | 6.1 | 1.5 | 4.0 | 5.4 | 2.8 | 6.1 | 7.7 | 5.6 |
| 12 | Qwen-2.5-Omni-7B (Alibaba) | 2025.4 | 3.6 | 8.2 | 1.3 | 1.3 | 1.5 | 4.1 | 7.5 | 4.0 | 0.0 | 2.8 | 0.0 | 7.7 | 5.6 |
| 13 | MiniCPM-O-2.6-8B (OpenBMB) | 2025.1 | 3.1 | 3.3 | 2.5 | 3.8 | 2.9 | 2.0 | 1.5 | 0.0 | 2.7 | 8.3 | 3.0 | 3.8 | 5.6 |
| 14 | Baichuan-Omni-1.5-8B (Baichuan AI) | 2025.3 | 2.8 | 4.9 | 2.5 | 0.0 | 2.9 | 4.1 | 3.0 | 4.0 | 2.7 | 0.0 | 3.0 | 3.8 | 0.0 |
| 15 | Qwen-2.5-Omni-3B (Alibaba) | 2025.4 | 1.4 | 1.6 | 1.9 | 0.0 | 0.0 | 2.0 | 4.5 | 0.0 | 0.0 | 0.0 | 0.0 | 3.9 | 0.0 |
📬 Submit via Upload
Upload a .json file containing your method metadata and predictions. We will review your submission, run evaluation, and merge results into the leaderboard.
JSON Format
```json
{
  "meta": {
    "method_name": "My-Agent",
    "organization": "My-Org",
    "project_url": "https://..."
  },
  "predictions": [
    {
      "id": "1",
      "messages": [
        {"role": "system", "content": "..."},
        {"role": "user", "content": "..."},
        {"role": "assistant", "content": "..."}
      ],
      "predicted_answer": "final answer",
      "llm_equal": "0/1"
    },
    ...
  ]
}
```
Field Descriptions
- `meta.method_name`: Name of the submitted method (required).
- `meta.organization`: Submitting organization.
- `meta.project_url`: Link to the project page or repository.
- `predictions`: List of per-task prediction entries.
- `predictions[i].id`: Task identifier.
- `predictions[i].messages`: Complete interaction trajectory for the task (required for transparent review).
- `predictions[i].predicted_answer`: The model's final answer for the task.
- `predictions[i].llm_equal`: LLM-as-a-Judge equivalence verdict (0 or 1).
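Before uploading, it may help to sanity-check your file against the required fields. The sketch below is an unofficial pre-flight check based on the schema above; it validates an already-loaded submission dict rather than enforcing anything the review process does.

```python
# Unofficial pre-flight check for the submission schema described above.
REQUIRED_PRED_FIELDS = {"id", "messages", "predicted_answer", "llm_equal"}

def validate_submission(data: dict) -> list:
    """Return a list of problems found in a submission dict (empty = looks valid)."""
    problems = []
    # meta.method_name is the one required meta field
    if "method_name" not in data.get("meta", {}):
        problems.append("missing meta.method_name")
    for i, pred in enumerate(data.get("predictions", [])):
        missing = REQUIRED_PRED_FIELDS - pred.keys()
        if missing:
            problems.append(f"prediction {i}: missing {sorted(missing)}")
        elif not pred["messages"]:
            # the full interaction trajectory is required for review
            problems.append(f"prediction {i}: empty messages trace")
    return problems
```

Running this on your parsed `.json` (e.g. `validate_submission(json.load(open("submission.json")))`) before uploading can catch missing fields early.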