OmniGAIA: A Benchmark for Omni-Modal AI Agents

Xiaoxi Li¹˒², Wenxiang Jiao², Jiarui Jin², Shijian Wang³, Guanting Dong¹, Jiajie Jin¹, Hao Wang⁴,
Yinuo Wang⁵, Ji-Rong Wen¹, Yuan Lu², Zhicheng Dou¹†

¹Renmin University of China · ²Xiaohongshu Inc. · ³Southeast University · ⁴Zhejiang University · ⁵Tsinghua University

🏆 OmniGAIA Leaderboard

Track and compare models on the omni-modal agent tasks

💡 What is OmniGAIA?

OmniGAIA is an advanced benchmark designed to evaluate native omni-modal AI agents capable of joint reasoning across vision, audio, and language. It features 360 open-form tasks across 9 real-world domains, tested under two primary settings: video with audio and image with audio.

Unlike standard perception-based Q&A, OmniGAIA tasks demand complex multi-hop reasoning and tool-integrated problem solving (e.g., executing web searches, browsing, and code execution).

📖 How to Read the Leaderboard

The leaderboard ranks models by their Overall score on the official test split, with all values reported as percentages (higher is better). Key columns include:

  • Overall: The average score across all 360 tasks.
  • Difficulty: Scores broken down by task complexity (Easy, Medium, Hard).
  • Category: Scores broken down by domain (Geography, Technology, History, Finance, Sports, Art, Movies, Science, and Food).

📊 Evaluation Metrics

The leaderboard uses Pass@1 accuracy (%) as the primary metric. Task-level correctness is determined by a two-stage evaluation:

  • Stage 1 (Exact Match): The system extracts the text between <answer> and </answer> tags in the predicted output. If this text exactly matches the labeled answer, the prediction is marked correct.
  • Stage 2 (LLM-as-a-Judge): If exact match fails, an LLM judge (recorded as llm_equal) assesses whether the extracted text is semantically equivalent to the labeled answer.

📬 How to Submit

To submit your model's results, upload a .json file containing both a meta block and a predictions list in the designated Submit section.

  • Required fields: The file must include meta.method_name, and every prediction entry must contain id, messages, predicted_answer, and llm_equal.
  • Prediction trace: Include the complete interaction trajectory in the messages field so submissions can be reviewed transparently.

We will manually review and publish valid submissions.

Leaderboard

Model rankings on the official test split. Use search to filter by model name. Scores are percentages (higher is better).

| Rank | Model / System | Organization | Date | Overall | Easy | Med. | Hard | Geo. | Tech. | Hist. | Fin. | Sport | Art | Movie | Sci. | Food |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | Gemini-3-Pro | Google | 2025.12 | 62.5 | 78.7 | 61.9 | 38.5 | 65.2 | 59.2 | 62.1 | 72.0 | 78.4 | 52.8 | 48.5 | 42.3 | 88.9 |
| 2 | Gemini-3-Flash | Google | 2025.12 | 51.7 | 67.2 | 46.9 | 37.2 | 50.7 | 57.1 | 44.8 | 48.0 | 59.5 | 55.6 | 54.6 | 38.5 | 61.1 |
| 3 | Gemini-2.5-Pro | Google | 2025.3 | 30.8 | 41.8 | 26.9 | 21.8 | 23.2 | 28.6 | 32.8 | 20.0 | 32.4 | 41.7 | 42.4 | 26.9 | 33.3 |
| 4 | OmniAtlas-Qwen-3-30B | RUC-NLPIR | 2026.2 | 20.8 | 31.1 | 18.8 | 9.0 | 10.1 | 30.6 | 29.9 | 32.0 | 18.9 | 16.7 | 12.1 | 11.5 | 27.8 |
| 5 | Qwen-3-Omni-30B | Alibaba | 2025.9 | 13.3 | 19.7 | 10.6 | 9.0 | 8.7 | 14.3 | 11.9 | 28.0 | 10.8 | 13.9 | 9.1 | 15.4 | 22.2 |
| 6 | OmniAtlas-Qwen-2.5-7B | RUC-NLPIR | 2026.2 | 13.3 | 22.1 | 11.3 | 3.9 | 8.7 | 18.4 | 16.4 | 4.0 | 16.2 | 22.2 | 3.0 | 7.7 | 22.2 |
| 7 | LongCat-Flash-Omni-560B | Inclusion AI | 2025.9 | 11.1 | 16.4 | 9.4 | 6.4 | 8.7 | 10.2 | 16.4 | 12.0 | 10.8 | 8.3 | 6.1 | 11.5 | 16.7 |
| 8 | OmniAtlas-Qwen-2.5-3B | RUC-NLPIR | 2026.2 | 10.3 | 13.9 | 10.0 | 5.1 | 4.4 | 12.2 | 16.7 | 4.0 | 16.2 | 11.1 | 3.0 | 11.5 | 11.1 |
| 9 | Gemini-2.5-Flash-Lite | Google | 2025.3 | 8.6 | 9.8 | 8.1 | 7.7 | 5.8 | 8.2 | 14.9 | 4.0 | 10.8 | 8.3 | 6.1 | 3.9 | 11.1 |
| 10 | Ming-Flash-Omni-100B | Inclusion AI | 2025.8 | 8.3 | 12.3 | 7.5 | 3.8 | 5.8 | 8.2 | 10.4 | 12.0 | 8.1 | 5.6 | 6.1 | 11.5 | 11.1 |
| 11 | Ming-Lite-Omni-1.5-20B | Inclusion AI | 2025.5 | 3.9 | 4.9 | 3.8 | 2.6 | 2.9 | 6.1 | 1.5 | 4.0 | 5.4 | 2.8 | 6.1 | 7.7 | 5.6 |
| 12 | Qwen-2.5-Omni-7B | Alibaba | 2025.4 | 3.6 | 8.2 | 1.3 | 1.3 | 1.5 | 4.1 | 7.5 | 4.0 | 0.0 | 2.8 | 0.0 | 7.7 | 5.6 |
| 13 | MiniCPM-O-2.6-8B | OpenBMB | 2025.1 | 3.1 | 3.3 | 2.5 | 3.8 | 2.9 | 2.0 | 1.5 | 0.0 | 2.7 | 8.3 | 3.0 | 3.8 | 5.6 |
| 14 | Baichuan-Omni-1.5-8B | Baichuan AI | 2025.3 | 2.8 | 4.9 | 2.5 | 0.0 | 2.9 | 4.1 | 3.0 | 4.0 | 2.7 | 0.0 | 3.0 | 3.8 | 0.0 |
| 15 | Qwen-2.5-Omni-3B | Alibaba | 2025.4 | 1.4 | 1.6 | 1.9 | 0.0 | 0.0 | 2.0 | 4.5 | 0.0 | 0.0 | 0.0 | 0.0 | 3.9 | 0.0 |

📬 Submit via Upload

Upload a .json file containing your method metadata and predictions. We will review your submission, run evaluation, and merge results into the leaderboard.

  1. Upload: Submit your JSON file below.
  2. Review: We review your submission.
  3. Publish: Scores are evaluated and published.

JSON Format

{
  "meta": {
    "method_name": "My-Agent",
    "organization": "My-Org",
    "project_url": "https://..."
  },
  "predictions": [
    {
      "id": "1",
      "messages": [
        {"role": "system", "content": "..."},
        {"role": "user", "content": "..."},
        {"role": "assistant", "content": "..."}
      ],
      "predicted_answer": "final answer",
      "llm_equal": "0/1"
    },
    ...
  ]
}

Field Descriptions

  • meta.method_name: Display name shown on the leaderboard.
  • meta.organization: Your team or organization name.
  • meta.project_url: Link to paper, code, or project page.
  • predictions: A non-empty list of task objects.
  • predictions[i].id: Task ID.
  • predictions[i].messages: A non-empty message list containing the model's full execution process.
  • predictions[i].predicted_answer: Final answer for this task.
  • predictions[i].llm_equal: LLM-as-judge result/flag for this task.
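Before uploading, it can help to sanity-check your file against these requirements. The sketch below is an unofficial client-side check (validate_submission is a hypothetical helper, not part of the OmniGAIA tooling) that verifies the required fields listed above:

```python
import json

# Keys every prediction entry must contain, per the submission rules.
REQUIRED_PREDICTION_KEYS = {"id", "messages", "predicted_answer", "llm_equal"}

def validate_submission(path: str) -> list[str]:
    """Return a list of problems found in a submission file (empty = looks OK)."""
    errors = []
    with open(path) as f:
        data = json.load(f)

    # meta.method_name is required for the leaderboard display name.
    if not data.get("meta", {}).get("method_name"):
        errors.append("meta.method_name is required")

    preds = data.get("predictions")
    if not preds:
        errors.append("predictions must be a non-empty list")
        return errors

    for i, p in enumerate(preds):
        missing = REQUIRED_PREDICTION_KEYS - set(p)
        if missing:
            errors.append(f"predictions[{i}] missing keys: {sorted(missing)}")
        # The full interaction trajectory must be present for review.
        if not p.get("messages"):
            errors.append(f"predictions[{i}].messages must be non-empty")
    return errors
```

Running it on your .json before uploading catches the structural problems that would otherwise fail manual review; it does not check whether the llm_equal flags are accurate.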