CMMMU

A Chinese Massive Multi-discipline Multimodal Understanding Benchmark


Overview of the CMMMU dataset: It is distinguished by its comprehensiveness, encompassing 12K college-level problems across six broad disciplines and 30 college subjects.

🔔News

🔥[2024-04-26]: Thanks to the support of the VLMEvalKit team, everyone can now use VLMEvalKit to conduct evaluations easily!

🔥[2024-03-14]: Thanks to the support of the lmms-eval team, everyone can now use lmms-eval to conduct evaluations easily!

🌟[2024-03-06]: Our evaluation server for the test set is now available on EvalAI. We welcome all submissions and look forward to your participation! 😆

Introduction

As the capabilities of large multimodal models (LMMs) continue to advance, the need to evaluate their performance grows accordingly. There is an even larger gap in evaluating the advanced knowledge and reasoning abilities of LMMs in a Chinese context. We introduce CMMMU, a new Chinese Massive Multi-discipline Multimodal Understanding benchmark designed to evaluate LMMs on tasks demanding college-level subject knowledge and deliberate reasoning in a Chinese context. CMMMU includes 12K manually collected multimodal questions from college exams, quizzes, and textbooks, covering six core disciplines: Art & Design, Business, Science, Health & Medicine, Humanities & Social Science, and Tech & Engineering, like its companion MMMU. These questions span 30 subjects and comprise 39 highly heterogeneous image types, such as charts, diagrams, maps, tables, music sheets, and chemical structures. Like MMMU, CMMMU focuses on complex perception and reasoning with domain-specific knowledge. We evaluate 11 open-source LLMs and one proprietary GPT-4V(ision). Even GPT-4V achieves an accuracy of only 42%, indicating large room for improvement.

CMMMU Benchmark

Overview

We present the Chinese Massive Multi-discipline Multimodal Understanding and Reasoning Benchmark (CMMMU), an innovative benchmark specifically developed to test the capabilities of Large Multimodal Models (LMMs) in understanding and reasoning across multiple disciplines in the Chinese context. This benchmark encompasses a wide range of subjects, including Art & Design, Business, Science, Health & Medicine, Humanities & Social Science, and Tech & Engineering, representing a comprehensive evaluation platform. The questions, totaling 12,000, are meticulously gathered from a variety of Chinese college-level sources, such as exams, quizzes, and textbooks. Each question is further categorized into detailed subfields and image types, providing a deep analysis of the types of questions that pose the most challenge to LMMs.
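To make this categorization concrete, the sketch below shows what a single question record might look like when handled programmatically. All field names and values here are illustrative assumptions rather than the official data schema; consult the dataset release for the exact format.

    # A minimal sketch (illustrative only) of one CMMMU question record.
    # Field names and values are assumptions, not the official schema.
    example_record = {
        "id": "art_design_0001",                 # hypothetical identifier
        "discipline": "Art & Design",            # one of the six disciplines
        "subject": "Fine Arts",                  # one of the 30 college subjects
        "question_type": "multiple-choice",      # multiple-choice / true-or-false / fill-in-the-blank
        "question": "下图所示的画作属于哪个流派？",   # "Which movement does the painting below belong to?"
        "options": ["A. 印象派", "B. 立体主义", "C. 巴洛克", "D. 浮世绘"],
        "image": "images/art_design_0001.png",   # path to the associated figure
        "image_type": "painting",                # one of the 39 heterogeneous image types
        "difficulty": "medium",                  # easy / medium / hard
        "answer": "A",
    }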


CMMMU is structured to rigorously assess three crucial skills of LMMs: perception, domain-specific knowledge, and reasoning, in a bilingual context. The primary objective is to examine how these models perceive and interpret information across different modalities, and more importantly, how they apply reasoning combined with subject-specific knowledge to arrive at solutions. This endeavor addresses the need for benchmarks in guiding the development of bilingual LMMs towards achieving expert-level performance in artificial intelligence.

A key aspect of CMMMU is its focus on highlighting the challenges faced by multimodal foundation models, particularly in the Chinese language setting. These challenges include expert-level visual perception and sophisticated reasoning with domain-specific knowledge. The tasks in CMMMU demand not only the processing of diverse image types but also require models to demonstrate proficiency in integrating multimodal analysis with domain-specific expertise. This benchmark goes far beyond basic visual recognition, underscoring the need for advanced approaches in LMM development, bridging the gap identified in existing models like GPT-4V and Qwen-VL-Plus. CMMMU sets a new standard in evaluating LMMs, offering a pathway towards developing truly expert-level bilingual artificial intelligence systems.

Comparisons with Existing Benchmarks

To further distinguish our dataset from existing ones, we elaborate on the benchmark details in the comparison figure. From the breadth perspective, prior benchmarks focus heavily on daily knowledge and common sense, and the image formats they cover are limited. Our benchmark aims to cover college-level knowledge across 39 image formats, including diagrams, tables, charts, chemical structures, photos, paintings, geometric shapes, music sheets, and medical images. From the depth perspective, previous benchmarks normally require commonsense knowledge or simple physical or temporal reasoning. In contrast, our benchmark requires deliberate reasoning with college-level subject knowledge.


Sampled CMMMU examples from each discipline. Understanding and reasoning over these questions and images requires expert-level knowledge.

Statistics

Experiment Results

Leaderboard

We evaluate various models, including LLMs and LMMs, considering both closed- and open-source models in each category. Our evaluation is conducted under a zero-shot setting to assess the models' ability to generate accurate answers without fine-tuning or few-shot demonstrations on our benchmark. For all models, we use the default prompt provided by each model for multiple-choice or open QA, if available. If a model does not provide prompts for the task types in CMMMU, we conduct prompt engineering on the validation set and use the most effective prompt for the subsequent zero-shot experiments.
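As an illustration of this zero-shot protocol, the Python sketch below assembles a multiple-choice prompt, queries a model through a placeholder query_model callable, and extracts the predicted option letter before scoring. The prompt wording and helper names are hypothetical stand-ins, not the exact prompts used for the leaderboard.

    import re

    def build_mc_prompt(question: str, options: list[str]) -> str:
        # Assemble a zero-shot multiple-choice prompt (illustrative wording,
        # not the exact template used for CMMMU).
        instruction = "请直接回答选项字母。"  # "Answer with the option letter only."
        return "\n".join([question, *options, instruction])

    def extract_choice(response: str, letters: str = "ABCD") -> str | None:
        # Pull the first option letter out of a free-form model response.
        match = re.search(f"[{letters}]", response.upper())
        return match.group(0) if match else None

    def evaluate(samples, query_model) -> float:
        # Zero-shot accuracy over multiple-choice samples. query_model(prompt, image)
        # is a hypothetical callable wrapping whichever LMM is being tested.
        correct = 0
        for sample in samples:
            prompt = build_mc_prompt(sample["question"], sample["options"])
            response = query_model(prompt, sample["image"])
            if extract_choice(response) == sample["answer"]:
                correct += 1
        return correct / len(samples)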

| Model | Test Overall | Validation Overall | Art & Design | Business | Science | Health & Medicine | Human & Social Sci. | Tech & Eng. |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| GPT-4o (20240513) | 53.1 | 52.2 | 69.6 | 36.3 | 40.9 | 46.8 | 44.2 | 41.5 |
| GPT-4V | 43.7 | 42.5 | 61.0 | 36.3 | 40.9 | 46.8 | 44.2 | 41.5 |
| Marco-VL-Plus | 40.6 | 43.4 | 66.7 | 21.9 | 36.8 | 46.2 | 47.1 | 38.3 |
| Qwen-VL-Plus | 36.8 | 39.5 | 61.5 | 23.2 | 32.8 | 40.5 | 43.4 | 33.3 |
| Yi-VL-34B | 36.5 | 36.2 | 62.9 | 19.1 | 31.5 | 42.1 | 42.5 | 34.5 |
| Weitu-VL-1.0-15B | 35.3 | 36.0 | 55.5 | 19.4 | 30.9 | 42.4 | 39.7 | 33.8 |
| Yi-VL-6B | 35.0 | 35.8 | 58.0 | 19.9 | 32.3 | 39.3 | 40.6 | 32.1 |
| InternVL-Chat-V1.1 | 34.0 | 34.7 | 56.7 | 19.7 | 28.6 | 39.2 | 39.6 | 32.3 |
| Qwen-VL-Chat | 31.3 | 30.7 | 52.6 | 18.5 | 26.9 | 33.4 | 34.1 | 31.4 |
| SPHINX-MoE | 29.5 | 29.3 | 41.7 | 20.3 | 27.8 | 28.9 | 31.8 | 30.9 |
| InternVL-Chat-ViT-6B-Vicuna-7B | 26.7 | 26.4 | 39.7 | 13.8 | 23.0 | 31.7 | 26.5 | 28.5 |
| InternVL-Chat-ViT-6B-Vicuna-13B | 26.1 | 27.4 | 38.5 | 13.9 | 22.1 | 30.2 | 29.8 | 27.5 |
| Emu2-Chat | 24.5 | 23.8 | 35.3 | 11.7 | 22.1 | 25.5 | 28.0 | 27.1 |
| CogAgent-Chat | 23.6 | 24.6 | 33.8 | 14.1 | 20.6 | 26.3 | 24.8 | 25.3 |
| Chinese-LLaVa | 23.4 | 25.5 | 34.4 | 11.7 | 21.6 | 25.5 | 26.3 | 24.7 |
| VisCPM | 22.7 | 25.2 | 37.7 | 11.3 | 19.1 | 26.1 | 24.0 | 23.7 |
| mPLUG-Owl2 | 22.2 | 20.8 | 30.4 | 13.3 | 19.6 | 25.2 | 24.7 | 23.4 |
| Yi-6B + OCR | 26.8 | 28.4 | 33.4 | 16.9 | 24.8 | 32.3 | 33.2 | 25.5 |
| Qwen-7B + OCR | 26.1 | 27.0 | 44.6 | 14.3 | 22.1 | 29.3 | 29.8 | 25.4 |
| Qwen-7B | 25.1 | 24.7 | 43.8 | 12.6 | 20.7 | 30.5 | 26.9 | 24.5 |
| Baichuan-7B + OCR | 24.7 | 25.3 | 40.2 | 15.2 | 21.0 | 27.9 | 30.7 | 22.8 |
| Baichuan-7B | 24.3 | 26.0 | 42.7 | 12.6 | 19.6 | 28.0 | 27.8 | 23.9 |
| Yi-6B | 24.2 | 25.6 | 26.3 | 15.0 | 23.4 | 29.1 | 27.0 | 24.7 |
| DeepSeek-7B + OCR | 23.2 | 25.2 | 41.2 | 13.2 | 19.4 | 26.1 | 26.5 | 21.8 |
| DeepSeek-7B | 21.9 | 22.3 | 41.3 | 11.2 | 18.3 | 23.5 | 24.7 | 21.3 |
| Frequent Choice | 26.0 | 24.1 | 36.2 | 11.8 | 23.9 | 30.2 | 28.5 | 27.7 |
| Random Choice | 21.6 | 21.6 | 32.9 | 9.1 | 18.8 | 23.8 | 23.8 | 23.9 |

Overall results of different models on the CMMMU test set. On the leaderboard, the best-performing model is shown in bold, and the best open-source model is underlined.

Different Question Type

We decompose the results across question types, as shown in the table below. We notice that Qwen-VL-Plus does not perform well on true-or-false questions, indicating that it may not understand the prompt for answering them; fixing this might be a free lunch for improving its performance on CMMMU. We further point out that the disparity among the Yi-VL series, Qwen-VL-Plus, and GPT-4V mainly stems from their differing capacity to answer multiple-choice questions.

Result decomposition across question types.
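One way to test this "free lunch" hypothesis is to normalize free-form responses into true/false labels before scoring, so a model that understands the question but ignores the required output format is not penalized. The sketch below is an illustrative normalizer with assumed keyword lists, not the official CMMMU answer-matching rule.

    # Illustrative true-or-false answer normalizer; the marker lists are
    # assumptions, not the official CMMMU answer-matching rules.
    FALSE_MARKERS = ("不正确", "不对", "错误", "错", "false")  # checked first so negated
    TRUE_MARKERS = ("正确", "对", "true")                      # forms are not misread as true

    def normalize_true_false(response: str) -> str | None:
        # Map a free-form response to "对" (true), "错" (false), or None if ambiguous.
        text = response.strip().lower()
        if any(marker in text for marker in FALSE_MARKERS):
            return "错"
        if any(marker in text for marker in TRUE_MARKERS):
            return "对"
        return None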

Different Difficulty Levels

We decompose the results across question difficulty levels, as shown in the table below. Notably, the gap between the best open-source LMM, i.e., Yi-VL-34B, and GPT-4V widens on medium and hard questions. This is further strong evidence that the key disparity between open-source LMMs and GPT-4V lies in the capacity to calculate and reason under complex conditions.

Result decomposition across question difficulty levels. Bold results indicate the best results among all models, and blue results indicate the best results among the open-source models.
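The same decomposition can be reproduced from per-question evaluation logs with a simple group-by, as in the pandas sketch below; the column names and the toy rows are assumptions standing in for real evaluation outputs.

    import pandas as pd

    # Hypothetical per-question results, one row per (model, question) pair;
    # in practice these rows would come from the evaluation logs.
    results = pd.DataFrame([
        {"model": "GPT-4V",    "difficulty": "easy", "question_type": "multiple-choice", "correct": 1},
        {"model": "GPT-4V",    "difficulty": "hard", "question_type": "true-or-false",   "correct": 0},
        {"model": "Yi-VL-34B", "difficulty": "easy", "question_type": "multiple-choice", "correct": 1},
        {"model": "Yi-VL-34B", "difficulty": "hard", "question_type": "multiple-choice", "correct": 0},
    ])

    # Accuracy by difficulty level; swap "difficulty" for "question_type" to
    # reproduce the per-question-type decomposition above.
    accuracy = (
        results.groupby(["model", "difficulty"])["correct"]
               .mean()
               .unstack("difficulty")
    )
    print(accuracy)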

Error Analysis

We conduct a thorough investigation into the shortcomings of GPT-4V through a critical evaluation of its performance. Our focus is on a set of 150 errors randomly selected from the model's outputs, scrutinized by a team of skilled annotators. These experts have dissected each instance, identifying the underlying reasons for the inaccuracies, relying on their expertise and available reference explanations.

Error Examples

Correct Examples

BibTeX


      @article{zhang2024cmmmu,
        title={CMMMU: A Chinese Massive Multi-discipline Multimodal Understanding Benchmark},
        author={Zhang, Ge and Du, Xinrun and Chen, Bei and Liang, Yiming and Luo, Tongxu and Zheng, Tianyu and Zhu, Kang and Cheng, Yuyang and Xu, Chunpu and Guo, Shuyue and Zhang, Haoran and Qu, Xingwei and Wang, Junjie and Yuan, Ruibin and Li, Yizhi and Wang, Zekun and Liu, Yudong and Tsai, Yu-Hsuan and Zhang, Fengji and Lin, Chenghua and Huang, Wenhao and Fu, Jie},
        publisher={GitHub},
        journal={GitHub repository},
        howpublished={https://github.com/CMMMU-Benchmark/CMMMU},
        year={2024}
      }