MMDocBench

Benchmarking Large Vision-Language Models (LVLMs) for
Fine-Grained Visual Document Understanding

Fengbin ZHU¹, Ziyang LIU², Xiang Yao NG², Haohui WU¹,
Wenjie WANG¹, Fuli FENG³, Chao WANG², Huanbo LUAN², Tat-Seng CHUA¹
¹National University of Singapore
²6Estates Pte Ltd, Singapore
³University of Science and Technology of China

Overview of MMDocBench, which is designed to holistically assess the fine-grained visual perception and reasoning abilities of Large Vision-Language Models (LVLMs) through various OCR-free document understanding tasks.

Introduction

To holistically assess the fine-grained visual understanding capabilities of Large Vision-Language Models (LVLMs), we propose using document images, which carry multi-granularity and multi-modal information, to supplement natural images. To this end, we construct MMDocBench, a benchmark comprising various OCR-free document understanding tasks for evaluating fine-grained visual perception and reasoning abilities. MMDocBench defines 15 main tasks and 48 sub-tasks, covering diverse document images such as research papers, receipts, financial reports, Wikipedia tables, charts, and infographics.

In addition, MMDocBench provides annotations of supporting regions (i.e., bounding boxes) within the image for each QA pair. The supporting regions enable evaluating whether LVLMs have correctly grounded their predictions in the associated regions of the image, leading to a more comprehensive evaluation. Outputting supporting regions also offers significant practical value, making LVLMs' responses more informative and interpretable while allowing rapid cross-checking between the answer and the image. In total, MMDocBench contains 2,400 document images, with 4,338 QA pairs and 11,353 supporting regions.
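
To make the annotations concrete, the snippet below sketches what one QA instance with supporting regions might look like as a Python dict. The field names, example values, and the (x1, y1, x2, y2) pixel-coordinate box format are purely illustrative assumptions, not the official MMDocBench schema.

    # Hypothetical sketch of a single QA instance with supporting regions.
    # Field names, values, and the (x1, y1, x2, y2) pixel box format are
    # illustrative assumptions, not the official MMDocBench schema.
    example_instance = {
        "image": "financial_report_0042.png",   # hypothetical file name
        "question": "What is the total revenue reported for 2022?",
        "answer": "$1,254 million",
        "supporting_regions": [
            (312, 480, 455, 502),   # box around the revenue value
            (90, 480, 210, 502),    # box around the row header
        ],
    }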

Based on MMDocBench, we conduct extensive experiments with 13 open-source and 3 proprietary advanced LVLMs, assessing their strengths and weaknesses across different tasks and document image types. We believe MMDocBench enables a thorough, multi-faceted evaluation of LVLMs' fine-grained visual document understanding, thereby facilitating their future advancement.

MMDocBench Dataset

Overview

Taxonomy and statistics of MMDocBench.

Supporting Regions Analysis

The position distribution of all supporting regions in MMDocBench.

The area distribution (after applying a logarithmic transformation with a base of 10) of all supporting regions in MMDocBench.

Experimental Results

Leaderboard

EM: Exact Match; IoU: Intersection over Union
All: Overall; VP: Fine-Grained Visual Perception; VR: Fine-Grained Visual Reasoning
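
As a minimal sketch of how the two metrics named above can be computed, the Python below implements Exact Match over normalized answer strings and IoU over (x1, y1, x2, y2) boxes. The string normalization details are assumptions and may differ from the official evaluation script.

    def exact_match(pred: str, gold: str) -> float:
        # Exact Match: 1.0 if the normalized strings are identical, else 0.0.
        # Lowercasing and whitespace collapsing are assumptions here.
        normalize = lambda s: " ".join(s.lower().split())
        return float(normalize(pred) == normalize(gold))

    def iou(box_a, box_b) -> float:
        # Intersection over Union for two boxes given as (x1, y1, x2, y2).
        ax1, ay1, ax2, ay2 = box_a
        bx1, by1, bx2, by2 = box_b
        ix1, iy1 = max(ax1, bx1), max(ay1, by1)   # intersection rectangle
        ix2, iy2 = min(ax2, bx2), min(ay2, by2)
        inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
        union = ((ax2 - ax1) * (ay2 - ay1)
                 + (bx2 - bx1) * (by2 - by1) - inter)
        return inter / union if union > 0 else 0.0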

Model Performance Analysis on Different Tasks

Answer prediction performance comparison of LVLMs across all main tasks, measured by F1 score.

Region prediction performance comparison of LVLMs across all main tasks, measured by IoU score.
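
The F1 score above is typically computed at the token level for QA-style answers. The following is a hedged sketch under that assumption (whitespace tokenization, lowercasing), which may differ from the benchmark's official scorer.

    from collections import Counter

    def token_f1(pred: str, gold: str) -> float:
        # Token-level F1 between predicted and gold answers. Whitespace
        # tokenization and lowercasing are assumptions; the benchmark's
        # official scorer may normalize differently.
        pred_tokens = pred.lower().split()
        gold_tokens = gold.lower().split()
        if not pred_tokens or not gold_tokens:
            return float(pred_tokens == gold_tokens)
        overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
        if overlap == 0:
            return 0.0
        precision = overlap / len(pred_tokens)
        recall = overlap / len(gold_tokens)
        return 2 * precision * recall / (precision + recall)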

Citation


      @misc{zhu2024mmdocbench,
        title={MMDocBench: Benchmarking Large Vision-Language Models for Fine-Grained Visual Document Understanding}, 
        author={Fengbin Zhu and Ziyang Liu and Xiang Yao Ng and Haohui Wu and Wenjie Wang and Fuli Feng and Chao Wang and Huanbo Luan and Tat-Seng Chua},
        year={2024},
        eprint={2410.21311},
        archivePrefix={arXiv},
        primaryClass={cs.CV},
        url={https://arxiv.org/abs/2410.21311}, 
      }