Leopard: A Vision Language Model For Text-Rich Multi-Image Tasks

Mengzhao Jia, Wenhao Yu, Kaixin Ma, Tianqing Fang, Zhihan Zhang, Siru Ouyang, Hongming Zhang, Dong Yu, Meng Jiang

TL;DR

Leopard addresses the challenge of text-rich multi-image reasoning by combining a large-scale instruction-tuning dataset (Leopard-instruct) with an adaptive high-resolution encoding module that allocates visual sequence length across multiple images. The approach delivers state-of-the-art or competitive performance on 12 text-rich multi-image benchmarks, while remaining strong on single-image and general vision-language tasks, and is fully open-sourced. Key contributions include dataset construction and the adaptive encoding technique, enabling efficient high-resolution processing without excessive sequence lengths. The work has practical impact for real-world tasks like multi-page documents, slides, and web content.

Abstract

Text-rich images, where text serves as the central visual element guiding the overall understanding, are prevalent in real-world applications, such as presentation slides, scanned documents, and webpage snapshots. Tasks involving multiple text-rich images are especially challenging, as they require not only understanding the content of individual images but also reasoning about inter-relationships and logical flows across multiple visual inputs. Despite the importance of these scenarios, current multimodal large language models (MLLMs) struggle to handle such tasks due to two key challenges: (1) the scarcity of high-quality instruction-tuning datasets for text-rich multi-image scenarios, and (2) the difficulty of balancing image resolution with visual feature sequence length. To address these challenges, we propose Leopard, an MLLM tailored for handling vision-language tasks involving multiple text-rich images. First, we curated about one million high-quality multimodal instruction-tuning instances, tailored to text-rich, multi-image scenarios. Second, we propose an adaptive high-resolution multi-image encoding module that dynamically optimizes the allocation of visual sequence length based on the original aspect ratios and resolutions of the images. Experiments on a diverse set of benchmarks reveal that our model consistently outperforms state-of-the-art systems, such as Llama-3.2 and Qwen2-VL, in challenging text-rich, multi-image evaluations. Remarkably, our approach achieves outstanding performance using only 1.2M training instances, all of which are fully open-sourced, demonstrating both high efficiency and effectiveness compared to models trained on large-scale in-house data. Our code and data are available at https://github.com/tencent-ailab/Leopard.

Paper Structure

This paper contains 21 sections, 2 equations, 9 figures, 8 tables.

Figures (9)

  • Figure 1: Left: A demonstration of a text-rich multi-image task, where models must reason across multiple images to answer correctly. Leopard generates the correct answer, while baselines fail. Right: Leopard outperforms three baselines on text-rich multi-image benchmarks by a large margin, while maintaining comparable performance on single-image and general evaluations.
  • Figure 2: The overall model pipeline. Given ① raw image inputs, ② we first compute the optimal allocation of sub-image numbers and splitting strategy for all images based on their resolution and aspect ratio. ③ The images undergo padding, resizing, and splitting operations. ④ Both sub-images and resized original images are then encoded into a sequence of visual features. These sequences subsequently undergo a pixel shuffle operation that concatenates every four features. ⑤ The visual features are projected into the language embedding space via a vision-language connector. Finally, the large language model integrates these visual and language embeddings to generate responses.
  • Figure 3: Impact of the sub-image budget $M$ on the resulting model across four benchmarks. w/o indicates no partitioning into sub-images.
  • Figure 4: An illustration of the proportion of sub-datasets and domains in the proposed dataset.
  • Figure 5: The prompt used for generating Q-A pairs with rationales for slide decks data.
  • ...and 4 more figures
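
Steps ② and ④ of the Figure 2 caption, together with the sub-image budget $M$ from Figure 3, can be illustrated with a rough sketch. This is not the authors' implementation: the tile size of 364 pixels, the default budget of 50, the proportional scaling rule, the trimming loop, and the choice of 2x2 spatial neighbours as the "four features" are all assumptions made for illustration, and the function names are hypothetical.

```python
import math


def allocate_sub_images(sizes, tile=364, budget=50):
    """Return a sub-image count per image so that the total fits the budget.

    sizes: list of (width, height) in pixels.
    tile and budget are illustrative defaults, not values from the paper.
    """
    # Ideal count: enough tiles to cover each image at native resolution.
    ideal = [math.ceil(w / tile) * math.ceil(h / tile) for w, h in sizes]
    total = sum(ideal)
    if total <= budget:
        return ideal
    # Over budget: scale every image's share down proportionally (assumed
    # rule), keeping at least one sub-image per image.
    counts = [max(1, n * budget // total) for n in ideal]
    # The floor of one per image can overshoot; trim the largest shares.
    while sum(counts) > budget and max(counts) > 1:
        counts[counts.index(max(counts))] -= 1
    return counts


def pixel_shuffle_2x2(grid):
    """Concatenate each 2x2 block of feature vectors into a single vector.

    grid: list of rows, each a list of feature vectors (lists of floats).
    Height and width are assumed to be even. The sequence length shrinks
    by a factor of four while the feature dimension grows by four.
    """
    merged = []
    for r in range(0, len(grid), 2):
        row = []
        for c in range(0, len(grid[0]), 2):
            row.append(
                grid[r][c] + grid[r][c + 1] + grid[r + 1][c] + grid[r + 1][c + 1]
            )
        merged.append(row)
    return merged


# Two small pages and one large scan compete for a budget of 50 sub-images.
pages = [(728, 364), (364, 364), (3640, 3640)]
print(allocate_sub_images(pages))

# A 2x4 grid of one-dimensional features becomes a 1x2 grid of 4-d features.
features = [[[0.0], [1.0], [2.0], [3.0]], [[4.0], [5.0], [6.0], [7.0]]]
print(pixel_shuffle_2x2(features))
```

The point of the sketch is the trade-off the abstract describes: large, dense images receive most of the budget, small images keep at least one sub-image, and the pixel shuffle then cuts the number of visual tokens per sub-image by four before they reach the language model.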