VisionPangu: A Compact and Fine-Grained Multimodal Assistant with 1.7B Parameters

Jiaxin Fan; Wenpo Song

TL;DR

VisionPangu is a compact 1.7B-parameter multimodal model for detailed image captioning. It combines efficient multimodal alignment with high-quality supervision to improve semantic coherence and descriptive richness without relying on aggressive model scaling.

Abstract

Large Multimodal Models (LMMs) have achieved strong performance in vision-language understanding, yet many existing approaches rely on large-scale architectures and coarse supervision, which limits their ability to generate detailed image captions. In this work, we present VisionPangu, a compact 1.7B-parameter multimodal model designed to improve detailed image captioning through efficient multimodal alignment and high-quality supervision. Our model combines an InternVL-derived vision encoder with the OpenPangu-Embedded language backbone via a lightweight MLP projector and adopts an instruction-tuning pipeline inspired by LLaVA. By incorporating dense human-authored descriptions from the DOCCI dataset, VisionPangu improves semantic coherence and descriptive richness without relying on aggressive model scaling. Experimental results demonstrate that compact multimodal models can achieve competitive performance while producing more structured and detailed captions. The code and model weights will be publicly available at https://www.modelscope.cn/models/asdfgh007/visionpangu.

Paper Structure (14 sections, 3 equations, 1 figure, 2 tables)

Introduction
Related Work
Vision-Language Pretraining
Large Multimodal Models
Dense Image Description and Captioning
Methodology
Model Architecture
Training Strategy
Experiments
Experimental Setup
Benchmark Evaluation
Captioning Evaluation
Captioning Benchmark Comparison
Conclusion

Figures (1)

Figure 1: Overview of VisionPangu. The InternVL-derived vision encoder extracts visual tokens $Z_v$, which are projected into the language embedding space via a lightweight MLP projector and processed by the OpenPangu-Embedded-1B language model to generate detailed image captions.
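
The data flow in Figure 1 can be illustrated with a small, dependency-free sketch. It shows only the shape of the computation: visual tokens $Z_v$ pass through an MLP projector into the language embedding space and are then combined with text embeddings. Everything beyond that is an assumption made for illustration and is not taken from the paper: the two-layer design with a GELU activation (borrowed from LLaVA-style projectors), the toy dimensions, the random weights, the prepending of visual tokens to text tokens, and all function and variable names.

```python
import math
import random


def gelu(x: float) -> float:
    # Exact GELU, written with the Gaussian error function.
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def init_linear(d_in: int, d_out: int, rng: random.Random):
    # Small random weights and zero bias; placeholders, not trained values.
    weight = [[rng.gauss(0.0, 0.02) for _ in range(d_out)] for _ in range(d_in)]
    bias = [0.0] * d_out
    return weight, bias


def linear(tokens, weight, bias):
    # tokens: [n, d_in], weight: [d_in, d_out], bias: [d_out] -> [n, d_out]
    return [
        [sum(t[i] * weight[i][j] for i in range(len(t))) + bias[j]
         for j in range(len(bias))]
        for t in tokens
    ]


def project(z_v, params):
    # Assumed projector: Linear -> GELU -> Linear.
    (w1, b1), (w2, b2) = params
    hidden = [[gelu(h) for h in row] for row in linear(z_v, w1, b1)]
    return linear(hidden, w2, b2)


rng = random.Random(0)

# Hypothetical toy sizes; the real encoder and language model are far wider.
D_VISION, D_TEXT, N_VISUAL, N_TEXT = 8, 16, 4, 3

params = (init_linear(D_VISION, D_TEXT, rng), init_linear(D_TEXT, D_TEXT, rng))

# Stand-in for the vision encoder output Z_v: N_VISUAL tokens of width D_VISION.
z_v = [[rng.gauss(0.0, 1.0) for _ in range(D_VISION)] for _ in range(N_VISUAL)]

# Projected visual tokens now live in the language embedding space.
h_v = project(z_v, params)

# Stand-in for embedded prompt tokens; visual tokens are prepended (assumption).
text_embeds = [[rng.gauss(0.0, 1.0) for _ in range(D_TEXT)] for _ in range(N_TEXT)]
lm_input = h_v + text_embeds

print(len(lm_input), len(lm_input[0]))
```

The point of the sketch is that the projector is the only component that has to reconcile the two embedding widths, which is why it can stay lightweight relative to the encoder and the language model.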
