A dual contrastive framework
Yuan Sun, Zhao Zhang, Jorge Ortiz
TL;DR
Region-level captioning in vision-language models remains challenging due to coarse pretraining and the difficulty of aligning frozen LLMs with image representations. AlignCap introduces a dual contrastive framework with two modules: a Latent Feature Refinement module that converts coarse vocabulary tagging into image-conditioned fine-grained representations, and a Semantic Space Alignment module that aligns multimodal inputs with a frozen LLM. A General Object Detection pipeline augments both modules to improve spatial reasoning. The approach formalizes an overall loss $\mathcal{L}_{\text{AlignCap}} = \alpha \cdot \mathcal{L}_{\text{tag}} + \beta \cdot \mathcal{L}_{\text{cap}} + \gamma \cdot \mathcal{L}_{\text{cond}} + \lambda \cdot \mathcal{L}_{\text{multi}}$, a weighted sum of the tagging, captioning, conditioned latent alignment, and multimodal alignment objectives with scalar weights $\alpha$, $\beta$, $\gamma$, and $\lambda$. Empirical results show significant improvements on region-level captioning tasks, demonstrating enhanced spatial awareness, richer region-specific descriptions, and better integration with frozen LLMs for practical multimodal reasoning.
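As an illustration of how the overall objective combines its four terms, the following is a minimal sketch in plain Python. The names `LossWeights` and `aligncap_loss` are hypothetical, and the default weights of 1.0 are placeholders rather than values taken from the paper; the four per-term losses are assumed to be already computed as scalars.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LossWeights:
    """Scalar weights for the four objectives (placeholder defaults, not from the paper)."""
    alpha: float = 1.0  # weight on the tagging loss L_tag
    beta: float = 1.0   # weight on the captioning loss L_cap
    gamma: float = 1.0  # weight on the conditioned latent alignment loss L_cond
    lam: float = 1.0    # weight on the multimodal alignment loss L_multi ("lambda" is reserved in Python)


def aligncap_loss(l_tag: float, l_cap: float, l_cond: float, l_multi: float,
                  w: LossWeights = LossWeights()) -> float:
    """Weighted sum: alpha * L_tag + beta * L_cap + gamma * L_cond + lambda * L_multi."""
    return w.alpha * l_tag + w.beta * l_cap + w.gamma * l_cond + w.lam * l_multi


# Usage with made-up loss values and weights.
weights = LossWeights(alpha=0.5, beta=1.0, gamma=2.0, lam=0.25)
total = aligncap_loss(l_tag=1.0, l_cap=2.0, l_cond=3.0, l_multi=4.0, w=weights)
```

In a training loop the same weighted sum would typically be applied to framework tensors instead of floats, so that gradients flow through every term in proportion to its weight.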
Abstract
In current multimodal tasks, models typically freeze the encoder and decoder while adapting intermediate layers to task-specific goals, such as region captioning. Region-level visual understanding presents significant challenges for large-scale vision-language models. While limited spatial awareness is a known issue, coarse-grained pretraining in particular exacerbates the difficulty of optimizing latent representations for effective encoder-decoder alignment. We propose AlignCap, a framework designed to enhance region-level understanding through fine-grained alignment of latent spaces. Our approach introduces a novel latent feature refinement module that enhances conditioned latent space representations to improve region-level captioning performance. We also propose an innovative alignment strategy, the semantic space alignment module, which boosts the quality of multimodal representations. Additionally, we incorporate contrastive learning in a novel manner within both modules to further enhance region-level captioning performance. To address spatial limitations, we employ a General Object Detection (GOD) method as a data preprocessing pipeline that enhances spatial reasoning at the regional level. Extensive experiments demonstrate that our approach significantly improves region-level captioning performance across various tasks.
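The abstract does not specify the form of the contrastive objectives used in the two modules. As a generic point of reference only, the sketch below implements a standard one-directional InfoNCE loss over a batch of paired embeddings in plain Python. The function name `info_nce`, the temperature of 0.07, and the toy vectors are assumptions for illustration and are not taken from AlignCap.

```python
import math


def _normalize(v):
    """Scale a vector to unit L2 norm (assumes the vector is non-zero)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def info_nce(anchors, positives, temperature=0.07):
    """Generic InfoNCE: anchors[i] should match positives[i] and repel positives[j != i].

    This is a textbook contrastive loss, not the paper's exact objective.
    """
    a = [_normalize(v) for v in anchors]
    p = [_normalize(v) for v in positives]
    total = 0.0
    for i, a_i in enumerate(a):
        # Cosine similarity of anchor i with every candidate, scaled by temperature.
        logits = [sum(x * y for x, y in zip(a_i, p_j)) / temperature for p_j in p]
        # Numerically stable log-sum-exp over all candidates.
        m = max(logits)
        log_z = m + math.log(sum(math.exp(l - m) for l in logits))
        # Cross-entropy with the matching pair (index i) as the target.
        total += log_z - logits[i]
    return total / len(a)


# Toy check: correctly paired embeddings give a lower loss than swapped ones.
region_feats = [[1.0, 0.0], [0.0, 1.0]]
aligned_loss = info_nce(region_feats, [[1.0, 0.0], [0.0, 1.0]])
swapped_loss = info_nce(region_feats, [[0.0, 1.0], [1.0, 0.0]])
```

A loss of this shape rewards representations in which each region embedding is closest to its own paired embedding, which is the general behavior that contrastive alignment relies on.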
