Lyrics: Boosting Fine-grained Language-Vision Alignment and Comprehension via Semantic-aware Visual Objects

Junyu Lu, Dixiang Zhang, Songxin Zhang, Zejian Xie, Zhuoyang Song, Cong Lin, Jiaxing Zhang, Bingyi Jing, Pingjian Zhang

TL;DR

Lyrics tackles fine-grained vision-language alignment in LVLMs by combining a visual refiner, which extracts semantic-aware visual objects, with a Multi-scale Querying Transformer (MQ-Former) that bridges image signals to an LLM. Training has two stages: multi-task pre-training with the ITC, ITM, ICG, and MSP objectives, followed by vision-to-language generative fine-tuning with LoRA. Together they enable precise object-level grounding and robust instruction-driven dialogue. Empirical results across 13 datasets and 11 benchmark toolkits show state-of-the-art or competitive performance on image captioning, VQA, and referring expression tasks, with notable reductions in visual hallucinations and improved multi-turn reasoning. The work demonstrates that integrating localized visual signals and spatial representations into a querying-then-generation pipeline yields strong generalization for fine-grained perception, grounding, and conversation in real-world scenarios.

Abstract

Large Vision Language Models (LVLMs) have demonstrated impressive zero-shot capabilities in various vision-language dialogue scenarios. However, the absence of fine-grained visual object detection hinders the model from understanding the details of images, leading to irreparable visual hallucinations and factual errors. In this paper, we propose Lyrics, a novel multi-modal pre-training and instruction fine-tuning paradigm that bootstraps vision-language alignment from fine-grained cross-modal collaboration. Building on the foundation of BLIP-2, Lyrics infuses local visual features into the Querying Transformer; these features are extracted by a visual refiner that includes image tagging, object detection and semantic segmentation modules. On the text side, the language inputs are equipped with the bounding boxes and tags derived from the visual refiner. We further introduce a two-stage training scheme, in which the pre-training stage bridges the modality gap through explicit and comprehensive vision-language alignment targets. During the instruction fine-tuning stage, we introduce semantic-aware visual feature extraction, a crucial method that enables the model to extract informative features from concrete visual objects. Our approach achieves robust performance on 13 datasets across various vision-language tasks, and demonstrates promising multi-modal understanding, perception and conversation capabilities in 11 scenario-based benchmark toolkits.
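
One of the alignment targets named in the TL;DR is ITC, commonly expanded as image-text contrastive learning. The sketch below shows how such a loss could be computed in the BLIP-2 style, where each image is represented by several query embeddings and the best-matching query is scored against each text embedding. It is an illustration under stated assumptions, not the Lyrics implementation: the function names, the temperature of 0.07, and the max-over-queries pooling are assumptions carried over from the BLIP-2 convention.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def normalize(v):
    # Assumes v is not the zero vector.
    norm = math.sqrt(dot(v, v))
    return [a / norm for a in v]

def log_softmax(row):
    m = max(row)
    lse = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - lse for x in row]

def itc_loss(query_feats, text_feats, temperature=0.07):
    """Symmetric image-text contrastive loss (hypothetical helper).

    query_feats: one list of query embeddings per image, shape [N][Q][D].
    text_feats:  one embedding per caption, shape [N][D].
    Pair i is the positive; every other pairing in the batch is a negative.
    """
    q = [[normalize(v) for v in sample] for sample in query_feats]
    t = [normalize(v) for v in text_feats]
    n = len(t)
    # sim[i][j]: cosine similarity of the best-matching query of image i with text j.
    sim = [[max(dot(qv, t[j]) for qv in q[i]) / temperature for j in range(n)]
           for i in range(n)]
    image_to_text = -sum(log_softmax(sim[i])[i] for i in range(n)) / n
    columns = [[sim[i][j] for i in range(n)] for j in range(n)]
    text_to_image = -sum(log_softmax(columns[j])[j] for j in range(n)) / n
    return (image_to_text + text_to_image) / 2

# Toy batch: two images with two query embeddings each, two captions.
queries = [[[1.0, 0.0], [0.5, 0.5]], [[0.0, 1.0], [0.5, 0.5]]]
captions = [[1.0, 0.0], [0.0, 1.0]]
loss = itc_loss(queries, captions)
```

The loss is small when each image's queries point toward its own caption and grows when captions are shuffled, which is the behaviour a contrastive alignment target is meant to reward.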

Paper Structure (28 sections, 5 equations, 6 figures, 7 tables)

Figures (6)

  • Figure 1: The two-stage training framework of Lyrics, in which the MQ-Former bridges the modality gap from the image encoder and the visual refiner to the LLM. The first stage bootstraps vision-language representation alignment via multi-task pre-training. The second stage bootstraps instructed vision-language generative learning via semantic-aware visual objects.
  • Figure 2: (Left) Model architecture of the Multi-scale Querying Transformer (MQ-Former). The frozen global and local visual features are inserted into every image transformer block to interact with learnable queries. (Right) The pipeline of the visual refiner, which consists of an image tagging module, an object detection module and a semantic segmentation module.
  • Figure 3: The learning objectives in vision-language representation alignment. We jointly optimize four objectives that force the queries (a set of learnable embeddings) to extract the visual representations most relevant to the text. A separate self-attention masking strategy for each objective controls query-text interaction.
  • Figure 4: (a) The pre-training data scaling performance on VQAv2, RefCOCOg (test set), LLaVA-Bench and HallusionBench. (b) The comparison of full, LoRA and frozen training in the instruction fine-tuning stage.
  • Figure 5: Examples of the multi-modal capabilities of Lyrics. We showcase that our method is capable of various visual-centric tasks, including multi-turn visual conversation, visual scene understanding and reasoning, commonsense-grounded image description, and referential dialogue.
  • ...and 1 more figures
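
The Figure 3 caption mentions a self-attention masking strategy per objective that controls query-text interaction. A minimal sketch of how such masks could be built is given below, assuming the BLIP-2 Q-Former conventions that Lyrics builds on: a uni-modal mask for ITC, a bi-directional mask for ITM, and a multi-modal causal mask for caption generation (ICG). The function name `build_attention_mask` and the boolean-matrix layout are hypothetical, and the mask used for MSP is not described in this section, so it is left out.

```python
def build_attention_mask(num_queries, num_text, objective):
    """Return an (n x n) boolean matrix; mask[i][j] is True if position i may attend to j.

    Positions 0..num_queries-1 are learnable queries, the rest are text tokens.
    The three patterns follow the BLIP-2 Q-Former convention (an assumption here).
    """
    n = num_queries + num_text
    mask = [[False] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            i_is_query = i < num_queries
            j_is_query = j < num_queries
            if objective == "itc":
                # Uni-modal: queries and text never see each other.
                allowed = i_is_query == j_is_query
            elif objective == "itm":
                # Bi-directional: every position sees every other position.
                allowed = True
            elif objective == "icg":
                # Multi-modal causal: queries see only queries;
                # text sees all queries plus itself and earlier text tokens.
                allowed = j_is_query if i_is_query else (j_is_query or j <= i)
            else:
                raise ValueError(f"unsupported objective: {objective}")
            mask[i][j] = allowed
    return mask

# Example: 2 queries followed by 3 text tokens under the captioning mask.
for row in build_attention_mask(2, 3, "icg"):
    print("".join("1" if allowed else "0" for allowed in row))
```

Under the captioning mask the text can only reach the image through the queries, so the queries have to carry whatever visual information the caption needs.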