
Reminding Multimodal Large Language Models of Object-aware Knowledge with Retrieved Tags

Daiqing Qi, Handong Zhao, Zijun Wei, Sheng Li

TL;DR

TUNA (Tag-grounded visual instruction tuning with retrieval Augmentation) outperforms baselines that share the same language model and training data on 12 benchmarks, and shows zero-shot capability when provided with specific datastores.

Abstract

Despite recent advances in the general visual instruction-following ability of Multimodal Large Language Models (MLLMs), they still struggle with critical problems when required to provide a precise and detailed response to a visual instruction: (1) failure to identify novel objects or entities, (2) mention of non-existent objects, and (3) neglect of objects' attribute details. Intuitive solutions include improving the size and quality of data or using larger foundation models. These are effective in mitigating the issues, but at the high cost of collecting a vast amount of new data and introducing a significantly larger model. Standing at the intersection of these approaches, we examine the three object-oriented problems from the perspective of the image-to-text mapping process performed by the multimodal connector. In this paper, we first identify the limitations of multimodal connectors stemming from insufficient training data. Driven by this, we propose to enhance the mapping with retrieval-augmented tag tokens, which contain rich object-aware information such as object names and attributes. With our Tag-grounded visual instruction tuning with retrieval Augmentation (TUNA), we outperform baselines that share the same language model and training data on 12 benchmarks. Furthermore, we show the zero-shot capability of TUNA when provided with specific datastores.
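
The retrieval step mentioned in the abstract can be pictured with a small sketch. This is not the authors' implementation: it assumes a hypothetical datastore that pairs each stored image embedding with a list of tags, and it uses plain cosine similarity to pick the top-k neighbours of a query embedding. The function names, the toy embeddings, and the value of `k` are all illustrative assumptions.

```python
import math


def cosine(a, b):
    """Cosine similarity between two non-zero vectors given as lists of floats."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def retrieve_tags(query_emb, store_embs, store_tags, k=3):
    """Return (tags, indices) for the k datastore entries closest to the query.

    Hypothetical helper: store_embs[i] is an image embedding and
    store_tags[i] is the list of tags attached to that stored image.
    Tags are merged in rank order with duplicates dropped.
    """
    sims = [cosine(query_emb, emb) for emb in store_embs]
    # Indices sorted by descending similarity; keep the k best.
    top = sorted(range(len(sims)), key=lambda i: sims[i], reverse=True)[:k]
    tags = []
    for i in top:
        for tag in store_tags[i]:
            if tag not in tags:
                tags.append(tag)
    return tags, top


if __name__ == "__main__":
    # Toy datastore with made-up 2-D embeddings and tags.
    store_embs = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1]]
    store_tags = [["mangosteen", "purple"], ["knife"], ["mangosteen", "fruit"]]
    tags, idx = retrieve_tags([1.0, 0.0], store_embs, store_tags, k=2)
    print(idx, tags)
```

In this toy setup the retrieved tags would then be encoded as extra tokens alongside the image and instruction; how TUNA encodes and weights them is described in the paper itself and is not modelled here.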

Paper Structure (25 sections, 11 figures, 11 tables)

Figures (11)

  • Figure 1: Examples on LLaVA-W (left), and quantitative comparison (right). Imprecise, low-quality answers are marked in red and high-quality parts are marked in green. Popular open-source MLLMs fail to identify the mangosteen (the first question) and list non-existent objects such as 'knife' as well as incorrect quantities and arrangements, while ours correctly identifies 'mangosteens' and describes them in detail.
  • Figure 2: Top: the process of translating image embeddings to text embeddings (LLaVA; Liu et al., 2024). Bottom: image classification accuracy of CLIP (Radford et al., 2021) and MLLMs built on it.
  • Figure 3: Examples of tags derived from parsing and NER results.
  • Figure 4: Framework of TUNA. Left: overall architecture. A language instruction, an image, and retrieved tags are transformed into tokens and input to the LLM. Only the CLIP encoders are frozen. Right: architecture of the image-aware tag encoder, which produces tag representations from the retrieved tags and the input image.
  • Figure 5: VQA examples of TUNA. For each example, we show only the top 3 retrieved images to save space. We show the tag sets associated with all retrieved images, together with their tuned weights as a heat map, where the brightest region corresponds to the highest weight (1) and the darkest region to the lowest weight (0) (zoom in for a better view). Correct answers are marked in green and wrong ones in red. More examples are available in Appendix E.
  • ...and 6 more figures