TagAlign: Improving Vision-Language Alignment with Multi-Tag Classification

Qinying Liu, Wei Wu, Kecheng Zheng, Zhan Tong, Jiawei Liu, Yu Liu, Wei Chen, Zilei Wang, Yujun Shen

TL;DR

This work proposes an embarrassingly simple approach that better aligns image and text features using only image-text pairs, with no additional data formats, and shows that attribute supervision helps vision-language models accurately localize attribute-specified objects.

Abstract

The crux of learning vision-language models is to extract semantically aligned information from visual and linguistic data. Existing attempts usually suffer from coarse alignment, e.g., the vision encoder struggles to localize an attribute-specified object. In this work, we propose an embarrassingly simple approach to better align image and text features that requires no data formats other than image-text pairs. Concretely, given an image and its paired text, we parse objects (e.g., cat) and attributes (e.g., black) from the description, which are highly likely to exist in the image. Notably, the parsing pipeline is fully automatic and thus scales well. With these parsed semantics as supervision signals, we complement the commonly used image-text contrastive loss with a multi-tag classification loss. Extensive experiments on a broad suite of semantic segmentation datasets show an average improvement of 5.2% over existing alternatives. Furthermore, visualization results indicate that attribute supervision enables vision-language models to accurately localize attribute-specified objects. The project page can be found at https://qinying-liu.github.io/Tag-Align.
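
To make the training objective described in the abstract concrete, here is a minimal sketch of an image-text contrastive loss combined with a multi-tag classification loss. It is an illustration under stated assumptions, not the authors' implementation: the function names, the plain binary cross-entropy form of the classification term, and the `weight` parameter are all hypothetical, and the tag sets are assumed to have already been parsed from the captions.

```python
import math


def log_softmax(row):
    """Numerically stable log-softmax of a list of floats."""
    m = max(row)
    lse = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - lse for x in row]


def contrastive_loss(sim):
    """Symmetric InfoNCE over a square image-text similarity matrix.

    sim[i][j] is the (already temperature-scaled) similarity between
    image i and text j; matched pairs lie on the diagonal.
    """
    n = len(sim)
    image_to_text = -sum(log_softmax(sim[i])[i] for i in range(n)) / n
    columns = [[sim[i][j] for i in range(n)] for j in range(n)]
    text_to_image = -sum(log_softmax(columns[j])[j] for j in range(n)) / n
    return 0.5 * (image_to_text + text_to_image)


def multi_hot(tags, vocab):
    """Turn a set of parsed tags into a multi-hot target over the tag vocabulary."""
    return [1.0 if tag in tags else 0.0 for tag in vocab]


def multi_tag_loss(logits, targets):
    """Mean binary cross-entropy with logits over all tags of one image.

    Assumption: plain BCE stands in for the multi-tag classification loss;
    the source text does not specify the exact form.
    """
    total = 0.0
    for z, y in zip(logits, targets):
        total += max(z, 0.0) - z * y + math.log1p(math.exp(-abs(z)))
    return total / len(logits)


def total_loss(sim, tag_logits, tag_targets, weight=1.0):
    """Contrastive loss plus a weighted multi-tag classification loss.

    `weight` is a hypothetical balancing coefficient.
    """
    cls = sum(
        multi_tag_loss(logits, targets)
        for logits, targets in zip(tag_logits, tag_targets)
    ) / len(tag_logits)
    return contrastive_loss(sim) + weight * cls


# Toy usage: two image-text pairs and a three-tag vocabulary.
vocab = ["cat", "dog", "black"]
targets = [multi_hot({"cat", "black"}, vocab), multi_hot({"dog"}, vocab)]
sim = [[5.0, 0.0], [0.0, 5.0]]                   # well-aligned pairs
tag_logits = [[4.0, -4.0, 4.0], [-4.0, 4.0, -4.0]]  # confident, correct tags
loss = total_loss(sim, tag_logits, targets)
```

In this sketch the object tag ("cat") and the attribute tag ("black") share one vocabulary, so both kinds of supervision enter through the same multi-hot target; separating them into two vocabularies would only change how `vocab` is built.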

Paper Structure (24 sections, 8 equations, 8 figures, 9 tables, 1 algorithm)

Figures (8)

  • Figure 1: Illustration of the effect of various tag supervisions (e.g., object and attribute) on open-vocabulary semantic segmentation. Vanilla CLIP struggles to localize an attribute-specified object. With object supervision (third column), CLIP focuses more accurately on the region of the text-specified object (e.g., cat). Adding attribute supervision further gives CLIP a stronger understanding of visual attribute-related concepts (fourth column). Best viewed in color.
  • Figure 2: Performance comparison on various semantic segmentation benchmarks. TagAlign outperforms existing methods by a large margin on all of these benchmarks. Best viewed in color.
  • Figure 3: The framework of TagAlign. At the core of TagAlign are two key components: a) LLM-aided tag parsing that parses the image captions into diverse tags (i.e., objects and attributes); b) Multi-tag classification that utilizes the parsed tags to supervise the model training. Best viewed in color.
  • Figure 4: Illustration of the most frequent tags in the (a) object tag list and (b) attribute tag list. Font size corresponds to a tag's frequency of occurrence in the tag list: more frequent tags use larger fonts. (c) Statistics of object and attribute tags from the LLM parser. The distribution of the tags exhibits a long-tailed pattern. Best viewed in color.
  • Figure 5: Comparison of tags extracted with text parsing methods (i.e., NLTK and LLM). The tags in red are misidentified by the parser. Best viewed in color.
  • ...and 3 more figures