Object Attribute Matters in Visual Question Answering

Peize Li, Qingyi Si, Peng Fu, Zheng Lin, Yan Wang

TL;DR

This paper tackles the challenge of aligning visual and linguistic information in Visual Question Answering (VQA) by using object attributes as explicit semantic anchors. It introduces OAM-VQA, a framework with two components: an Attribute Fusion Module (AFM), which builds a multimodal graph integrating object attributes and visual features, and a Contrastive Knowledge Distillation Module (CKDM), which injects implicit knowledge from vision-language pre-trained models via a contrastive loss to sharpen attribute representations. The method improves results on image-understanding and out-of-distribution (OOD) tasks across six datasets, with notable gains in counting and complex scene understanding. These results indicate that fusing object attributes with visual features and augmenting them through distillation strengthens object-level visual-language alignment and generalization, and they suggest attribute-centric multimodal reasoning as a promising direction. The training objective combines the VQA loss $L_{vqa}$ with the contrastive loss $L_{cl}$ to jointly optimize answer prediction and attribute knowledge alignment, leading to practical improvements in robustness and interpretability.
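
One common way to write a joint objective of this kind is a weighted sum with an InfoNCE-style contrastive term. The form below is only an illustrative sketch: the weight $\lambda$, the temperature $\tau$, the similarity function $\mathrm{sim}$, and the pairing of attribute features $a_i$ with knowledge features $k_i$ are assumptions introduced here, not details taken from the paper.

$$
L = L_{vqa} + \lambda \, L_{cl}, \qquad
L_{cl} = -\frac{1}{N} \sum_{i=1}^{N} \log \frac{\exp\left(\mathrm{sim}(a_i, k_i)/\tau\right)}{\sum_{j=1}^{N} \exp\left(\mathrm{sim}(a_i, k_j)/\tau\right)}
$$

Under these assumptions, each attribute feature $a_i$ is pulled toward its matching knowledge feature $k_i$ and pushed away from the other $N-1$ knowledge features in the batch, while $\lambda$ controls how strongly this alignment term competes with answer prediction.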

Abstract

Visual question answering is a multimodal task that requires the joint comprehension of visual and textual information. However, integrating visual and textual semantics solely through attention layers is insufficient to comprehensively understand and align information from both modalities. Intuitively, object attributes can naturally serve as a bridge to unify them, which has been overlooked in previous research. In this paper, we propose a novel VQA approach from the perspective of utilizing object attributes, aiming to achieve better object-level visual-language alignment and multimodal scene understanding. Specifically, we design an attribute fusion module and a contrastive knowledge distillation module. The attribute fusion module constructs a multimodal graph neural network to fuse attributes and visual features through message passing. The enhanced object-level visual features contribute to solving fine-grained problems such as counting questions. The better object-level visual-language alignment aids in understanding multimodal scenes, thereby improving the model's robustness. Furthermore, to augment scene understanding and out-of-distribution performance, the contrastive knowledge distillation module introduces several kinds of implicit knowledge. We distill this knowledge into attributes through a contrastive loss, which further strengthens the representation learning of attribute features and facilitates visual-linguistic alignment. Extensive experiments on six datasets, COCO-QA, VQAv2, VQA-CPv2, VQA-CPv1, VQAvs and TDIUC, show the superiority of the proposed method.
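
The message-passing idea behind the attribute fusion module can be illustrated with a small, parameter-free sketch. This is not the paper's graph neural network: the function name `cross_graph_step`, the edge list linking objects to their attributes, the mean aggregation, and the mixing weight `alpha` are all hypothetical choices made for illustration, and a real model would use learned transformations instead.

```python
from typing import List, Tuple

Vector = List[float]


def _mix(own: Vector, neighbours: List[Vector], alpha: float) -> Vector:
    """Blend a node's feature with the mean of its neighbours' features.

    Nodes without neighbours are returned unchanged.
    """
    if not neighbours:
        return list(own)
    dim = len(own)
    mean = [sum(n[d] for n in neighbours) / len(neighbours) for d in range(dim)]
    return [(1 - alpha) * own[d] + alpha * mean[d] for d in range(dim)]


def cross_graph_step(
    visual: List[Vector],
    attrs: List[Vector],
    edges: List[Tuple[int, int]],
    alpha: float = 0.5,
) -> Tuple[List[Vector], List[Vector]]:
    """One synchronous round of message passing between two subgraphs.

    `edges` holds (visual_index, attribute_index) pairs (an assumed linkage:
    the attribute describes that object). Both updates read the old features,
    so the order of the two list comprehensions does not matter.
    """
    new_visual = [
        _mix(v, [attrs[j] for i2, j in edges if i2 == i], alpha)
        for i, v in enumerate(visual)
    ]
    new_attrs = [
        _mix(a, [visual[i] for i, j2 in edges if j2 == j], alpha)
        for j, a in enumerate(attrs)
    ]
    return new_visual, new_attrs


# Toy example: object 0 is linked to attributes 0 and 1; object 1 and
# attribute 2 have no cross-graph links and stay unchanged.
visual = [[1.0, 0.0], [0.0, 1.0]]
attrs = [[0.0, 2.0], [2.0, 0.0], [4.0, 4.0]]
edges = [(0, 0), (0, 1)]
fused_visual, fused_attrs = cross_graph_step(visual, attrs, edges)
```

In this toy setup, the feature of object 0 moves toward the average of its two attribute features, and each linked attribute moves toward the object it describes, which is the object-level fusion the abstract describes in spirit.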

Paper Structure

This paper contains 27 sections, 14 equations, 5 figures, 6 tables.

Figures (5)

  • Figure 1: An illustration of our motivation. Compared with previous multimodal content, object-level attributes are indispensable in both object counting (a) and scene understanding (b).
  • Figure 2: The overview of our attribute-centric approach. The visual description module generates descriptive text for object attributes. The attribute fusion module establishes a multimodal graph and fuses attribute features with visual features by passing messages between the two subgraphs. The contrastive knowledge distillation module introduces implicit knowledge to supplement information that the attributes cannot cover. On this basis, a contrastive loss is adopted to further strengthen and enrich the representation of attribute features. The blue and red arrows between nodes in the two graphs represent the direction of information flow.
  • Figure 3: Performance with different types of visual descriptions. VinVL generates object-level attributes, BLIP2 generates image-level global captions, and mPLUG-Owl generates image-level detailed descriptions.
  • Figure 4: Performance with different question types. The red bar represents our approach, and the blue bar represents LXMERT.
  • Figure 5: Examples of four different question types on the COCO-QA dataset.