Multi-Sourced Compositional Generalization in Visual Question Answering

Chuanhao Li, Wenbo Ye, Zhen Li, Yuwei Wu, Yunde Jia

TL;DR

This work tackles multi-sourced compositional generalization (MSCG) in visual question answering (VQA) with a retrieval-augmented training framework that aligns semantic primitives across the linguistic and visual modalities. The framework builds separate linguistic and visual primitive databases, retrieves semantically similar primitives during training, and aggregates the retrieved features with the originals to learn unified cross-modal representations. A new dataset, GQA-MSCG, evaluates MSCG on LL, VV, and LV compositions and their co-occurrences. Experiments on GQA and VQA v2 report improved MSCG and IID generalization for both small and large models, supported by ablations, parameter analyses, and qualitative examples of reasoning on novel compositions.

Abstract

Compositional generalization is the ability to generalize to novel compositions of seen primitives, and has recently received much attention in vision-and-language (V&L) research. Due to the multi-modal nature of V&L tasks, the primitives that make up a composition come from different modalities, resulting in multi-sourced novel compositions. However, the ability to generalize over multi-sourced novel compositions, i.e., multi-sourced compositional generalization (MSCG), remains unexplored. In this paper, we explore MSCG in the context of visual question answering (VQA), and propose a retrieval-augmented training framework that enhances the MSCG ability of VQA models by learning unified representations for primitives from different modalities. Specifically, semantically equivalent primitives are retrieved for each primitive in the training samples, and the retrieved features are aggregated with the features of the original primitive to refine the model. This process helps the model learn consistent representations for the same semantic primitive across different modalities. To evaluate the MSCG ability of VQA models, we construct a new GQA-MSCG dataset based on the GQA dataset, in which samples include three types of novel compositions composed of primitives from different modalities. Experimental results demonstrate the effectiveness of the proposed framework. We release GQA-MSCG at https://github.com/NeverMoreLCH/MSCG.
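
The retrieve-then-aggregate step described in the abstract can be illustrated with a small sketch. It is not the authors' implementation: the function names, the use of cosine similarity for retrieval, the mean-plus-blend aggregation, the `alpha` weight, and the toy 3-dimensional features are all assumptions chosen only to make the idea concrete.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve(query, database, k):
    """Return the k database features most similar to the query (assumed metric: cosine)."""
    ranked = sorted(database, key=lambda feat: cosine(query, feat), reverse=True)
    return ranked[:k]


def aggregate(original, retrieved, alpha=0.5):
    """Blend the original feature with the mean of the retrieved features.

    alpha is a hypothetical mixing weight, not a parameter taken from the paper.
    """
    if not retrieved:
        return list(original)
    mean = [sum(col) / len(retrieved) for col in zip(*retrieved)]
    return [alpha * o + (1 - alpha) * m for o, m in zip(original, mean)]


# Toy primitive databases with made-up 3-d features (illustrative only).
linguistic_db = [[1.0, 0.0, 0.0], [0.9, 0.1, 0.0], [0.0, 1.0, 0.0]]
visual_db = [[0.8, 0.2, 0.0], [0.0, 0.0, 1.0]]

# Feature of one primitive from a training sample, e.g. a word in the question.
word_feature = [1.0, 0.1, 0.0]

# Retrieve the nearest primitive from each modality, then aggregate.
neighbours = retrieve(word_feature, linguistic_db, k=1) + retrieve(word_feature, visual_db, k=1)
refined = aggregate(word_feature, neighbours, alpha=0.5)
print(refined)
```

Retrieving from both databases for a single primitive is what pulls the linguistic and visual representations of the same concept toward each other in this sketch. In a real model the features would come from learned encoders, and the aggregation would sit inside the training loop.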

Paper Structure

This paper contains 16 sections, 2 equations, 6 figures, 4 tables.

Figures (6)

  • Figure 1: Multi-sourced novel compositions in the context of VQA.
  • Figure 2: Overview of the proposed framework.
  • Figure 3: Level-1 samples in the GQA-MSCG dataset.
  • Figure 4: Level-2 samples and Level-3 samples in the GQA-MSCG dataset.
  • Figure 5: Parameter analysis using CFR as the baseline model on the GQA-MSCG dataset.
  • ...and 1 more figure