Relation-Aware Graph Attention Network for Visual Question Answering

Linjie Li, Zhe Gan, Yu Cheng, Jingjing Liu

TL;DR

The paper addresses visual question answering (VQA), where answering a question often requires reasoning about relations between objects in an image. It introduces the Relation-aware Graph Attention Network (ReGAT), which builds explicit and implicit inter-object graphs and learns question-adaptive relation representations with graph attention networks. ReGAT is designed as a generic relation encoder that plugs into existing VQA architectures: its relation-aware visual features are fused with the question representation to predict an answer. On VQA 2.0 and VQA-CP v2, ReGAT outperforms prior state-of-the-art approaches and boosts the performance of the existing architectures it is added to, which supports modeling both explicit and implicit relations. The approach may also be relevant to other multimodal tasks that require relational reasoning.

Abstract

In order to answer semantically complicated questions about an image, a Visual Question Answering (VQA) model needs to fully understand the visual scene in the image, especially the interactive dynamics between different objects. We propose a Relation-aware Graph Attention Network (ReGAT), which encodes each image into a graph and models multi-type inter-object relations via a graph attention mechanism, to learn question-adaptive relation representations. Two types of visual object relations are explored: (i) Explicit Relations that represent geometric positions and semantic interactions between objects; and (ii) Implicit Relations that capture the hidden dynamics between image regions. Experiments demonstrate that ReGAT outperforms prior state-of-the-art approaches on both VQA 2.0 and VQA-CP v2 datasets. We further show that ReGAT is compatible with existing VQA architectures, and can be used as a generic relation encoder to boost the model performance for VQA.
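
The question-adaptive graph attention described above can be illustrated with a small sketch. This is not the authors' implementation: it assumes plain scaled dot-product scores with no learned projection matrices, a single attention head, and a hypothetical function name `relation_attention`. The question embedding is concatenated to every region feature so that the attention depends on the question, and the adjacency matrix decides which regions may attend to each other (fully connected for an implicit-relation graph, sparse for an explicit one).

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def relation_attention(regions, question, adjacency):
    """One simplified graph-attention step over object regions (illustrative only).

    regions:   list of N region feature vectors (lists of floats)
    question:  question embedding (list of floats), appended to every region
    adjacency: N x N list of 0/1; adjacency[i][j] == 1 means j is a neighbor of i
    Returns (attended_features, attention_weights).
    """
    # Make node features question-adaptive by concatenating the question embedding.
    nodes = [r + question for r in regions]
    n, dim = len(nodes), len(nodes[0])
    outputs, weights = [], []
    for i, vi in enumerate(nodes):
        # Assumption: a node without neighbors attends only to itself.
        neighbors = [j for j in range(n) if adjacency[i][j]] or [i]
        # Scaled dot-product score against each neighbor (no learned projections).
        scores = [sum(a * b for a, b in zip(vi, nodes[j])) / math.sqrt(dim)
                  for j in neighbors]
        alphas = softmax(scores)
        # Attention-weighted sum of neighbor features, followed by ReLU.
        agg = [0.0] * dim
        row = [0.0] * n
        for alpha, j in zip(alphas, neighbors):
            row[j] = alpha
            for k in range(dim):
                agg[k] += alpha * nodes[j][k]
        outputs.append([max(0.0, x) for x in agg])
        weights.append(row)
    return outputs, weights


# Toy example: three regions with 2-d features and a 2-d question embedding.
regions = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
question = [0.5, -0.5]
fully_connected = [[1] * 3 for _ in range(3)]  # implicit-relation graph
features, attn = relation_attention(regions, question, fully_connected)
```

Each row of `attn` sums to 1 over that region's neighbors, and entries for non-neighbors stay at 0. A trained model would add learned projections and, for explicit relations, terms that depend on the edge label and direction; both are omitted here to keep the sketch self-contained.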

Paper Structure

This paper contains 24 sections, 10 equations, 5 figures, 4 tables.

Figures (5)

  • Figure 1: An overview of the ReGAT model. Both explicit relations (semantic and spatial) and implicit relations are considered. The proposed relation encoder captures question-adaptive object interactions using Graph Attention Networks (GATs).
  • Figure 2: An overview of the proposed ReGAT model for visual question answering. Faster R-CNN is first employed to detect a set of object regions. Next, region-level features are fed into different relation encoders to learn relation-aware visual features, which are then fused with the question representation to predict an answer.
  • Figure 3: Illustration of (a) spatial relations and (b) semantic relations. The green arrows denote the direction of relations (subject -> object). Labels in green boxes are class labels of relations. Red and Blue boxes contain class labels of objects.
  • Figure 4: Visualization of attention maps learned from ablated instances on the VQA task: (a) Semantic Relation, (b) Spatial Relation and (c) Implicit Relation. The three bounding boxes shown in each image are the top-3 attended regions.
  • Figure 5: Visualization of different types of visual object relations on the VQA task: (a) Spatial Relation, (b) Semantic Relation and (c) Implicit Relation. The three bounding boxes shown in each image are the top-3 attended regions. The green arrows indicate relations from subject to object. Labels and numbers in green boxes are class labels for semantic relations and attention weights for implicit relations.