UniQ: Unified Decoder with Task-specific Queries for Efficient Scene Graph Generation

Xinyao Liao, Wei Wei, Dangyang Chen, Yuanyuan Fu

TL;DR

This paper tackles scene graph generation by addressing the weak-entanglement problem in one-stage methods. It introduces UniQ, a unified transformer decoder that takes task-specific queries for subjects, objects, and predicates: the queries extract decoupled features for each sub-task, while a triplet self-attention mechanism couples features within each triplet. Key innovations include relation-aware task-specific queries, triplet-coupled self-attention, and decoupled parallel decoding, combined with end-to-end training and a one-to-many assignment that increases the number of positive samples and improves learning efficiency. Results on Visual Genome (VG150) show that UniQ outperforms prior one- and two-stage methods with fewer parameters, and ablations validate the effectiveness of each component and the transferability of the STS paradigm.

Abstract

Scene Graph Generation (SGG) is a scene understanding task that aims at identifying object entities and reasoning about their relationships within a given image. In contrast to prevailing two-stage methods based on a large object detector (e.g., Faster R-CNN), one-stage methods integrate a fixed-size set of learnable queries to jointly reason about relational triplets <subject, predicate, object>. This paradigm demonstrates robust performance with significantly reduced parameters and computational overhead. However, the challenge in one-stage methods stems from the issue of weak entanglement, wherein entities involved in relationships require both coupled features shared within triplets and decoupled visual features. Previous methods either adopt a single decoder for coupled triplet feature modeling or multiple decoders for separate visual feature extraction, but fail to consider both. In this paper, we introduce UniQ, a Unified decoder with task-specific Queries architecture, where task-specific queries generate decoupled visual features for subjects, objects, and predicates respectively, and a unified decoder enables coupled feature modeling within relational triplets. Experimental results on the Visual Genome dataset demonstrate that UniQ outperforms both one-stage and two-stage methods.
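
The coupling idea in the abstract can be illustrated with a minimal sketch of self-attention restricted to the three task-specific queries of a single triplet. This is not the authors' implementation: the function name `triplet_self_attention`, the use of identity projections instead of learned weight matrices, and the toy 4-dimensional query vectors are all assumptions made for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def triplet_self_attention(subj, pred, obj):
    """Mix the subject, predicate, and object queries of ONE triplet.

    Assumption: identity projections (no learned W_q, W_k, W_v), so this
    only shows the information flow, not a trainable layer.
    Returns the three updated vectors and the 3x3 attention weights.
    """
    tokens = [subj, pred, obj]
    dim = len(subj)
    scale = math.sqrt(dim)
    weights, updated = [], []
    for q in tokens:
        # Scaled dot-product scores of this query against all three tokens.
        scores = [sum(a * b for a, b in zip(q, k)) / scale for k in tokens]
        w = softmax(scores)
        weights.append(w)
        # Each updated vector is a weighted average of the three tokens.
        updated.append(
            [sum(w[j] * tokens[j][i] for j in range(3)) for i in range(dim)]
        )
    return updated, weights


# Hypothetical toy queries for one triplet.
subj_q = [1.0, 0.0, 0.0, 0.0]
pred_q = [0.0, 1.0, 0.0, 0.0]
obj_q = [0.0, 0.0, 1.0, 0.0]
updated, weights = triplet_self_attention(subj_q, pred_q, obj_q)
```

After this step each of the three vectors carries information from the other two, which is the "coupled" part; the "decoupled" part would come from each query then attending to image features on its own, which this sketch does not cover.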

Paper Structure (27 sections, 10 equations, 5 figures, 5 tables)

This paper contains 27 sections, 10 equations, 5 figures, 5 tables.

Figures (5)

  • Figure 1: Comparison of Baselines. We compare the number of parameters and Recall@100 of the three baselines in Section 3 and our method, UniQ. The STS baseline formulation adopted in UniQ achieves better performance with fewer parameters.
  • Figure 2: Formulation of baselines. (a) Single decoder with task-agnostic queries: The single decoder takes triplet queries as input. Each query predicts a whole triplet. (b) Single decoder with task-specific queries: The task-specific queries are input into a shared decoder. Each query type corresponds to one sub-task. (c) Three decoders with task-specific queries: Three decoders separately predict each component of the triplets.
  • Figure 3: Architecture Illustration. (a) The Image Feature Extractor takes images as input and maps them to condensed image representations using a CNN backbone and a transformer encoder. (b) The Query Generator forms task-specific relation-aware queries for decoding. (c) The Relational Triplet Predictor has a triplet self-attention module for capturing interactions within each triplet and a unified decoder for separately extracting the visual features of each sub-task. (d) The output is generated by an FFN.
  • Figure 4: Ablation on the transferability of the STS paradigm.
  • Figure 5: Qualitative Results. We visualize the decoder's attention maps for the STA baseline (without task-specific queries) and our UniQ. The first column shows the attention maps for relationships generated by the STA baseline. The second to fourth columns show the attention maps for subjects, predicates, and objects respectively. The yellow rectangles denote the bounding boxes of subjects and the red rectangles denote the bounding boxes of objects.
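
Figure 1 reports Recall@100. As a rough illustration of what a Recall@K number measures, the sketch below computes the fraction of ground-truth triplets found among the K highest-scoring predictions. It is a simplification under stated assumptions: triplets are matched on their labels only (standard SGG evaluation also requires the predicted boxes to overlap the ground-truth boxes), and the function name `recall_at_k` and the sample data are hypothetical.

```python
def recall_at_k(predictions, ground_truth, k):
    """Fraction of ground-truth triplets recovered in the top-k predictions.

    predictions: list of (score, (subject, predicate, object)) tuples.
    ground_truth: set of (subject, predicate, object) tuples.
    Assumption: label-only matching, no bounding-box IoU check.
    """
    if not ground_truth:
        return 0.0
    # Rank predictions by confidence, highest first.
    ranked = sorted(predictions, key=lambda p: p[0], reverse=True)
    top_k = {triplet for _, triplet in ranked[:k]}
    hits = len(top_k & set(ground_truth))
    return hits / len(ground_truth)


# Hypothetical predictions for one image.
preds = [
    (0.9, ("man", "riding", "horse")),
    (0.8, ("man", "wearing", "hat")),
    (0.3, ("horse", "on", "grass")),
    (0.2, ("hat", "on", "man")),
]
gt = {("man", "riding", "horse"), ("horse", "on", "grass")}

recall_at_2 = recall_at_k(preds, gt, 2)  # one of two ground-truth triplets found
recall_at_3 = recall_at_k(preds, gt, 3)  # both ground-truth triplets found
```

Raising K can only keep or increase the score, which is why results are usually reported at several fixed cutoffs.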