Groupwise Query Specialization and Quality-Aware Multi-Assignment for Transformer-based Visual Relationship Detection

Jongha Kim; Jihwan Park; Jinyoung Park; Jinyoung Kim; Sehyung Kim; Hyunwoo J. Kim

TL;DR

Experimental results and analyses show that SpeaQ effectively trains ‘specialized’ queries, which better utilize the capacity of a model, resulting in consistent performance gains with ‘zero’ additional inference cost across multiple VRD models and benchmarks.

Abstract

Visual Relationship Detection (VRD) has recently seen significant advancements with Transformer-based architectures. However, we identify two key limitations in the conventional label assignment used to train Transformer-based VRD models, that is, the process of mapping a ground-truth (GT) to a prediction. First, under the conventional assignment, queries are left unspecialized: every query is expected to detect every relation, which makes it difficult for a query to specialize in specific relations. Second, queries are insufficiently trained: because a GT is assigned to only a single prediction, near-correct or even correct predictions are suppressed by being assigned "no relation" as their GT. To address these issues, we propose Groupwise Query Specialization and Quality-Aware Multi-Assignment (SpeaQ). Groupwise Query Specialization trains specialized queries by dividing queries and relations into disjoint groups and directing a query in a specific query group solely toward relations in the corresponding relation group. Quality-Aware Multi-Assignment further facilitates training by assigning a GT to multiple predictions that are significantly close to the GT in terms of the subject, the object, and the relation between them. Experimental results and analyses show that SpeaQ effectively trains specialized queries, which better utilize the capacity of a model, resulting in consistent performance gains with zero additional inference cost across multiple VRD models and benchmarks. Code is available at https://github.com/mlvlab/SpeaQ.
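
The group-restricted assignment described in the abstract can be illustrated with a small sketch. This is not the authors' implementation (see the linked repository for that); it only shows the idea of allowing a GT to match queries from its own group. The contiguous equal-size query grouping, the `BIG` penalty, the helper names, and the toy cost matrix are all assumptions made for illustration, and a brute-force search stands in for the Hungarian matching that set-prediction models typically use.

```python
from itertools import permutations

BIG = 1e6  # assumed penalty that effectively forbids a cross-group match


def query_group(query_idx, num_queries, num_groups):
    """Hypothetical grouping rule: contiguous, equal-size query groups."""
    return query_idx * num_groups // num_queries


def mask_cost(cost, gt_groups, num_groups):
    """Penalize every (query, GT) pair whose groups differ.

    cost[q][g] is the matching cost between query q and ground-truth g;
    gt_groups[g] is the relation group that ground-truth g belongs to.
    """
    num_queries = len(cost)
    masked = []
    for q, row in enumerate(cost):
        qg = query_group(q, num_queries, num_groups)
        masked.append([c if gt_groups[g] == qg else c + BIG
                       for g, c in enumerate(row)])
    return masked


def one_to_one_match(cost):
    """Brute-force minimum-cost one-to-one matching.

    A stand-in for Hungarian matching, only practical for tiny inputs.
    Requires at least as many queries as ground-truths.
    Returns {gt_index: query_index}.
    """
    num_queries, num_gts = len(cost), len(cost[0])
    best = min(permutations(range(num_queries), num_gts),
               key=lambda qs: sum(cost[q][g] for g, q in enumerate(qs)))
    return dict(enumerate(best))


# Toy example: 4 queries in 2 groups (queries 0-1 and queries 2-3), 2 GTs.
cost = [[0.1, 0.9],
        [0.2, 0.8],
        [0.5, 0.3],
        [0.6, 0.2]]
gt_groups = [1, 0]  # GT 0 belongs to group 1, GT 1 to group 0

conventional = one_to_one_match(cost)
specialized = one_to_one_match(mask_cost(cost, gt_groups, num_groups=2))
```

In this toy case the conventional matching picks the globally cheapest queries regardless of group, while the masked matching sends each GT to a query in its own group even when that query currently has a higher cost. That restriction is what pushes a query group toward its designated relations during training.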

Paper Structure (23 sections, 12 equations, 8 figures, 12 tables)

Introduction
Related Works
Transformers for Visual Relationship Detection
Effective training of VRD models
Method
Preliminary
Groupwise Query Specialization
Frequency-based predicate and query grouping.
Quality-Aware Multi-Assignment
Experiments
Datasets
Experimental Results
Results on Visual Genome.
Analysis
Conclusion
...and 8 more sections

Figures (8)

Figure 1: Overview of the proposed SpeaQ. SpeaQ consists of two key components: Groupwise Query Specialization and Quality-Aware Multi-Assignment. Groupwise Query Specialization (Sec. \ref{sec:sec_query_grouping}) divides predicates and queries into disjoint predicate groups and query groups and assigns a GT in a specific predicate group only to a query in the corresponding query group, therefore designating a specialized role to a query. Quality-Aware Multi-Assignment (Sec. \ref{sec:sec_qg_dla}) adaptively assigns a GT to a different number of predictions considering overall prediction quality on a subject, object, and predicate to provide richer training supervision to predictions that are close to a GT.
Figure 2: Prediction frequency per group. The closer to the GT, the better.
Figure 3: mR@100 per group. The higher, the better.
Figure 5: Qualitative results on Visual Genome dataset. Predictions of the baseline and the model trained with SpeaQ are visualized along with corresponding ground-truths. Correct and wrong prediction results are marked green and red, respectively.
Figure 6: Qualitative examples of various assignment strategies. Bounding boxes and labels of two ground-truths (GT 1, 2) and five prediction results (1-5) are illustrated. Note that a prediction label is only specified in case it differs from the most relevant GT. Ideal assignment results are colored green, while wrong assignment results are colored red.
...and 3 more figures
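
The Quality-Aware Multi-Assignment summarized in the Figure 1 caption can be sketched as follows. This is a simplified illustration, not the paper's formulation: the product-based `quality` score, the `threshold` and `max_per_gt` parameters, and the toy scores are assumptions. The sketch only shows how a GT can end up assigned to a varying number of predictions depending on their quality on subject, object, and predicate.

```python
def quality(sub_score, obj_score, pred_score):
    """Assumed quality measure: a product, so that a prediction must be
    good on the subject, the object, and the predicate at the same time."""
    return sub_score * obj_score * pred_score


def multi_assign(scores, threshold=0.5, max_per_gt=3):
    """Assign each GT to every sufficiently good prediction.

    scores[g][p] is a (subject, object, predicate) triple of scores in
    [0, 1] describing how well prediction p agrees with ground-truth g.
    Returns {gt_index: [prediction indices, best first]}.
    """
    assignment = {}
    for g, row in enumerate(scores):
        # Rank predictions for this GT by quality, best first.
        ranked = sorted(((quality(*s), p) for p, s in enumerate(row)),
                        reverse=True)
        # Keep at most max_per_gt predictions that clear the threshold.
        assignment[g] = [p for q, p in ranked[:max_per_gt] if q >= threshold]
    return assignment


# Toy example: 2 GTs, 4 predictions.
scores = [
    [(0.9, 0.9, 0.9), (0.8, 0.9, 0.8), (0.9, 0.2, 0.9), (0.7, 0.7, 0.7)],
    [(0.2, 0.3, 0.9), (0.1, 0.9, 0.9), (0.95, 0.95, 0.9), (0.4, 0.4, 0.4)],
]
assigned = multi_assign(scores)
```

Here the first GT receives two predictions and the second receives one, so near-correct predictions are trained toward a GT instead of being pushed to "no relation". A prediction with a poorly localized object (the third prediction for the first GT) is excluded even though its subject and predicate scores are high.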
