Region-aware Image-based Human Action Retrieval with Transformers

Hongsong Wang; Jianhua Zhao; Jie Gui

TL;DR

This work tackles image-based human action retrieval by learning a multi-level action representation from an anchored person, nearby contextual regions, and the global image. It introduces RIART, which uses a region-aware approach and a Fusion Transformer to model interactions among person, context, and overall scene, yielding robust retrieval and recognition performance. Key contributions include establishing benchmarks for image-based action retrieval, a novel ranking-based context region selection, and a Transformer-based fusion module with dedicated type and position embeddings. The method demonstrates state-of-the-art results on Stanford-40 and PASCAL VOC 2012 Action and provides comprehensive ablations and visualizations that underscore the complementary value of the three representation levels. This approach advances practical action understanding in static images and offers a principled pathway for open-set action retrieval in real-world applications.

Abstract

Human action understanding is a fundamental and challenging task in computer vision. Although this area has been studied extensively, most work focuses on action recognition, while action retrieval has received far less attention. In this paper, we focus on the neglected but important task of image-based action retrieval, which aims to find images that depict the same action as a query image. We establish benchmarks for this task and set up important baseline methods for fair comparison. We present an end-to-end model that learns rich action representations from three aspects: the anchored person, contextual regions, and the global image. A novel fusion transformer module is designed to model the relationships among these features and fuse them into a single action representation. Experiments on the Stanford-40 and PASCAL VOC 2012 Action datasets show that the proposed method significantly outperforms previous approaches for image-based action retrieval.
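The retrieval task described above can be illustrated with a minimal sketch: given an action representation for the query and for each gallery image, rank the gallery by similarity to the query. Cosine similarity, the function names, and the toy three-dimensional feature vectors are assumptions made for illustration only; the abstract does not state which similarity measure or feature dimension the paper uses.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def retrieve(query, gallery, top_k=5):
    """Return the top_k (image_id, score) pairs, most similar first.

    `gallery` maps a hypothetical image id to its action representation.
    """
    scored = [(cosine_similarity(query, feat), image_id)
              for image_id, feat in gallery.items()]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [(image_id, score) for score, image_id in scored[:top_k]]


# Toy gallery with made-up ids and made-up features (not real model outputs).
gallery = {
    "riding_bike_01": [0.9, 0.1, 0.0],
    "riding_bike_02": [0.8, 0.2, 0.1],
    "playing_guitar_01": [0.0, 0.1, 0.9],
}
query = [1.0, 0.0, 0.0]
ranked = retrieve(query, gallery, top_k=2)
```

In this toy setup the two images whose features point in roughly the same direction as the query are ranked ahead of the unrelated one. A real system would replace the hand-written vectors with representations produced by the trained model.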

Paper Structure (19 sections, 7 equations, 6 figures, 4 tables)

Introduction
Related Work
Image-Based Action Recognition
Human Action Retrieval
Instance-level Image Retrieval
Method
Problem Formulation
Representation of Anchored Person
Representations of Contextual Regions
Fusion Transformer
Training Strategy
Experiments
Datasets and Implementation Details
Experimental Setup
Results of Action Retrieval
...and 4 more sections

Figures (6)

Figure 1: The illustration of image-based human action retrieval. The red box indicates the anchored region of the person who performs the action.
Figure 2: The overall architecture of the proposed Region-aware Image-based human Action Retrieval with Transformer (RIART). The RIART takes an image and the corresponding anchored region of the human bounding box as input, generating a human action representation that can be utilized for action retrieval or recognition. To acquire contextual regions associated with the action, an off-the-shelf proposal generation module is employed to generate object proposals, which are then filtered and ranked by a contextual regions ranking module. Multi-level action representations are obtained, specifically, the representation of the anchored person $F_a$, representations of contextual regions $\mathbf{F_i}$, and the global image-level representation $F_g$. Subsequently, these representations are fused by a fusion transformer module to yield the final representation $F_f$.
Figure 3: Example query images for action retrieval. Human bounding boxes indicate the person performing a certain action. (a) Simple image that contains only one human action. (b) Complex image that contains multiple actions from different persons. Note that actions of different persons in the same image can be different.
Figure 4: The architecture of the Fusion Transformer module. The input $F_c$ denotes the multi-level action representations, i.e., the representation of the anchored person $F_a$, the representations of contextual regions $\mathbf{F_i}$, and the global image-level representation $F_g$. The Fusion Transformer aims to capture the complementary information among these diverse input representations and output a learned joint representation. To facilitate the learning of this complementary action representation, type embeddings and position embeddings are employed. Similar to the standard Transformer architecture, the Fusion Transformer module consists of $N$ blocks.
Figure 5: (a) Ablation studies of features. (b) Ablation studies of embeddings in the fusion transformer.
...and 1 more figure
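
The Figure 4 caption describes an input sequence built from $F_a$, the contextual-region features $\mathbf{F_i}$, and $F_g$, combined with type and position embeddings. A minimal sketch of how such a token sequence might be assembled is given below. Several things here are assumptions rather than details from the paper: that the embeddings are added element-wise to the features (as in standard Transformer inputs), the token order (person, context regions, global), the three type ids, and all names. The random tables stand in for embeddings that would be learned in practice.

```python
import random

# Hypothetical type ids for the three representation levels.
PERSON, CONTEXT, GLOBAL = 0, 1, 2


def make_table(rows, dim, rng):
    """Random stand-in for a learned embedding table of shape (rows, dim)."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(dim)] for _ in range(rows)]


def build_fusion_input(f_a, f_ctx, f_g, type_table, pos_table):
    """Assemble tokens as feature + type embedding + position embedding.

    f_a: anchored-person feature, f_ctx: list of contextual-region features,
    f_g: global image feature. All features share the same dimension.
    """
    features = [f_a] + list(f_ctx) + [f_g]
    type_ids = [PERSON] + [CONTEXT] * len(f_ctx) + [GLOBAL]
    tokens = []
    for pos, (feat, type_id) in enumerate(zip(features, type_ids)):
        tokens.append([x + te + pe for x, te, pe
                       in zip(feat, type_table[type_id], pos_table[pos])])
    return tokens, type_ids


rng = random.Random(0)
dim, num_ctx = 4, 3                       # toy sizes chosen for illustration
type_table = make_table(3, dim, rng)      # one row per representation type
pos_table = make_table(num_ctx + 2, dim, rng)  # one row per sequence position

f_a = [1.0] * dim
f_ctx = [[0.5] * dim for _ in range(num_ctx)]
f_g = [0.2] * dim
tokens, type_ids = build_fusion_input(f_a, f_ctx, f_g, type_table, pos_table)
```

The resulting `tokens` list would then be passed through the $N$ Transformer blocks mentioned in the caption; how the final representation $F_f$ is read out from the block outputs is not specified in the captions and is left out of this sketch.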
