Human Action Recognition in Still Images Using ConViT

Seyed Rohollah Hosseyni; Sanaz Seyedin; Hasan Taheri

TL;DR

This work targets human action recognition (HAR) in still images, where temporal cues are absent and CNNs struggle to capture relations between image regions. It introduces ConViT, a CNN+ViT architecture composed of a ResNet50 backbone, two Vision Transformers that learn inter-region relationships, and an optional Faster R-CNN-based human branch to handle multi-person actions. Predictions from the two branches are fused via a weighted sum, $P_{final} = W_{ConViT} \times P_{ConViT} + W_{human} \times P_{human}$. The fused model achieves 95.5% mAP on Stanford40 and 91.5% mAP on PASCAL VOC 2012 Action, highlighting the benefit of modeling regional relationships and incorporating targeted human cues. The approach offers a way to bring ViT-based relational reasoning into traditional CNN pipelines for still-image HAR, reducing reliance on explicit pose annotations while improving robustness to distractors and background noise.
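
The weighted-sum fusion in the TL;DR can be illustrated with a minimal sketch. The function name `fuse_predictions`, the three-class score vectors, and the weights 0.6 and 0.4 are all hypothetical values chosen for illustration; the summary does not state the weights the authors use.

```python
def fuse_predictions(p_convit, p_human, w_convit, w_human):
    """Return P_final = W_ConViT * P_ConViT + W_human * P_human, per class."""
    if len(p_convit) != len(p_human):
        raise ValueError("both branches must score the same set of classes")
    return [w_convit * a + w_human * b for a, b in zip(p_convit, p_human)]


# Hypothetical per-class scores from the two branches (not from the paper).
p_convit = [0.70, 0.20, 0.10]
p_human = [0.40, 0.50, 0.10]

# Hypothetical weights; the summary does not give the actual values.
p_final = fuse_predictions(p_convit, p_human, w_convit=0.6, w_human=0.4)

# Index of the class with the highest fused score.
predicted_class = max(range(len(p_final)), key=p_final.__getitem__)
```

With these illustrative numbers, the ConViT branch dominates the fused score for the first class even though the human branch prefers the second, which is the kind of trade-off the two weights control.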

Abstract

Understanding the relationship between different parts of an image is crucial in a variety of applications, including object recognition, scene understanding, and image classification. Although Convolutional Neural Networks (CNNs) have demonstrated impressive results in classifying and detecting objects, they lack the capability to extract the relationship between different parts of an image, which is a crucial factor in Human Action Recognition (HAR). To address this problem, this paper proposes a new module, built on a Vision Transformer (ViT), that functions like a convolutional layer. In the proposed model, the ViT complements a CNN by helping it extract the relationships among the various parts of an image. Compared to a plain CNN, the proposed model is shown to extract the meaningful parts of an image and suppress the misleading ones. Evaluated on the Stanford40 and PASCAL VOC 2012 Action datasets, the proposed model achieves 95.5% and 91.5% mean Average Precision (mAP), respectively, which is promising compared to other state-of-the-art methods.
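
To make the "feature map in, feature map out" idea concrete, here is a minimal sketch of attention over the spatial cells of a feature map. It is an assumption-laden illustration, not the authors' Modified ViT: it uses a single head, no learned query/key/value projections, no positional encoding, and no MLP block, and the name `relate_regions` is hypothetical.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def relate_regions(feature_map):
    """Mix every spatial cell of an H x W x C map with all other cells.

    Simplifying assumption: each cell's channel vector serves directly as
    query, key, and value (no learned projections).
    """
    h, w = len(feature_map), len(feature_map[0])
    c = len(feature_map[0][0])
    # Flatten the grid into N = h * w tokens, one per spatial cell.
    tokens = [cell for row in feature_map for cell in row]
    scale = math.sqrt(c)
    out_tokens = []
    for q in tokens:
        # Scaled dot-product similarity between this cell and every cell.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / scale for k in tokens]
        weights = softmax(scores)
        # New cell = attention-weighted average of all cells.
        mixed = [sum(wt * v[d] for wt, v in zip(weights, tokens)) for d in range(c)]
        out_tokens.append(mixed)
    # Reshape the tokens back into an H x W x C feature map.
    return [out_tokens[r * w:(r + 1) * w] for r in range(h)]


# Tiny hypothetical 2 x 2 feature map with 2 channels.
fmap = [[[1.0, 0.0], [0.0, 1.0]],
        [[1.0, 0.0], [0.5, 0.5]]]
new_map = relate_regions(fmap)
```

Because the output has the same height, width, and channel count as the input, a block like this can sit between other layers the way a convolutional layer would, which is the property the abstract emphasizes.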

Paper Structure (16 sections, 3 equations, 9 figures, 3 tables)

Introduction
Related Works
Action Recognition
Attention
Approach
Overview
CNN
Modified ViT
Human Classification Branch
Final Prediction
Experiments
Datasets
Training Details
Comparison with Existing Methods
Ablation Study and Visualization
...and 1 more sections

Figures (9)

Figure 1: Inter-class similarity
Figure 2: Intra-class difference
Figure 3: Importance of the relations between different areas of the image
Figure 4: The Modified ViT takes a feature map as input and produces a new feature map as output, which contains information about the relationships between different regions of the image.
Figure 5: ConViT model architecture. Our proposed ConViT model consists of two parts. First, the input image is passed through a CNN, which generates a feature map capturing high-level spatial features. Second, this feature map is processed by two Vision Transformers (ViTs) that are capable of learning semantic relationships between different regions of the image.
...and 4 more figures
