Multiscale Vision Transformers meet Bipartite Matching for efficient single-stage Action Localization

Ioanna Ntinou; Enrique Sanchez; Georgios Tzimiropoulos

TL;DR

The paper tackles end-to-end action localization without external detectors or DETR-style decoders. It introduces BMViT, an encoder-only Vision Transformer that treats each spatio-temporal output token as a candidate prediction and assigns tokens to ground-truth instances with a bipartite (Hungarian) matching loss. Lightweight heads then produce bounding boxes, actor presence, and multi-label actions. On AVA 2.2, BMViT matches or surpasses two-stage MViT methods at lower complexity, and ablations show the importance of token selection and fixed aspect-ratio inputs. The approach generalizes to other backbones and datasets, offering a practical, scalable path toward real-time, single-stage action localization without a heavy decoder architecture.

Abstract

Action Localization is a challenging problem that combines detection and recognition tasks, which are often addressed separately. State-of-the-art methods rely on off-the-shelf bounding box detections pre-computed at high resolution, and propose transformer models that focus on the classification task alone. Such two-stage solutions are prohibitive for real-time deployment. On the other hand, single-stage methods target both tasks by devoting part of the network (generally the backbone) to sharing the majority of the workload, compromising performance for speed. These methods build on adding a DETR head with learnable queries that, after cross- and self-attention, are sent to corresponding MLPs for detecting a person's bounding box and action. However, DETR-like architectures are challenging to train and can incur high complexity. In this paper, we observe that a straight bipartite matching loss can be applied to the output tokens of a vision transformer. This results in a backbone + MLP architecture that can do both tasks without the need for an extra encoder-decoder head and learnable queries. We show that a single MViTv2-S architecture trained with bipartite matching to perform both tasks surpasses the same MViTv2-S when trained with RoI align on pre-computed bounding boxes. With a careful design of token pooling and the proposed training pipeline, our Bipartite-Matching Vision Transformer model, BMViT, achieves +3 mAP on AVA2.2 with respect to the two-stage MViTv2-S counterpart. Code is available at https://github.com/IoannaNti/BMViT
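To make the central observation concrete, the following is a minimal sketch of matching per-token predictions to ground-truth boxes. It is not the authors' implementation: the cost (L1 box distance minus actor probability), the function names, and the toy numbers are all assumptions made for illustration, and a brute-force search over assignments stands in for the Hungarian algorithm so that the example needs only the standard library.

```python
from itertools import permutations


def l1_box_cost(pred_box, gt_box):
    """L1 distance between two boxes given as [x1, y1, x2, y2]."""
    return sum(abs(p - g) for p, g in zip(pred_box, gt_box))


def match_tokens(pred_boxes, pred_actor_prob, gt_boxes):
    """Return a tuple whose i-th entry is the token matched to ground-truth box i.

    Hypothetical cost: box L1 distance minus actor probability.
    Brute force is used here instead of the Hungarian algorithm; it finds
    the same minimum-cost one-to-one assignment but only scales to toy sizes.
    """
    n_tokens = len(pred_boxes)
    best_cost, best_assign = float("inf"), ()
    for assign in permutations(range(n_tokens), len(gt_boxes)):
        cost = sum(
            l1_box_cost(pred_boxes[tok], gt_boxes[i]) - pred_actor_prob[tok]
            for i, tok in enumerate(assign)
        )
        if cost < best_cost:
            best_cost, best_assign = cost, assign
    return best_assign


# Four output tokens, each predicting one box and one actor probability.
pred_boxes = [
    [0.10, 0.10, 0.40, 0.80],
    [0.55, 0.20, 0.90, 0.90],
    [0.00, 0.00, 1.00, 1.00],
    [0.12, 0.12, 0.42, 0.78],
]
pred_actor_prob = [0.9, 0.8, 0.1, 0.3]

# Two annotated actors.
gt_boxes = [[0.50, 0.20, 0.90, 0.95], [0.10, 0.10, 0.40, 0.80]]

matched = match_tokens(pred_boxes, pred_actor_prob, gt_boxes)
# Tokens left unmatched would be trained toward the no-class label.
unmatched = [t for t in range(len(pred_boxes)) if t not in matched]
print(matched, unmatched)
```

In this toy case token 1 is assigned to the first annotated actor and token 0 to the second, while tokens 2 and 3 are left unmatched. Token 3 predicts nearly the same box as token 0 but loses the one-to-one assignment, which is the property that lets a matching loss suppress duplicate detections without learnable queries.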

Paper Structure (25 sections, 3 equations, 6 figures, 4 tables)

Introduction
Related Work
Method
Preliminaries
Vision Transformers for Action Localization
DETR for Action Localization
Our solution: Bipartite-Matching ViT
Training
Inference
Remarks
Experimental Setup
Ablation studies
vs MViT + ROI align
Token Selection
Fixed vs Variable Aspect Ratio
...and 10 more sections

Figures (6)

Figure 1: Comparison between existing works and our proposed approach. a) Traditional two-stage methods focus on developing strong vision transformers and apply them to Action Localization by outsourcing the bounding box detections to an external detector. ROI Align is applied to the output of the transformer using the detected bounding boxes, and the pooled features are forwarded to an MLP that returns the class predictions. b) Recent approaches in one-stage Action Localization leverage DETR's capacity to model both the bounding boxes and the action classes. A video backbone produces strong spatio-temporal features that are handled by a DETR transformer encoder. A set of learnable queries is then used by a DETR transformer decoder to produce the final outputs. c) Our method uses only a vision transformer, trained with a bipartite matching loss between the individual predictions given by the output spatio-temporal tokens and the ground-truth bounding boxes and classes. Our method needs neither learnable queries nor a DETR decoder, and can combine the backbone and the DETR encoder into a single architecture.
Figure 2: The output spatio-temporal tokens are fed to 3 parallel heads. We use the central tokens to predict the bounding box and the actor likelihood while averaging the output tokens over the temporal axis to generate the action tokens. Each head comprises a small MLP that generates the output triplets. We depict the flow diagram for each head, following the standard OWL-ViT head [minderer2022simple].
Figure 3: Qualitative analysis. The images on the left show the confidence maps produced by the output $16\times 16$ spatial tokens (rescaled to the image size) w.r.t. the actor likelihood for the corresponding bounding box. For the sake of clarity, we only plot the $256$ tokens corresponding to one of the frames. The highlighted tokens are those selected as positive detections. The images on the right overlay the bounding boxes returned by each of the $16\times 16$ output tokens shown on the left. Bounding boxes corresponding to the confident tokens on the left are drawn in yellow. All other bounding boxes (in red) are assigned to the no-class label $\varnothing$, and are thus considered as negative predictions.
Figure 4: (MViTv2-S instance) The network architecture resembles that of MViTv2-S [li2021improved], with the pooling layer after scale$_4$ removed. The output features are projected to $512$ dimensions and forwarded to three parallel heads that predict for each token the bounding box coordinates, the probability of the bounding box being an actor, and the class predictions. (ViT-B instance) The network architecture resembles that of ViT-B [li2021improved]. The output tokens corresponding to $t=\lfloor T/2 \rfloor$ are forwarded to two parallel heads that predict for each token the bounding box coordinates and the probability of the bounding box being an actor. For the class prediction, we apply cross-attention between all output tokens of shape $8 \times 18 \times 18$ and the ones corresponding to the central frame. The attended tokens are then passed through an MLP for class predictions.
Figure 5: Per-category AP for our single-stage action detection method (30.0 mAP) and MViTv2-S (27.0 mAP) on AVA v.2. The per-class difference is shown on top of each bar: categories where our method increases accuracy are marked in green, and those where it decreases accuracy in red.
...and 1 more figures
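The token routing described in the Figure 2 caption and the token selection shown in Figure 3 can be sketched as follows. This is an illustrative reading of the captions, not the released code: "central tokens" is assumed to mean the tokens of the central frame $t=\lfloor T/2 \rfloor$, positive detections are assumed to be picked by thresholding the actor likelihood, and the function names and the `0.5` threshold are hypothetical.

```python
def central_frame_tokens(tokens):
    """tokens is a nested list of shape [T][N][D]; return the N central-frame tokens.

    Assumption: these feed the bounding-box and actor-likelihood heads.
    """
    return tokens[len(tokens) // 2]


def temporal_average_tokens(tokens):
    """Average tokens over the temporal axis, giving N action tokens of size D."""
    n_frames = len(tokens)
    n_tokens = len(tokens[0])
    dim = len(tokens[0][0])
    return [
        [sum(tokens[t][n][d] for t in range(n_frames)) / n_frames for d in range(dim)]
        for n in range(n_tokens)
    ]


def select_detections(actor_probs, threshold=0.5):
    """Return indices of tokens kept as positive detections.

    The threshold is a hypothetical value; all other tokens are treated
    as the no-class label.
    """
    return [i for i, p in enumerate(actor_probs) if p >= threshold]


# Toy backbone output: T=3 frames, N=2 spatial tokens, D=2 channels.
tokens = [
    [[1.0, 0.0], [0.0, 2.0]],
    [[3.0, 0.0], [0.0, 4.0]],
    [[5.0, 3.0], [3.0, 6.0]],
]
box_actor_in = central_frame_tokens(tokens)
action_in = temporal_average_tokens(tokens)

# Toy actor likelihoods for a flattened 16 x 16 token grid (256 tokens).
actor_probs = [0.05] * 256
actor_probs[34] = 0.92
actor_probs[200] = 0.71
kept = select_detections(actor_probs)
# Convert flat indices back to (row, column) positions on the 16 x 16 grid.
positions = [divmod(i, 16) for i in kept]
print(box_actor_in, action_in, kept, positions)
```

Because every spatial token emits its own triplet, inference in this sketch reduces to a filter over the token grid, with no decoder pass or query set involved.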
