Temporal2Seq: A Unified Framework for Temporal Video Understanding Tasks

Min Yang; Zichen Zhang; Limin Wang

TL;DR

Temporal2Seq addresses fragmentation in temporal video understanding by introducing a unified token-based sequence-to-sequence framework that handles TAD, TAS, and GEBD via task prompts. It encodes video features, uses a common token vocabulary of length $H$, and generates outputs through an autoregressive encoder–decoder, with per-task vocabulary constraints during inference. The approach is trained with two joint-training schemes and a data-balance strategy, achieving cross-task gains and better generalization on unseen datasets compared to single-task baselines. This work demonstrates the potential of a generalist temporal video model and outlines avenues for improving long-context modeling and cross-task data balancing in future research.
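As a rough illustration of the token interface summarized above, the sketch below quantizes action boundaries into location tokens drawn from a shared vocabulary, appends category tokens, and restricts decoding to a per-task set of allowed token ids. The vocabulary layout, the bin and class counts, and all function names are assumptions made for illustration and are not taken from the paper. Only a TAD-style (start, end, class) layout is shown.

```python
# Hypothetical vocabulary layout (an assumption, not the paper's):
# ids [0, NUM_BINS) are location tokens, followed by category tokens.
NUM_BINS = 100
NUM_CLASSES = 20
VOCAB_SIZE = NUM_BINS + NUM_CLASSES  # plays the role of H in this sketch


def quantize(t, duration):
    """Map a timestamp in [0, duration] to a location token id."""
    frac = min(max(t / duration, 0.0), 1.0)
    return min(int(frac * NUM_BINS), NUM_BINS - 1)


def dequantize(token, duration):
    """Map a location token back to the centre of its time bin."""
    return (token + 0.5) / NUM_BINS * duration


def tokenize_segments(segments, duration):
    """Flatten (start, end, class_id) triples into one token sequence."""
    tokens = []
    for start, end, cls in segments:
        tokens += [quantize(start, duration), quantize(end, duration), NUM_BINS + cls]
    return tokens


def detokenize_segments(tokens, duration):
    """Invert tokenize_segments, up to quantization error."""
    out = []
    for i in range(0, len(tokens) - 2, 3):
        s, e, c = tokens[i:i + 3]
        out.append((dequantize(s, duration), dequantize(e, duration), c - NUM_BINS))
    return out


def constrained_argmax(logits, allowed):
    """Pick the highest-scoring token among the allowed ids only."""
    return max(allowed, key=lambda i: logits[i])
```

In this sketch a task's vocabulary constraint is simply the `allowed` id list passed to `constrained_argmax`; for example, a boundary-only task could pass `range(NUM_BINS)` so that category tokens are never emitted.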

Abstract

With the development of video understanding, tasks for clip-level temporal video analysis have proliferated, including temporal action detection (TAD), temporal action segmentation (TAS), and generic event boundary detection (GEBD). While task-specific video understanding models have exhibited outstanding performance on each task, there is still no unified framework capable of addressing multiple tasks simultaneously, which is a promising direction for the next generation of AI. To this end, we propose a single unified framework, coined Temporal2Seq, that formulates the output of these temporal video understanding tasks as a sequence of discrete tokens. With this unified token representation, Temporal2Seq can train a generalist model within a single architecture on different video understanding tasks. In the absence of multi-task learning (MTL) benchmarks, we compile a comprehensive co-training dataset by borrowing datasets from the TAD, TAS, and GEBD tasks. We evaluate our Temporal2Seq generalist model on the corresponding test sets of the three tasks, demonstrating that Temporal2Seq produces reasonable results on various tasks and shows advantages over single-task training within the same framework. We also investigate the generalization performance of our generalist model on new datasets from different tasks, where it yields superior performance to task-specific models.

Paper Structure (32 sections, 4 equations, 5 figures, 9 tables)

This paper contains 32 sections, 4 equations, 5 figures, 9 tables.

Introduction
Related Work
Temporal Action Detection.
Temporal Action Segmentation.
Generic Event Boundary Detection.
Multi-Task Learning.
Method
Overview
Unified Interface with Tokenization
Training
Two Ways of Joint Training
Data Balance Strategy During Training
Loss Functions
Inference
Experiments
...and 17 more sections

Figures (5)

Figure 1: The overview of Temporal2Seq. We input video sequences from different tasks and their corresponding task prompts $[TASK]$ into the model; the model produces task output tokens, which can be detokenized into the required task output for visualization.
Figure 2: The overall pipeline of Temporal2Seq. The input to our model is video features with temporal dimension $T$ extracted by the backbone and a sequence of $N$ discrete tokens translated from annotations. After frame-level positional encoding is added, the encoder maps the features into hidden representations. At training time, the decoder takes feature queries $M$ transformed from task annotations $A$ as input and predicts the output conditioned on the prompt start token, and a loss function is applied afterward. During inference, the decoder generates one token at a time conditioned on the preceding tokens, and this process is repeated until the model has produced all predictions. Due to space limitations, $H$ and $O$ are not shown here.
Figure 3: Construction of the token vocabulary. In the vocabulary, we allocate location tokens for action boundaries and category tokens for all three tasks. During inference, the model generates output tokens one by one, each corresponding to a position in the vocabulary.
Figure 4: Ways of mixing training datasets. (a) Data mixing creates a single dataset of mixed frame-target sequence pairs drawn from different tasks, which is then split into batches for each iteration. (b) Batch mixing samples batches of data from all tasks and then trains on the combined batches in each iteration.
Figure 5: More details of inference. Here we visualize the Temporal2Seq inference process for three tasks.
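
The two mixing schemes described in the Figure 4 caption can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the dataset contents, batch size, and function names are placeholders, and truncating to the smallest task in `batch_mixing` is an assumption made here for simplicity rather than the paper's data-balance strategy.

```python
import random


def data_mixing(task_datasets, batch_size, seed=0):
    """Pool samples from all tasks, shuffle, then cut into batches.

    Each resulting batch may contain samples from several tasks.
    """
    rng = random.Random(seed)
    pooled = [(task, x) for task, data in task_datasets.items() for x in data]
    rng.shuffle(pooled)
    return [pooled[i:i + batch_size] for i in range(0, len(pooled), batch_size)]


def batch_mixing(task_datasets, batch_size, seed=0):
    """Draw one batch per task and combine them for each iteration."""
    rng = random.Random(seed)
    shuffled = {}
    for task, data in task_datasets.items():
        items = list(data)
        rng.shuffle(items)
        shuffled[task] = items
    # Assumption: stop when the smallest task runs out of full batches.
    num_iters = min(len(v) for v in shuffled.values()) // batch_size
    iterations = []
    for it in range(num_iters):
        combined = []
        for task, items in shuffled.items():
            chunk = items[it * batch_size:(it + 1) * batch_size]
            combined += [(task, x) for x in chunk]
        iterations.append(combined)
    return iterations
```

Under data mixing the task composition of a batch varies from iteration to iteration, whereas under batch mixing every iteration sees the same number of samples from each task.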
