Tracking and Segmenting Anything in Any Modality

Tianlu Zhang, Qiang Zhang, Guiguang Ding, Jungong Han

TL;DR

SATA presents a universal framework for tracking and segmentation that operates across arbitrary input modalities and supports multiple subtasks within a single shared model. It introduces a Decoupled Mixture-of-Expert (DeMoE) to separately model cross-modal shared knowledge and modality-specific clues, and a Task-aware MOT (TaMOT) pipeline to unify all task outputs as a consistent set of instance IDs. The approach combines CpMoE and SaMoE with cross-modal complementary and orthogonal losses, along with a Candidates Generation Module and a Memory-enhanced Module to handle spatiotemporal reasoning, enabling robust multi-task, multi-modal performance. Evaluations on 18 challenging benchmarks across 4 modalities and 4 subtasks demonstrate state-of-the-art results and strong generalization, highlighting SATA’s potential as a foundation for generalizable video understanding.

Abstract

Tracking and segmentation play essential roles in video understanding, providing basic positional information and temporal association of objects within video sequences. Despite their shared objective, existing approaches often tackle these tasks using specialized architectures or modality-specific parameters, limiting their generalization and scalability. Recent efforts have attempted to unify multiple tracking and segmentation subtasks from the perspectives of any modality input or multi-task inference. However, these approaches tend to overlook two critical challenges: the distributional gap across different modalities and the feature representation gap across tasks. These issues hinder effective cross-task and cross-modal knowledge sharing, ultimately constraining the development of a true generalist model. To address these limitations, we propose a universal tracking and segmentation framework named SATA, which unifies a broad spectrum of tracking and segmentation subtasks with any modality input. Specifically, a Decoupled Mixture-of-Expert (DeMoE) mechanism is presented to decouple the unified representation learning task into the modeling process of cross-modal shared knowledge and specific information, thus enabling the model to maintain flexibility while enhancing generalization. Additionally, we introduce a Task-aware Multi-object Tracking (TaMOT) pipeline to unify all the task outputs as a unified set of instances with calibrated ID information, thereby alleviating the degradation of task-specific knowledge during multi-task training. SATA demonstrates superior performance on 18 challenging tracking and segmentation benchmarks, offering a novel perspective for more generalizable video understanding.
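The decoupling idea behind DeMoE can be illustrated with a toy example. The sketch below is not the authors' implementation: it only assumes that "shared knowledge" is produced by a gated mixture of experts used for every modality, that "specific information" comes from one expert per modality, and that an orthogonality penalty discourages the two branches from encoding the same direction. The class `DecoupledMoESketch`, the function `orthogonal_penalty`, the squared-cosine form of the penalty, and the modality names `"rgb"` and `"thermal"` are all hypothetical choices made for illustration.

```python
import math
import random


def linear(weights, x):
    """Multiply a weight matrix (list of rows) by the vector x."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def random_matrix(rng, rows, cols):
    """Random weights stand in for learned parameters in this sketch."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


class DecoupledMoESketch:
    """Toy layer: gated shared experts plus one expert per modality (assumed design)."""

    def __init__(self, dim, num_shared, modalities, seed=0):
        rng = random.Random(seed)
        # Experts and gate shared by every modality.
        self.shared_experts = [random_matrix(rng, dim, dim) for _ in range(num_shared)]
        self.gate = random_matrix(rng, num_shared, dim)
        # One modality-specific expert per modality name (placeholder names).
        self.specific_experts = {m: random_matrix(rng, dim, dim) for m in modalities}

    def forward(self, x, modality):
        # Shared branch: softmax-gated sum of the shared expert outputs.
        gate_weights = softmax(linear(self.gate, x))
        expert_outs = [linear(w, x) for w in self.shared_experts]
        shared = [
            sum(g * out[i] for g, out in zip(gate_weights, expert_outs))
            for i in range(len(x))
        ]
        # Specific branch: the expert that belongs to this modality only.
        specific = linear(self.specific_experts[modality], x)
        return shared, specific


def orthogonal_penalty(shared, specific):
    """Squared cosine similarity: 0 when the branches are orthogonal, 1 when parallel."""
    dot = sum(a * b for a, b in zip(shared, specific))
    norm = math.sqrt(sum(a * a for a in shared)) * math.sqrt(sum(b * b for b in specific))
    if norm == 0.0:
        return 0.0
    return (dot / norm) ** 2


layer = DecoupledMoESketch(dim=4, num_shared=3, modalities=["rgb", "thermal"])
shared_feat, specific_feat = layer.forward([0.5, -1.0, 0.25, 2.0], "thermal")
penalty = orthogonal_penalty(shared_feat, specific_feat)
```

In a trained model the penalty would be added to the task loss so that gradients push the shared and modality-specific features apart; here the weights are random, so `penalty` simply lands somewhere between 0 and 1.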

Paper Structure

This paper contains 29 sections, 18 equations, 10 figures, 9 tables.

Figures (10)

  • Figure 1: Illustration of existing tracking and segmentation paradigms. (a) Task- and modality-specific paradigm. (b) Unified task paradigm. (c) Unified modality paradigm. (d) Unified task and modality paradigm obtained by combining existing unified task methods and unified modality models.
  • Figure 2: Analysis of the data distribution gap and comparison of the proposed SATA with existing strategies. (a) Statistical overview of the data distribution gap during model training. (b) Overview of the proposed SATA framework. (c) SATA vs. existing methods on 11 challenging benchmarks. Here, SU-Unicorn denotes the combination of SUTrack and Unicorn, and Flex-UNINEXT denotes the combination of FlexTrack and UNINEXT.
  • Figure 3: Overall architecture of SATA, which consists of two core components: the Decoupled Mixture-of-Expert mechanism and the Task-aware MOT pipeline.
  • Figure 4: Architecture of CpMoE and SaMoE. (a) CpMoE. (b) SaMoE.
  • Figure 5: Illustration of the fine-grained instance embeddings and spatiotemporal relationship modeling. (a) Fine-grained instance embeddings. (b) Spatiotemporal relationship modeling.
  • ...and 5 more figures