Segment-to-Act: Label-Noise-Robust Action-Prompted Video Segmentation Towards Embodied Intelligence

Wenxin Li, Kunyu Peng, Di Wen, Ruiping Liu, Mengfei Duan, Kai Luo, Kailun Yang

TL;DR

Comparative analysis reveals that different learning strategies exhibit distinct robustness profiles, governed by a foreground-background trade-off where some achieve balanced performance while others prioritize foreground accuracy at the cost of background precision.

Abstract

Embodied intelligence relies on accurately segmenting objects actively involved in interactions. Action-based video object segmentation addresses this by linking segmentation with action semantics, but it depends on large-scale annotations and prompts that are costly, inconsistent, and prone to multimodal noise such as imprecise masks and referential ambiguity. To date, this challenge remains unexplored. In this work, we take the first step by studying action-based video object segmentation under label noise, focusing on two sources: textual prompt noise (category flips and within-category noun substitutions) and mask annotation noise (perturbed object boundaries to mimic imprecise supervision). Our contributions are threefold. First, we introduce two types of label noise for the action-based video object segmentation task. Second, we build ActiSeg-NL, the first benchmark for action-based video object segmentation under label noise, adapt six label-noise learning strategies to this setting, and establish protocols for evaluating them under textual, boundary, and mixed noise. Third, we provide a comprehensive analysis linking noise types to failure modes and robustness gains, and we introduce a Parallel Mask Head Mechanism (PMHM) to address mask annotation noise. Qualitative evaluations further reveal characteristic failure modes, including boundary leakage and mislocalization under boundary perturbations, as well as occasional identity substitutions under textual flips. Our comparative analysis reveals that different learning strategies exhibit distinct robustness profiles, governed by a foreground-background trade-off where some achieve balanced performance while others prioritize foreground accuracy at the cost of background precision. The established benchmark and source code will be made publicly available at https://github.com/mylwx/ActiSeg-NL.
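
The two kinds of textual prompt noise named in the abstract (category flips and within-category noun substitutions) can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration: the tiny `TAXONOMY`, the function names `corrupt_name` and `add_prompt_noise`, and the choice to corrupt an exact fraction of prompts with uniform sampling. The benchmark's actual vocabulary and sampling procedure are not specified in this summary.

```python
import random

# Hypothetical two-level taxonomy (category -> nouns); not the benchmark's vocabulary.
TAXONOMY = {
    "fruit": ["banana", "apple", "pear"],
    "utensil": ["knife", "spoon", "fork"],
}
CATEGORY_OF = {noun: cat for cat, nouns in TAXONOMY.items() for noun in nouns}


def corrupt_name(name, mode, rng):
    """Return a wrong object name for `name`.

    mode="substitute": another noun from the same category.
    mode="flip": a noun from a different category.
    """
    cat = CATEGORY_OF[name]
    if mode == "substitute":
        pool = [n for n in TAXONOMY[cat] if n != name]
    elif mode == "flip":
        pool = [n for c, nouns in TAXONOMY.items() if c != cat for n in nouns]
    else:
        raise ValueError(f"unknown mode: {mode}")
    return rng.choice(pool)


def add_prompt_noise(names, rate, mode, seed=0):
    """Corrupt round(rate * len(names)) prompts, chosen uniformly at random."""
    rng = random.Random(seed)
    n_noisy = round(rate * len(names))
    noisy_idx = set(rng.sample(range(len(names)), n_noisy))
    return [
        corrupt_name(name, mode, rng) if i in noisy_idx else name
        for i, name in enumerate(names)
    ]


clean_names = ["banana", "knife", "apple", "spoon", "pear"]
flipped = add_prompt_noise(clean_names, rate=0.4, mode="flip")
substituted = add_prompt_noise(clean_names, rate=0.6, mode="substitute", seed=1)
```

With five prompts, a rate of 0.4 corrupts exactly two of them and a rate of 0.6 corrupts three; which prompts change depends on the seed.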

Paper Structure

This paper contains 19 sections, 7 equations, 5 figures, 4 tables, 2 algorithms.

Figures (5)

  • Figure 1: Training pipeline of action-based video object segmentation (ActionVOS) with one clean reference (a) and three controlled noise scenarios defined by ActiSeg-NL, including text prompt noise (b), mask annotation noise (c), and their mixed condition (d). These scenarios approximate perception disturbances observed in egocentric recordings and in robot manipulation, and they test whether segmentation remains stable enough to support downstream action. We then apply various adapted noise-robust strategies to each scenario and evaluate how effectively they mitigate the noise.
  • Figure 2: Comparison of category distributions before and after applying text prompt noise at $20\%$, $40\%$, and $60\%$ noise rates for major categories (proportion ${>}1\%$) and Others.
  • Figure 3: Mask annotation noise generation and severity statistics in ActiSeg-NL. (a) Generation pipeline: clean instance masks are separated, dilated with different kernel sizes, and recombined to produce noisy annotations. (b) Summary statistics: larger kernels lead to lower mIoU and cIoU, quantifying the decline in annotation quality.
  • Figure 4: Overview of the noisy action-based video object segmentation framework and robustness strategies. Left: the framework consumes video frames, noisy object names (e.g., "banana"), and action narrations (e.g., "cut piece pepper") to predict a pixel-level mask. Middle: four complementary strategies: (a) Co-teaching, where two networks exchange small-loss samples to suppress label noise; (b) the noise-robust losses GCE, SCE, and APL, which balance accuracy and robustness; (c) ELR, which mitigates overfitting to noisy annotations with an EMA-based regularizer; and (d) NPN, which integrates candidate-set reasoning with negative learning and with consistency across weak and strong views for pixel supervision. Right: PMHM architecture. During training, a lightweight auxiliary head runs in parallel with the main head to achieve prediction consistency on uncertain pixels, using symmetric KL divergence across heads and decoder layers.
  • Figure 5: Mask annotation noise qualitative results on ActiSeg-NL. Larger kernels thicken boundaries, coarsen edges, and introduce redundant regions, revealing trade-offs between foreground coverage and background precision.
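
The mask annotation noise pipeline described in the Figure 3 caption (separate instance masks, dilate each, recombine) can be sketched in plain Python. This is an illustrative sketch under stated assumptions, not the benchmark's implementation: the square structuring element, the toy 8x8 grid, and the helper names `square`, `dilate`, `union`, `iou`, and `make_noisy_annotation` are all hypothetical. A single per-image IoU stands in for the mIoU and cIoU statistics mentioned in the caption.

```python
def square(h, w, top, left, size):
    """Toy instance mask: a filled size x size square on an h x w grid."""
    return [
        [int(top <= y < top + size and left <= x < left + size) for x in range(w)]
        for y in range(h)
    ]


def dilate(mask, k):
    """Binary dilation of a 2D 0/1 mask with a k x k square kernel (k odd)."""
    h, w = len(mask), len(mask[0])
    r = k // 2
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            if mask[y][x]:
                # Paint the kernel footprint around each foreground pixel.
                for yy in range(max(0, y - r), min(h, y + r + 1)):
                    for xx in range(max(0, x - r), min(w, x + r + 1)):
                        out[yy][xx] = 1
    return out


def union(masks):
    """Recombine per-instance masks into a single foreground mask."""
    h, w = len(masks[0]), len(masks[0][0])
    return [[int(any(m[y][x] for m in masks)) for x in range(w)] for y in range(h)]


def iou(a, b):
    """Intersection over union of two 0/1 masks of equal shape."""
    inter = sum(p and q for ra, rb in zip(a, b) for p, q in zip(ra, rb))
    uni = sum(p or q for ra, rb in zip(a, b) for p, q in zip(ra, rb))
    return inter / uni if uni else 1.0


def make_noisy_annotation(instance_masks, k):
    """Dilate each instance separately, then recombine."""
    return union([dilate(m, k) for m in instance_masks])


# Two separate 2x2 instances on an 8x8 grid.
instances = [square(8, 8, 1, 1, 2), square(8, 8, 5, 5, 2)]
clean = union(instances)
for k in (1, 3, 5):
    print(k, round(iou(clean, make_noisy_annotation(instances, k)), 3))
```

In this toy case the IoU against the clean mask falls as the kernel grows, which mirrors the trend the caption reports for larger kernels.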