DENOISER: Rethinking the Robustness for Open-Vocabulary Action Recognition

Haozhe Cheng, Cheng Ju, Haicheng Wang, Jinxiang Liu, Mengting Chen, Qiang Hu, Xiaoyun Zhang, Yanfeng Wang

TL;DR

Open-vocabulary action recognition (OVAR) faces practical challenges when user-provided class descriptions are noisy. The authors propose DENOISER, a two-part framework that jointly performs generative denoising of class texts and discriminative open-vocabulary labeling, connected through alternating optimization and guided by inter-modal and intra-modal cues. Through simulated multi-level text noise on standard video benchmarks, DENOISER demonstrates substantially improved robustness over baseline OVAR methods and spell-checkers, with ablations clarifying the contributions of each component. The approach offers practical impact for real-world video understanding where textual descriptions are imperfect, delivering more reliable cross-modal recognition under noisy supervision.

Abstract

As one of the fundamental video tasks in computer vision, Open-Vocabulary Action Recognition (OVAR) has recently gained increasing attention with the development of vision-language pre-training. To enable generalization to arbitrary classes, existing methods treat class labels as text descriptions, then formulate OVAR as evaluating embedding similarity between visual samples and textual classes. However, one crucial issue is completely ignored: the class descriptions given by users may be noisy, e.g., with misspellings and typos, limiting the real-world practicality of vanilla OVAR. To fill this research gap, this paper is the first to evaluate existing methods by simulating multi-level noise of various types, revealing their poor robustness. To tackle the noisy OVAR task, we further propose a novel DENOISER framework covering two parts: generation and discrimination. Concretely, the generative part denoises noisy class-text names via a decoding process, i.e., it proposes text candidates, then uses inter-modal and intra-modal information to vote for the best. In the discriminative part, we use vanilla OVAR models to assign visual samples to class-text names, thus obtaining more semantics. For optimization, we alternate between the generative and discriminative parts for progressive refinement. The denoised text classes help OVAR models classify visual samples more accurately; in return, the classified visual samples help improve denoising. On three datasets, we carry out extensive experiments to show our superior robustness, and thorough ablations to dissect the effectiveness of each component.
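
The alternation described in the abstract can be illustrated with a small toy sketch. This is not the authors' implementation: the letter-count `embed_text` encoder, the fixed `vocabulary`, the weighted-sum vote with `alpha`, and all function names are assumptions made purely for illustration, and whole class names are replaced in one step rather than decoded progressively.

```python
import math
from difflib import SequenceMatcher, get_close_matches


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def embed_text(text):
    """Toy stand-in for a text encoder (assumption): letter counts a-z."""
    vec = [0.0] * 26
    for ch in text.lower():
        if "a" <= ch <= "z":
            vec[ord(ch) - ord("a")] += 1.0
    return vec


def classify(video_embs, class_texts):
    """Discriminative step: assign each video to its most similar class text."""
    class_embs = [embed_text(t) for t in class_texts]
    return [
        max(range(len(class_embs)), key=lambda c: cosine(v, class_embs[c]))
        for v in video_embs
    ]


def denoise_step(class_texts, video_embs, assignments, vocabulary, k=3, alpha=0.5):
    """Generative step: propose k candidates per class, then vote for the best."""
    new_texts = []
    for c, text in enumerate(class_texts):
        # Proposal by spelling similarity (cutoff=0.0 keeps every vocabulary word eligible).
        candidates = get_close_matches(text, vocabulary, n=k, cutoff=0.0)
        # Only videos currently assigned to class c take part in the vote.
        members = [v for v, a in zip(video_embs, assignments) if a == c]

        def score(cand):
            emb = embed_text(cand)
            # Inter-modal vote: mean video-text similarity (assumed form).
            inter = sum(cosine(v, emb) for v in members) / len(members) if members else 0.0
            # Intra-modal vote: text-only spelling similarity (assumed form).
            intra = SequenceMatcher(None, text, cand).ratio()
            return alpha * inter + (1 - alpha) * intra

        new_texts.append(max(candidates, key=score))
    return new_texts


def denoiser_loop(noisy_texts, video_embs, vocabulary, iterations=3):
    """Alternate between the discriminative and generative steps."""
    texts = list(noisy_texts)
    for _ in range(iterations):
        assignments = classify(video_embs, texts)
        texts = denoise_step(texts, video_embs, assignments, vocabulary)
    return texts


# Toy data: "video" embeddings are simulated from clean class names (assumption).
vocabulary = ["archery", "bowling", "diving", "dining"]
video_embs = [embed_text("archery")] * 2 + [embed_text("diving")] * 2
denoised = denoiser_loop(["archxry", "dixing"], video_embs, vocabulary)
print(denoised)
```

In this toy setup, spelling similarity alone cannot separate "diving" from "dining" for the noisy label "dixing", since both candidates are equally close; the inter-modal term breaks the tie using the videos assigned to that class.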

Paper Structure

This paper contains 18 sections, 18 equations, 7 figures, 4 tables, 1 algorithm.

Figures (7)

  • Figure 1: Left: For open-vocabulary action recognition (OVAR), existing research neglects an essential aspect: the class descriptions provided by users may be noisy (e.g., misspellings and typos), resulting in potential classification errors and limiting real-world practicality. Right: Rethinking the robustness of popular methods [wang2021actionclip, zhou2023non]. On various datasets, these methods are highly sensitive to noise, and their performance degrades significantly as the noise level increases.
  • Figure 2: Framework Overview. DENOISER is composed of one generative part $\Psi_{\mathrm{gene}}$ and one discriminative part $\Psi_{\mathrm{disc}}$. $\Psi_{\mathrm{gene}}$ treats denoising of the text labels as a decoding process $\mathcal{T}_{i-1}\rightarrow\mathcal{T}_{i}$. We first propose text candidates $\Phi_{\mathrm{prop}}$ for $\mathcal{T}_{i-1}$ based on spelling similarity, then choose the best candidate by inter-modal weighting $\Phi_{\mathrm{inter}}$ and intra-modal weighting $\Phi_{\mathrm{intra}}$. $\Phi_{\mathrm{inter}}$ uses visual-textual information, while $\Phi_{\mathrm{intra}}$ relies solely on texts to vote. $\Psi_{\mathrm{disc}}$ assigns classes to visual samples. Only visual samples that match a class can then vote for its text candidates, making better use of the classes. We alternate between generative and discriminative steps to tackle noisy OVAR.
  • Figure 3: Ablation Study on Noise Type. We evaluate the robustness of our model on UCF101 with ActionCLIP as $\Phi_{\mathrm{OVAR}}$. "Mixed" means that all three types of perturbation ("Substitute", "Insert", "Delete") occur with equal probability. Our framework shows good resilience, especially against insertion and substitution noise.
  • Figure 4: Ablation Study for Proposal Number $\mathbf{K}$. We evaluate on UCF101 using ActionCLIP as $\Phi_{\mathrm{OVAR}}$. As the proposal number $K$ increases, Top-1 accuracy increases and gradually converges towards the upper bound, at the price of heavier computing costs.
  • Figure 5: Visualization of Denoising Process. Left: classification results with noisy text labels (crosses with black borders). Middle: text candidates (crosses without black borders) and the visual samples (dots) used to vote for candidates. Right: denoised class texts (crosses with black borders) help achieve better classification.
  • ...and 2 more figures
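
The three perturbation types named in the Figure 3 caption can be sketched as a small character-level noise injector. This is a minimal illustration under stated assumptions, not the paper's exact protocol: the function name `add_noise`, the per-character `noise_rate`, the lowercase replacement alphabet, and the choice to leave spaces untouched are all hypothetical.

```python
import random
import string

NOISE_TYPES = ("substitute", "insert", "delete")


def add_noise(text, noise_rate, noise_type="mixed", rng=None):
    """Perturb each non-space character with probability `noise_rate`.

    noise_type is one of "substitute", "insert", "delete", or "mixed";
    "mixed" picks one of the three operations with equal probability.
    """
    if noise_type != "mixed" and noise_type not in NOISE_TYPES:
        raise ValueError(f"unknown noise type: {noise_type!r}")
    if rng is None:
        rng = random.Random()
    out = []
    for ch in text:
        # Keep spaces and, with probability 1 - noise_rate, keep the character.
        if ch == " " or rng.random() >= noise_rate:
            out.append(ch)
            continue
        op = rng.choice(NOISE_TYPES) if noise_type == "mixed" else noise_type
        if op == "substitute":
            # Replace the character (may occasionally pick the same letter).
            out.append(rng.choice(string.ascii_lowercase))
        elif op == "insert":
            # Keep the character and add a random letter after it.
            out.append(ch)
            out.append(rng.choice(string.ascii_lowercase))
        # "delete": append nothing, so the character is dropped.
    return "".join(out)


# Example: perturb a class name at a 10% per-character rate with a fixed seed.
rng = random.Random(0)
print(add_noise("apply eye makeup", 0.1, "mixed", rng))
```

Passing a seeded `random.Random` keeps the perturbation reproducible, which is convenient when comparing methods at the same noise level.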