Cognition Transferring and Decoupling for Text-supervised Egocentric Semantic Segmentation

Zhaofeng Shi, Heqian Qiu, Lanxiao Wang, Fanman Meng, Qingbo Wu, Hongliang Li

TL;DR

This work introduces Text-supervised Egocentric Semantic Segmentation (TESS) and addresses the challenge of learning fine-grained segmentation from image-level text supervision in cluttered egocentric scenes. The proposed Cognition Transferring and Decoupling Network (CTDN) learns wearer-object relations (via L_rel), transfers cognitive knowledge from large-scale models through the Cognition Transferring Module (CTM) and cognition set, and explicitly decouples foreground/background representations with the Foreground-background Decoupling Module (FDM). A CAM refinement and segmentation pipeline generates high-quality pseudo masks, enabling strong weakly supervised segmentation performance on four egocentric benchmarks and competitive results on third-view data. The approach demonstrates that incorporating cognition transfer and foreground-background decoupling yields robust pseudo-masks and improved segmentation, highlighting the practical impact for scalable egocentric scene understanding. Future work includes temporal modeling and video-based backbones to address motion and state changes in egocentric footage.

Abstract

In this paper, we explore a novel Text-supervised Egocentric Semantic Segmentation (TESS) task that aims to assign pixel-level categories to egocentric images under weak supervision from texts derived from image-level labels. Egocentric scenes in this task contain dense wearer-object relations and inter-object interference. However, most recent third-view methods rely on the frozen Contrastive Language-Image Pre-training (CLIP) model, which is pre-trained on semantic-oriented third-view data and fails in the egocentric view because of the "relation insensitive" problem. Hence, we propose a Cognition Transferring and Decoupling Network (CTDN) that first learns the egocentric wearer-object relations by correlating the image and text. In addition, a Cognition Transferring Module (CTM) distills cognitive knowledge from a large-scale pre-trained model into our model so that it can recognize egocentric objects with various semantics. Based on the transferred cognition, the Foreground-background Decoupling Module (FDM) disentangles the visual representations to explicitly discriminate foreground from background regions, mitigating false activations caused by objects that interfere across foreground and background during egocentric relation learning. Extensive experiments on four TESS benchmarks demonstrate the effectiveness of our approach, which outperforms many recent related methods by a large margin. Code will be available at https://github.com/ZhaofengSHI/CTDN.
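
To make "correlating the image and text" under image-level supervision concrete, here is a minimal sketch of one plausible objective. It is not the authors' L_rel, whose exact form is not given in this summary. The function names, the cosine-similarity scoring, the `scale` factor, and the multi-label binary cross-entropy are all assumptions made for illustration.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def relation_loss(image_feat, text_feats, labels, scale=10.0):
    """Hypothetical image-text correlation loss (assumed, not the paper's L_rel).

    image_feat: global image embedding
    text_feats: one embedding per class text (e.g. "your right hand")
    labels:     1 if the class is present in the image, else 0
    """
    eps = 1e-12  # avoids log(0)
    total = 0.0
    for text_feat, y in zip(text_feats, labels):
        # Scaled similarity is treated as a per-class presence logit.
        p = sigmoid(scale * cosine(image_feat, text_feat))
        total -= y * math.log(p + eps) + (1 - y) * math.log(1 - p + eps)
    return total / len(labels)
```

Under these assumptions, an image embedding aligned with the text of a class that is actually present yields a lower loss than the same embedding paired with the opposite labels. Only image-level labels are used, so any pixel-level localization has to come from later steps such as CAM extraction.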

Paper Structure (27 sections, 14 equations, 8 figures, 5 tables)


Figures (8)

  • Figure 1: Visualizations of different models according to the given class text "your right hand". (a): the input egocentric image. (b): the class-activation map (CAM) of the original CLIP model. (c): the CAM of the finetuned CLIP model. (d): the CAM of our CTDN model.
  • Figure 2: Overview of the three-stage CTDN. In the first stage, given the egocentric image, class texts, and cognition set, we extract the corresponding features and learn the egocentric relations. In addition, the Cognition Transferring Module (CTM) and Foreground-background Decoupling Module (FDM) transfer the cognitive knowledge into our egocentric model and disentangle the foreground and background representations. In the second stage, we perform backpropagation to obtain the initial CAM and refine it with multi-head self-attention (MHSA). In the third stage, the pseudo masks are generated and a DeepLab V2 model [chen2017deeplab] is adopted for egocentric semantic segmentation.
  • Figure 3: Schematic of FDM. We first extract the cognition features and compute the prototypes of foreground and background. Then, we calculate the similarities between the projected visual features and prototypes to get the respective scores, based on which we conduct representation decoupling for the fore/background features.
  • Figure 4: Analysis of hyperparameters ${\lambda}_{1}$, ${\lambda}_{2}$, and ${\lambda}_{3}$ on the EgoHand benchmark.
  • Figure 5: The visualizations of the initial class-activation maps (CAM) under different hyperparameter ${\lambda}_{3}$ settings on the EgoHand benchmark.
  • ...and 3 more figures
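
The FDM steps described in the Figure 3 caption (prototypes, similarities, scores, decoupling) can be sketched roughly as follows. The caption does not say how the prototypes are computed or how scores become decoupled features. Mean pooling, cosine similarity, the two-way softmax, the `temperature` value, and all function names here are therefore assumptions, and the projection of the visual features is omitted.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0

def mean_vector(vectors):
    """Element-wise mean; used here as an assumed prototype rule."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def decouple(patch_feats, fg_cognition, bg_cognition, temperature=0.1):
    """Hypothetical foreground/background decoupling of patch features.

    patch_feats:  per-patch visual features (projection step omitted)
    fg_cognition: cognition features assumed to describe the foreground
    bg_cognition: cognition features assumed to describe the background
    """
    fg_proto = mean_vector(fg_cognition)
    bg_proto = mean_vector(bg_cognition)
    fg_scores, fg_feats, bg_feats = [], [], []
    for feat in patch_feats:
        s_fg = cosine(feat, fg_proto) / temperature
        s_bg = cosine(feat, bg_proto) / temperature
        # Numerically stable two-way softmax over the prototype similarities.
        m = max(s_fg, s_bg)
        e_fg, e_bg = math.exp(s_fg - m), math.exp(s_bg - m)
        w_fg = e_fg / (e_fg + e_bg)
        fg_scores.append(w_fg)
        # Split each feature into score-weighted foreground and background parts.
        fg_feats.append([w_fg * x for x in feat])
        bg_feats.append([(1.0 - w_fg) * x for x in feat])
    return fg_scores, fg_feats, bg_feats
```

In this sketch each patch feature is split into two parts that sum back to the original, so the foreground score acts as a soft mask. The actual module may decouple the representations differently.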