Cognition Transferring and Decoupling for Text-supervised Egocentric Semantic Segmentation
Zhaofeng Shi, Heqian Qiu, Lanxiao Wang, Fanman Meng, Qingbo Wu, Hongliang Li
TL;DR
This work introduces Text-supervised Egocentric Semantic Segmentation (TESS) and addresses the challenge of learning fine-grained segmentation from image-level text supervision in cluttered egocentric scenes. The proposed Cognition Transferring and Decoupling Network (CTDN) learns wearer-object relations (via L_rel), transfers cognitive knowledge from large-scale models through the Cognition Transferring Module (CTM) and cognition set, and explicitly decouples foreground/background representations with the Foreground-background Decoupling Module (FDM). A CAM refinement and segmentation pipeline generates high-quality pseudo masks, enabling strong weakly supervised segmentation performance on four egocentric benchmarks and competitive results on third-view data. The approach demonstrates that incorporating cognition transfer and foreground-background decoupling yields robust pseudo-masks and improved segmentation, highlighting the practical impact for scalable egocentric scene understanding. Future work includes temporal modeling and video-based backbones to address motion and state changes in egocentric footage.
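The pseudo-mask step mentioned above can be illustrated with a generic sketch of how class activation maps (CAMs) are commonly turned into a pseudo mask in weakly supervised segmentation. This is not the authors' refinement pipeline: the function name `cams_to_pseudo_mask`, the min-max normalisation, and the `bg_threshold` value are all assumptions made for illustration.

```python
from typing import List, Sequence

Cam = List[List[float]]  # one H x W activation map for a single class


def cams_to_pseudo_mask(
    cams: Sequence[Cam],
    present: Sequence[int],
    bg_threshold: float = 0.4,  # hypothetical value, not taken from the paper
) -> List[List[int]]:
    """Build a pseudo mask from per-class CAMs.

    Only classes listed in `present` (the image-level labels) compete.
    Label 0 is background; class index c is written as label c + 1.
    """
    height, width = len(cams[0]), len(cams[0][0])

    # Min-max normalise each present class's CAM to [0, 1].
    normalised = {}
    for c in present:
        flat = [v for row in cams[c] for v in row]
        lo, hi = min(flat), max(flat)
        scale = (hi - lo) or 1.0  # avoid division by zero on flat maps
        normalised[c] = [[(v - lo) / scale for v in row] for row in cams[c]]

    # A pixel takes the highest-scoring present class, or background
    # when no class exceeds the threshold.
    mask = [[0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            best_label, best_score = 0, bg_threshold
            for c in present:
                score = normalised[c][y][x]
                if score > best_score:
                    best_label, best_score = c + 1, score
            mask[y][x] = best_label
    return mask
```

Restricting the competition to the classes named in the image-level labels is what lets text supervision suppress activations for absent categories; pixels where every present class scores below the threshold fall back to background.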
Abstract
In this paper, we explore a novel Text-supervised Egocentric Semantic Segmentation (TESS) task, which aims to assign pixel-level categories to egocentric images under weak supervision from texts derived from image-level labels. Egocentric scenes in this promising task contain dense wearer-object relations and inter-object interference. However, most recent third-view methods rely on the frozen Contrastive Language-Image Pre-training (CLIP) model, which is pre-trained on semantic-oriented third-view data and degrades in the egocentric view due to the "relation insensitive" problem. Hence, we propose a Cognition Transferring and Decoupling Network (CTDN) that first learns egocentric wearer-object relations by correlating the image and text. In addition, a Cognition Transferring Module (CTM) is developed to distill cognitive knowledge from a large-scale pre-trained model into our model so that it can recognize egocentric objects with various semantics. Based on the transferred cognition, the Foreground-background Decoupling Module (FDM) disentangles the visual representations to explicitly discriminate foreground from background regions, mitigating the false activations caused by objects that interfere across foreground and background during egocentric relation learning. Extensive experiments on four TESS benchmarks demonstrate the effectiveness of our approach, which outperforms many recent related methods by a large margin. Code will be available at https://github.com/ZhaofengSHI/CTDN.
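As a rough illustration of what distilling knowledge from a large pre-trained model can look like, the sketch below computes a simple feature-matching loss between student and teacher embeddings. The abstract does not specify the CTM objective, so the cosine-distance form and the names `cosine_similarity` and `feature_distillation_loss` are assumptions, not the authors' formulation.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def feature_distillation_loss(
    student: Sequence[Sequence[float]],
    teacher: Sequence[Sequence[float]],
) -> float:
    """Mean cosine distance (1 - cos) over paired student/teacher features.

    Hypothetical objective: the student is pushed to align its embeddings
    with those of a frozen teacher; 0.0 means perfect alignment.
    """
    if len(student) != len(teacher) or not student:
        raise ValueError("student and teacher need the same non-zero length")
    distances = [1.0 - cosine_similarity(s, t) for s, t in zip(student, teacher)]
    return sum(distances) / len(distances)
```

For example, `feature_distillation_loss([[1.0, 0.0]], [[0.0, 1.0]])` evaluates to 1.0 because the two vectors are orthogonal, while identical features give a loss of 0.0.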
