S3Editor: A Sparse Semantic-Disentangled Self-Training Framework for Face Video Editing

Guangzhi Wang, Tianyi Chen, Kamran Ghasedi, HsiangTao Wu, Tianyu Ding, Chris Nuesmeyer, Ilya Zharkov, Mohan Kankanhalli, Luming Liang

TL;DR

S3Editor tackles three core challenges of face video editing (limited supervision, architectural capacity, and over-editing) with a sparse, semantic-disentangled self-training framework. It combines (i) self-training that generates pseudo-edits in latent space, (ii) a semantic-disentangled editing architecture that clusters edits into $K$ groups and routes each edit to a cluster-specific transformation, and (iii) a structured sparsity learning scheme with neuron partitioning that localizes edits and avoids unintended changes. The approach is model-agnostic: it improves identity preservation, editing faithfulness, and temporal consistency on both diffusion- and GAN-based backbones, and it generalizes well to unseen edits. The result is precise, localized editing that keeps the video coherent and faithful to the source.
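To make component (i) concrete, the following is a minimal sketch of how a pseudo-edit could be sampled in latent space. It assumes a linear edit of the form x' = x + strength * d, which is an illustrative choice and not necessarily the edit operator used in the paper. The function name `make_pseudo_edit`, the attribute directions, and the 4-dimensional latent are all hypothetical.

```python
import random


def make_pseudo_edit(latent, directions, strength=1.0, rng=random):
    """Pick a random attribute and shift the latent along its direction.

    Assumption: a linear latent edit x' = x + strength * d. This is an
    illustrative stand-in, not the paper's confirmed edit operator.
    """
    attribute = rng.choice(sorted(directions))
    direction = directions[attribute]
    edited = [x + strength * d for x, d in zip(latent, direction)]
    return attribute, edited


# Hypothetical attribute directions in a toy 4-dimensional latent space.
directions = {
    "Smiling": [0.5, 0.0, 0.0, 0.0],
    "Eyeglasses": [0.0, 0.0, 1.0, 0.0],
}
latent = [0.1, 0.2, 0.3, 0.4]

# A seeded generator keeps the sampled attribute reproducible.
attribute, edited = make_pseudo_edit(
    latent, directions, strength=2.0, rng=random.Random(0)
)
```

In a self-training loop, the pair (original latent, edited latent) would then condition the generator, with losses computed on both the original and the edited face.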

Abstract

Face attribute editing plays a pivotal role in various applications. However, existing methods encounter challenges in achieving high-quality results while preserving identity, editing faithfulness, and temporal consistency. These challenges are rooted in issues related to the training pipeline, including limited supervision, architecture design, and optimization strategy. In this work, we introduce S3Editor, a Sparse Semantic-disentangled Self-training framework for face video editing. S3Editor is a generic solution that comprehensively addresses these challenges with three key contributions. Firstly, S3Editor adopts a self-training paradigm to enhance the training process through semi-supervision. Secondly, we propose a semantic disentangled architecture with a dynamic routing mechanism that accommodates diverse editing requirements. Thirdly, we present a structured sparse optimization scheme that identifies and deactivates malicious neurons to further disentangle the impact of non-target attributes. S3Editor is model-agnostic and compatible with various editing approaches. Our extensive qualitative and quantitative results affirm that our approach significantly enhances identity preservation, editing fidelity, and temporal consistency.
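The routing idea in the second contribution can be sketched as follows. This is a simplified illustration under stated assumptions: routing is done by hard nearest-centroid assignment and each cluster-specific transformation is a plain matrix, whereas the paper's dynamic routing mechanism may be learned or soft. The names `nearest_cluster` and `route_and_transform`, the centroids, and the matrices are hypothetical.

```python
import math


def nearest_cluster(edit_vec, centroids):
    """Return the index of the centroid closest to edit_vec (Euclidean)."""
    dists = [math.dist(edit_vec, c) for c in centroids]
    return dists.index(min(dists))


def apply_transform(matrix, vec):
    """Plain matrix-vector product on nested lists."""
    return [sum(w * v for w, v in zip(row, vec)) for row in matrix]


def route_and_transform(edited_latent, edit_vec, centroids, transforms):
    """Route an edit to its cluster and apply that cluster's transformation.

    Assumption: hard nearest-centroid routing, used here for illustration.
    """
    k = nearest_cluster(edit_vec, centroids)
    return k, apply_transform(transforms[k], edited_latent)


# Hypothetical setup with K = 2 clusters in a 2-dimensional space.
centroids = [[1.0, 0.0], [0.0, 1.0]]
transforms = [
    [[1.0, 0.0], [0.0, 1.0]],  # transformation for cluster 0: identity
    [[2.0, 0.0], [0.0, 0.5]],  # transformation for cluster 1: scaling
]

k, out = route_and_transform([3.0, 4.0], [0.1, 0.9], centroids, transforms)
# The edit vector is closest to the second centroid, so k == 1
# and out == [6.0, 2.0].
```

Keeping one transformation per cluster means edits with similar semantics share parameters, while unrelated edits do not interfere with each other.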

Paper Structure

This paper contains 17 sections, 4 equations, 5 figures, 5 tables.

Figures (5)

  • Figure 1: Face video editing results with the diffusion video autoencoder [kim2023diffusion_videoae] and the S3Editor (Sparse Semantic-disentangled Self-training) face video editing framework. S3Editor achieves better editing faithfulness and better identity preservation (left) and avoids over-editing (right): with the baseline method, the skin color is unexpectedly affected by the hair color edit, while ours preserves the skin color well.
  • Figure 2: Overview of the S3Editor framework, which consists of three components to improve face video editing methods. (i) Self-training (Section \ref{sec.self-training}): Given a face latent $\bm{x}$, we randomly select an attribute, e.g., Smiling, and perform an edit operation on this latent, which is then taken as the condition for face generation. A set of optimization objectives using the edited face and the original face is designed to optimize the generation model. (ii) Semantic disentangled editing architecture (Section \ref{sec.arch}): We cluster all possible edits into $K$ clusters and learn a set of transformations $\mathbf{T}_1, \dots, \mathbf{T}_K$ to conditionally encode the edited latent. (iii) Sparse learning to avoid over-editing (Section \ref{sec.sparse_learning}): We encourage sparsity in each transformation to facilitate precise and localized editing.
  • Figure 3: Sparse learning for localized editing. Facial landmarks extracted by a pre-trained detector are clustered by geometric proximity. We encourage structured sparsity in each transformation $\mathbf{T}_k$ to identify and deactivate malicious neurons, which improves editing precision.
  • Figure 4: Qualitative results produced by S3Editor. S3Editor preserves temporal consistency for motion-intensive edits (+Smile) and the original identity (+Young) while keeping edits local (+Mustache, +Eyeglasses).
  • Figure 5: Compared to the baseline method [kim2023diffusion_videoae], S3Editor not only performs the requested edit successfully but also shows better locality for the edit +bushy_eyebrows.
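The structured sparsity described for Figure 3 can be illustrated with group soft-thresholding, a standard proximal step for group-lasso-style penalties. This is a swapped-in technique for illustration only: the source does not state which sparse optimizer S3Editor uses. The sketch assumes each row of a transformation holds one neuron's weights, and the function name `group_soft_threshold` and the toy weights are hypothetical.

```python
import math


def group_soft_threshold(rows, threshold):
    """Shrink each row (one neuron's weights) toward zero as a group.

    Rows whose L2 norm is at most `threshold` become exactly zero, which
    deactivates that neuron. Assumption: rows are the sparsity groups.
    """
    result = []
    for row in rows:
        norm = math.sqrt(sum(w * w for w in row))
        scale = max(0.0, 1.0 - threshold / norm) if norm > 0 else 0.0
        result.append([scale * w for w in row])
    return result


# Toy weights: the first neuron is strong, the second is weak.
weights = [
    [3.0, 4.0],  # L2 norm 5.0, survives with shrunken weights
    [0.3, 0.4],  # L2 norm about 0.5, below the threshold, zeroed out
]
pruned = group_soft_threshold(weights, threshold=1.0)

# A neuron counts as active if any of its weights is non-zero.
active = [any(w != 0.0 for w in row) for row in pruned]
```

Because whole rows are zeroed rather than individual weights, the surviving neurons form a compact sub-network, which is the property that lets an edit stay confined to the facial region it targets.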