Control4D: Efficient 4D Portrait Editing with Text

Ruizhi Shao; Jingxiang Sun; Cheng Peng; Zerong Zheng; Boyao Zhou; Hongwen Zhang; Yebin Liu

TL;DR

Control4D addresses the challenge of efficiently and consistently editing dynamic 4D portraits using text. It introduces GaussianPlanes, a plane-based decomposition of 4D Gaussian Splatting that accelerates and stabilizes the representation, together with a 4D generator that learns from the diffusion-based editor to produce coherent, high-quality edits. The framework combines a GAN-based generator and a diffusion-based editing loop, enabling fast training and robust spatiotemporal consistency across views and time. Experimental results show faster convergence, improved rendering quality, and stronger temporal coherence compared with prior 4D editing approaches, highlighting its practical impact for text-driven 4D portrait manipulation.

Abstract

We introduce Control4D, an innovative framework for editing dynamic 4D portraits using text instructions. Our method addresses the prevalent challenges in 4D editing, notably the inefficiencies of existing 4D representations and the inconsistent editing effect caused by diffusion-based editors. We first propose GaussianPlanes, a novel 4D representation that makes Gaussian Splatting more structured by applying plane-based decomposition in 3D space and time. This enhances both efficiency and robustness in 4D editing. Furthermore, we propose to leverage a 4D generator to learn a more continuous generation space from inconsistent edited images produced by the diffusion-based editor, which effectively improves the consistency and quality of 4D editing. Comprehensive evaluation demonstrates the superiority of Control4D, including significantly reduced training time, high-quality rendering, and spatial-temporal consistency in 4D portrait editing. The link to our project website is https://control4darxiv.github.io.
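To make the idea of a plane-based decomposition over 3D space and time concrete, the following is a minimal sketch and not the authors' implementation. It assumes a six-plane layout (xy, xz, yz, xt, yt, zt) as used in generic plane-factorized 4D representations, bilinear interpolation on each plane, and concatenation of the per-plane features; none of these choices is confirmed by the abstract. All names (`bilinear_sample`, `make_planes`, `query_feature`, `AXIS_PAIRS`) are hypothetical. In a method of this kind the resulting feature would presumably be decoded into per-Gaussian attributes, which is not sketched here.

```python
import numpy as np

# Assumed layout: one 2D feature plane per pair of axes of (x, y, z, t).
# Indices 0..3 stand for x, y, z, t. This pairing is an illustrative assumption.
AXIS_PAIRS = [(0, 1), (0, 2), (1, 2), (0, 3), (1, 3), (2, 3)]


def bilinear_sample(plane, u, v):
    """Bilinearly sample an (H, W, C) feature plane at (u, v), both in [0, 1]."""
    h, w, _ = plane.shape
    x = u * (w - 1)  # continuous column coordinate
    y = v * (h - 1)  # continuous row coordinate
    x0, y0 = int(np.floor(x)), int(np.floor(y))
    x1, y1 = min(x0 + 1, w - 1), min(y0 + 1, h - 1)
    fx, fy = x - x0, y - y0
    top = (1 - fx) * plane[y0, x0] + fx * plane[y0, x1]
    bottom = (1 - fx) * plane[y1, x0] + fx * plane[y1, x1]
    return (1 - fy) * top + fy * bottom


def make_planes(resolution, channels, seed=0):
    """Create six randomly initialised feature planes (stand-ins for learned ones)."""
    rng = np.random.default_rng(seed)
    return [rng.standard_normal((resolution, resolution, channels))
            for _ in AXIS_PAIRS]


def query_feature(planes, point):
    """Look up a 4D point (x, y, z, t), each in [0, 1], on every plane.

    Returns the concatenation of the six per-plane features.
    """
    feats = [bilinear_sample(plane, point[a], point[b])
             for plane, (a, b) in zip(planes, AXIS_PAIRS)]
    return np.concatenate(feats)


if __name__ == "__main__":
    planes = make_planes(resolution=8, channels=4)
    feature = query_feature(planes, np.array([0.2, 0.5, 0.9, 0.0]))
    print(feature.shape)
```

The appeal of such a factorization is that storage grows with the square of the grid resolution per plane rather than with its fourth power, which is one plausible reason a plane-based structure could improve efficiency.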

Paper Structure (30 sections, 8 equations, 13 figures, 1 table)

Introduction
Related Work
2D Diffusion Models
NeRF-Based 3D Generation and Editing
NeRF for Dynamic Scenes
Gaussian Splatting
Overview
GaussianPlanes
GaussianPlanes in 3D
GaussianPlanes in 4D
4D Editing with GaussianPlanes
Connecting GAN to GaussianPlanes
Multi-level Generation with Guidance
Training Strategy
Experiment
...and 15 more sections

Figures (13)

Figure 1: We propose Control4D, an approach to high-fidelity and spatiotemporally consistent 4D portrait editing with only text instructions. Given the multi-view videos shown on the left and the text instruction "Jensen Huang is roasting steak", Control4D generates realistic and 4D-consistent editing results, presented in the middle and on the right.
Figure 2: Pipeline of Control4D: Our method first uses GaussianPlanes to train an implicit representation of the 4D portrait scene, which is then rendered into latent features and RGB images via Gaussian rendering; these serve as inputs to the GAN-based generator. Meanwhile, a 2D diffusion-based editor edits the dataset, taking the noisy results and conditions as inputs. The resulting edited images are used as real images, while the outputs of the super-resolution module serve as fake images; both are fed to the discriminator. The discriminator's outputs are used to compute the loss, which iteratively updates both the generator and the discriminator.
Figure 3: Illustration of generation with multi-level guidance: we propose a three-level image generation process to balance generator training, where $E_g$ denotes the global encoder and $E_l$ denotes the local encoder.
Figure 4: Qualitative comparison with Instruct-NeRF2NeRF (static): In a static scenario with the prompt “Turn him into Elon Musk”, both models are trained to convergence on the same dataset. Our method (top row) produces highly realistic renderings of human portraits, while Instruct-NeRF2NeRF exhibits lower realism and consistency, along with unexpected distortions in facial features.
Figure 5: Qualitative comparison with the baseline (dynamic): In a dynamic scenario with the prompt “Mark Zuckerberg”, the baseline (first row) employs only the dataset update (DU) method, whereas our approach (second row), which adds the GAN, produces more realistic and consistent renderings.
...and 8 more figures
