Text2Relight: Creative Portrait Relighting with Text Guidance

Junuk Cha, Mengwei Ren, Krishna Kumar Singh, He Zhang, Yannick Hold-Geoffroy, Seunghyun Yoon, HyunJoon Jung, Jae Shin Yoon, Seungryul Baek

TL;DR

Text2Relight addresses the challenge of text-driven portrait relighting by proposing a scalable data synthesis pipeline and a lighting-focused foundational diffusion model. The pipeline combines hierarchical text prompts generated by large language models, text-conditioned lighting image generation (RGB and HDR panorama), and image-based relighting for both foreground and background using a point-light representation and inverse rendering. The model is trained by repurposing InstructPix2Pix with auxiliary tasks (shadow removal and light positioning) and a targeted loss that aligns the generated lighting with the text prompt, achieving better fidelity and content preservation than the baselines. This approach enables creative, text-guided relighting of in-the-wild portraits and broad applications in portrait editing, though the authors note limitations in background lighting realism and spatial context understanding.

Abstract

We present a lighting-aware image editing pipeline that, given a portrait image and a text prompt, performs single-image relighting. Our model modifies the lighting and color of both the foreground and background to align with the provided text description. The unbounded creativity of text allows us to describe the lighting of a scene with any sensory feature, including temperature, emotion, smell, time, and so on. However, modeling such a mapping between unbounded text and lighting is extremely challenging because no scalable dataset provides a large number of text-relighting pairs; as a result, current text-driven image editing models do not generalize to lighting-specific use cases. We overcome this problem by introducing a novel data synthesis pipeline. First, diverse and creative text prompts that describe scenes with various lighting are automatically generated under a crafted hierarchy using a large language model (*e.g.,* ChatGPT). Second, a text-guided image generation model creates a lighting image that best matches the text. Third, conditioned on the lighting image, we perform image-based relighting for both foreground and background using a single portrait image or a set of OLAT (One-Light-at-A-Time) images captured with a lightstage system. For background relighting in particular, we represent the lighting image as a set of point lights and transfer them to other background images. A generative diffusion model is then trained on the synthesized large-scale data with auxiliary task augmentation (*e.g.,* portrait delighting and light positioning) to correlate the latent text and lighting distributions for text-guided portrait relighting.
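Image-based relighting from OLAT captures is commonly built on the linearity of light transport: an image under an arbitrary lighting environment can be approximated as a weighted sum of the one-light-at-a-time images. The sketch below illustrates only that general principle. It is not the authors' implementation, and the function name `relight_from_olat`, the array shapes, and the per-light RGB weights are assumptions made for illustration.

```python
import numpy as np


def relight_from_olat(olat_images, light_weights):
    """Relight a subject as a weighted sum of OLAT images.

    olat_images:   array of shape (N, H, W, 3), one image per light (assumed layout).
    light_weights: array of shape (N, 3), RGB intensity assigned to each light,
                   e.g. sampled from a lighting image (assumed representation).
    Returns an (H, W, 3) relit image.
    """
    olat_images = np.asarray(olat_images, dtype=np.float64)
    light_weights = np.asarray(light_weights, dtype=np.float64)

    if olat_images.ndim != 4 or olat_images.shape[-1] != 3:
        raise ValueError("olat_images must have shape (N, H, W, 3)")
    if light_weights.shape != (olat_images.shape[0], 3):
        raise ValueError("light_weights must have shape (N, 3)")

    # Sum over the light index n, scaling each color channel c independently.
    return np.einsum("nhwc,nc->hwc", olat_images, light_weights)


# Toy example: two lights, a single-pixel "image".
olat = np.stack([
    np.full((1, 1, 3), 1.0),  # response to light 0
    np.full((1, 1, 3), 2.0),  # response to light 1
])
weights = np.array([
    [1.0, 0.0, 0.0],  # light 0 is pure red
    [0.5, 0.5, 0.5],  # light 1 is dim white
])
relit = relight_from_olat(olat, weights)
```

In a real setting the weights would come from the generated lighting image, and how that image is mapped to per-light intensities is a design choice this sketch leaves open.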

Paper Structure

This paper contains 30 sections, 5 equations, 30 figures, 3 tables.

Figures (30)

  • Figure 1: Results from existing text-guided image editing models (left: brooks2023instructpix2pix, right: fu2023guiding), which largely distort the input images by generating new content.
  • Figure 2: Overview of the data synthesis pipeline. We first generate a text prompt with a language hierarchy, from which we generate a lighting image. Subsequently, we transfer the lighting from the lighting image to a portrait image captured either in a lightstage or in the real world (with background inpainting). These form the training dataset for our Text2Relight model.
  • Figure 3: Pipeline for text generation with hierarchy. For sub-category generation, we ask ChatGPT 'Generate words related to {category}. Write 30 or more words on a single line, separated by commas.'; for sentence generation, we ask 'could you describe the lighting property of a random scene using the words of {selected words}.'
  • Figure 4: Pipeline for foreground relighting.
  • Figure 5: Pipeline for background relighting.
  • ...and 25 more figures
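
The two prompt templates quoted in the Figure 3 caption can be sketched as plain string templates. This is a minimal illustration under stated assumptions: the helper names, the comma-splitting of the model's reply, the example category "emotion", and the number of words sampled per sentence are hypothetical and not taken from the paper. The call to the language model itself is deliberately left out.

```python
import random

# Templates copied from the Figure 3 caption; placeholders renamed for str.format.
SUBCATEGORY_TEMPLATE = (
    "Generate words related to {category}. "
    "Write 30 or more words on a single line, separated by commas."
)
SENTENCE_TEMPLATE = (
    "could you describe the lighting property of a random scene "
    "using the words of {selected_words}."
)


def build_subcategory_prompt(category):
    """Prompt asking for sub-category words of one top-level category."""
    return SUBCATEGORY_TEMPLATE.format(category=category)


def parse_word_list(reply):
    """Split a comma-separated reply into clean words (assumed reply format)."""
    return [word.strip() for word in reply.split(",") if word.strip()]


def build_sentence_prompt(words, k, rng):
    """Prompt asking for a lighting description using k sampled words.

    k is a hypothetical parameter; the caption does not say how many words
    are selected.
    """
    selected = rng.sample(words, k)
    return SENTENCE_TEMPLATE.format(selected_words=", ".join(selected))


# Example with a made-up reply standing in for the language model's output.
category_prompt = build_subcategory_prompt("emotion")
words = parse_word_list("warm, cozy , ,golden")
sentence_prompt = build_sentence_prompt(words, k=3, rng=random.Random(0))
```

Each returned string would then be sent to the language model, and the sentence it produces serves as the text prompt for the lighting image generation step.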