LGCC: Enhancing Flow Matching Based Text-Guided Image Editing with Local Gaussian Coupling and Context Consistency

Fangbing Liu, Pengfei Duan, Wen Li, Yi He

TL;DR

This paper tackles detail degradation and context loss in flow-matching text-guided image editing by introducing LGCC, a framework that combines Local Gaussian Noise Coupling (LGNC) and Content Consistency Loss (CCL) within a latent-space editing pipeline. LGNC preserves local spatial structure by coupling target embeddings with their locally perturbed counterparts, while CCL semantically regularizes edits to align with the input text and prevent unwanted removals. The authors integrate LGCC with a pre-trained Bagel model through curriculum learning, achieving faster inference (requiring about $40\%$--$50\%$ of Bagel/Flux time) and improved editing quality, evidenced by a $+1.60\%$ gain in local detail scores and a $+0.53\%$ gain in overall scores on I$^2$EBench. The results demonstrate that LGCC maintains content integrity and detail while significantly accelerating lightweight and universal editing, offering a cost-efficient enhancement to existing flow-matching pipelines.

Abstract

Recent advancements have demonstrated the great potential of flow matching-based Multimodal Large Language Models (MLLMs) in image editing. However, state-of-the-art works like BAGEL face limitations, including detail degradation, content inconsistency, and inefficiency due to their reliance on random noise initialization. To address these issues, we propose LGCC, a novel framework with two key components: Local Gaussian Noise Coupling (LGNC) and Content Consistency Loss (CCL). LGNC preserves spatial details by modeling target image embeddings and their locally perturbed counterparts as coupled pairs, while CCL ensures semantic alignment between edit instructions and image modifications, preventing unintended content removal. By integrating LGCC with the BAGEL pre-trained model via curriculum learning, we significantly reduce inference steps, improving local detail scores on I2EBench by 1.60% and overall scores by 0.53%. LGCC achieves a 3x to 5x speedup for lightweight editing and 2x for universal editing, requiring only 40% to 50% of the inference time of BAGEL or Flux. These results demonstrate LGCC's ability to preserve detail, maintain contextual integrity, and enhance inference speed, offering a cost-efficient solution without compromising editing quality.
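The contrast the abstract draws between random noise initialization and local Gaussian coupling can be illustrated with a small sketch. This is not the authors' implementation: the function names, the perturbation scale `sigma`, and the linear interpolation path are assumptions chosen for illustration, and the array is a random stand-in for a latent image embedding. The sketch only shows why a source sample drawn near the target gives a much shorter flow path than one drawn from independent noise.

```python
import numpy as np

def random_source(target, rng):
    """Baseline: source sample is pure Gaussian noise, independent of the target."""
    return rng.standard_normal(target.shape)

def local_gaussian_source(target, sigma, rng):
    """Assumed coupling: source is the target embedding plus a small Gaussian perturbation.
    `sigma` is a hypothetical scale, not a value taken from the paper."""
    return target + sigma * rng.standard_normal(target.shape)

def interpolate(source, target, t):
    """Linear flow-matching path x_t = (1 - t) * x0 + t * x1 and its constant velocity x1 - x0."""
    x_t = (1.0 - t) * source + t * target
    velocity = target - source
    return x_t, velocity

rng = np.random.default_rng(0)
target = rng.standard_normal((4, 8))  # stand-in for a latent image embedding

x0_rand = random_source(target, rng)
x0_local = local_gaussian_source(target, sigma=0.1, rng=rng)

_, v_rand = interpolate(x0_rand, target, t=0.5)
_, v_local = interpolate(x0_local, target, t=0.5)

# The coupled pair yields a far smaller velocity, i.e. a shorter path to integrate.
print("random init velocity norm:", np.linalg.norm(v_rand))
print("local coupling velocity norm:", np.linalg.norm(v_local))
```

Under these assumptions the velocity field the model must learn has much smaller magnitude for the coupled pair, which is one plausible reading of why fewer inference steps could suffice.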

Paper Structure

This paper contains 29 sections, 23 equations, 8 figures, 4 tables.

Figures (8)

  • Figure 1: Existing text-guided image editing often lacks detail, misses context, and sometimes over-edits images.
  • Figure 2: The workflow of LGCC. LGCC consists of two key modifications in the flow matching approach, including Local Gaussian Noise Coupling (LGNC) and Context Consistency Loss (CCL), which are in red boxes.
  • Figure 3: Exploring image details across different works.
  • Figure 4: Typical over-edit cases by existing works.
  • Figure 5: Bagel vs. LGCC results at various numbers of inference steps.
  • ...and 3 more figures