LGCC: Enhancing Flow Matching Based Text-Guided Image Editing with Local Gaussian Coupling and Context Consistency
Fangbing Liu, Pengfei Duan, Wen Li, Yi He
TL;DR
This paper tackles detail degradation and context loss in flow-matching-based text-guided image editing by introducing LGCC, a framework that combines Local Gaussian Noise Coupling (LGNC) and Content Consistency Loss (CCL) within a latent-space editing pipeline. LGNC preserves local spatial structure by treating target image embeddings and their locally perturbed counterparts as coupled pairs, while CCL keeps edits semantically aligned with the edit instruction and prevents unintended content removal. The authors integrate LGCC with the pre-trained BAGEL model through curriculum learning, which cuts inference time to about $40\%$--$50\%$ of that of BAGEL or Flux and improves editing quality on I$^2$EBench by $+1.60\%$ in local detail scores and $+0.53\%$ in overall scores. The results indicate that LGCC maintains content integrity and detail while accelerating both lightweight and universal editing, offering a cost-efficient enhancement to existing flow-matching pipelines.
Abstract
Recent advances have demonstrated the great potential of flow-matching-based Multimodal Large Language Models (MLLMs) in image editing. However, state-of-the-art models such as BAGEL suffer from detail degradation, content inconsistency, and inefficiency because they rely on random noise initialization. To address these issues, we propose LGCC, a novel framework with two key components: Local Gaussian Noise Coupling (LGNC) and Content Consistency Loss (CCL). LGNC preserves spatial details by modeling target image embeddings and their locally perturbed counterparts as coupled pairs, while CCL ensures semantic alignment between edit instructions and image modifications, preventing unintended content removal. By integrating LGCC with the pre-trained BAGEL model via curriculum learning, we significantly reduce the number of inference steps while improving local detail scores on I2EBench by 1.60% and overall scores by 0.53%. LGCC achieves a 3x to 5x speedup for lightweight editing and a 2x speedup for universal editing, requiring only 40% to 50% of the inference time of BAGEL or Flux. These results demonstrate LGCC's ability to preserve detail, maintain contextual integrity, and accelerate inference, offering a cost-efficient solution without compromising editing quality.
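To make the coupling idea concrete, the following is a minimal sketch of how a locally perturbed source could be paired with a target latent in a flow-matching setup. It is an illustration under stated assumptions, not the authors' implementation: the abstract does not give the exact formulation, so the linear interpolation path, the isotropic noise scale `sigma`, and the helper names `local_gaussian_source`, `interpolate`, and `velocity_target` are all hypothetical. Latents are represented as flat Python lists to keep the example dependency-free.

```python
import random


def local_gaussian_source(target, sigma, rng):
    """Assumed coupling: source = target + sigma * eps, with eps ~ N(0, 1).

    The source is sampled near the target latent rather than from an
    unrelated standard normal, so each (source, target) pair stays close.
    """
    return [x + sigma * rng.gauss(0.0, 1.0) for x in target]


def interpolate(source, target, t):
    """Point on the straight path from source (t = 0) to target (t = 1)."""
    return [(1.0 - t) * s + t * x for s, x in zip(source, target)]


def velocity_target(source, target):
    """Regression target for the velocity field along a straight path."""
    return [x - s for s, x in zip(source, target)]


rng = random.Random(0)                 # fixed seed for reproducibility
target = [0.5, -1.0, 2.0, 0.0]         # toy stand-in for a target latent
source = local_gaussian_source(target, sigma=0.1, rng=rng)

x_mid = interpolate(source, target, 0.5)
v = velocity_target(source, target)

# Following the velocity for the remaining half of the path reaches the target.
reached = [xm + 0.5 * vi for xm, vi in zip(x_mid, v)]
```

Under these assumptions the source starts close to the target, so the path the model has to traverse is short. That is one plausible reading of why coupling would allow fewer inference steps than random noise initialization, but the abstract does not spell out the mechanism.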
