EVLM: Self-Reflective Multimodal Reasoning for Cross-Dimensional Visual Editing
Umar Khalid, Kashif Munir, Hasan Iqbal, Azib Farooq, Jing Hua, Nazanin Rahnavard, Chen Chen, Victor Zhu, Zhengping Ji
TL;DR
This work tackles the challenge of editing complex visual content under ambiguous, multimodal instructions by introducing EVLM, a vision-language model that employs reflective multimodal reasoning. EVLM combines Chain-of-Thought (CoT) supervision with Reflection-Aware KL-Divergence Target Optimization (RKTO) and is trained on a Reflective-Edit dataset of 30,000 CoT examples to produce concise, context-aware editing instructions and target masks. The approach yields strong gains in alignment with human intent across 2D, 3D, and 4D editing tasks and performs robustly across domains when paired with diffusion-based editors. By making its reasoning interpretable and refining alignment, EVLM offers a scalable foundation for multimodal editing that can generalize to varied visual reasoning requirements in real-world applications.
Abstract
Editing complex visual content from ambiguous or partially specified instructions remains a core challenge in vision-language modeling. Existing models can contextualize content but often fail to infer the underlying intent within a reference image or scene, leading to inconsistent or misaligned edits. We introduce the Editing Vision-Language Model (EVLM), a system that interprets ambiguous instructions in conjunction with reference visuals to produce precise, context-aware editing prompts. EVLM's key innovation is a reflective reasoning framework that translates subjective user intent into structured, actionable outputs by aligning with human-rated rationales through Reflection-Aware KL-Divergence Target Optimization (RKTO). By combining Chain-of-Thought (CoT) reasoning with RKTO alignment, EVLM captures fine-grained editing preferences without relying on binary supervision. Trained on a dataset of 30,000 CoT examples with human-annotated rationale quality, EVLM achieves substantial gains in alignment with human intent. Experiments across image, video, 3D, and 4D editing tasks show that EVLM generates coherent and high-quality instructions, providing a scalable foundation for multimodal editing and reasoning.
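The abstract names RKTO but does not state its objective. As a rough illustration only, the sketch below assumes it resembles a KTO-style loss (an implied reward measured against a KL baseline) in which the binary desirable/undesirable label is replaced by a graded human rating in [0, 1]. The function name, the rating-weighted blend, and all parameter names are assumptions made for this example, not the authors' formulation.

```python
import math


def sigmoid(x: float) -> float:
    """Standard logistic function."""
    return 1.0 / (1.0 + math.exp(-x))


def kto_style_loss(policy_logp: float, ref_logp: float, rating: float,
                   kl_baseline: float, beta: float = 0.1) -> float:
    """Hypothetical graded KTO-style loss for one (prompt, rationale) pair.

    policy_logp / ref_logp: log-probability of the rationale under the
        trained policy and the frozen reference model.
    rating: assumed human rationale-quality score in [0, 1].
    kl_baseline: estimate of KL(policy || reference), the reference point.
    """
    if not 0.0 <= rating <= 1.0:
        raise ValueError("rating must lie in [0, 1]")
    # Implied reward: how much more likely the policy makes this rationale.
    reward = policy_logp - ref_logp
    # Value if the rationale is treated as desirable / undesirable.
    value_if_good = sigmoid(beta * (reward - kl_baseline))
    value_if_bad = sigmoid(beta * (kl_baseline - reward))
    # Assumption: blend the two binary terms by the graded rating.
    return rating * (1.0 - value_if_good) + (1.0 - rating) * (1.0 - value_if_bad)
```

With `rating` restricted to 0 or 1, this reduces to the binary KTO-style loss with unit weights; intermediate ratings interpolate between pushing a rationale's likelihood up and pushing it down.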
