Empowering Visual Creativity: A Vision-Language Assistant to Image Editing Recommendations

Tiancheng Shen, Jun Hao Liew, Long Mai, Lu Qi, Jiashi Feng, Jiaya Jia

TL;DR

The paper tackles the challenge of transforming vague image-editing intents into concrete, diverse editing instructions. It introduces Image Editing Recommendation (IER) and a multimodal Creativity-Vision Language Assistant (Creativity-VLA) that couples global and local editing through a token-for-localization mechanism. A 16,000-item imagination-in-editing instruction dataset is created via a CoT-guided GPT-4 workflow and manual curation, enabling targeted instruction tuning. Empirical results show Creativity-VLA achieves superior hint alignment and diverse, high-quality editing suggestions, outperforming baselines in user studies and CLIP-based evaluations. This work promises to make creative image editing more accessible by expanding ideation, enabling rapid prototyping, and supporting localized edits in practical design contexts.

Abstract

Advances in text-based image generation and editing have revolutionized content creation, enabling users to create impressive content from imaginative text prompts. However, existing methods are not designed to work well with the oversimplified prompts that are often encountered in typical scenarios when users start their editing with only vague or abstract purposes in mind. Those scenarios demand elaborate ideation efforts from the users to bridge the gap between such vague starting points and the detailed creative ideas needed to depict the desired results. In this paper, we introduce the task of Image Editing Recommendation (IER). This task aims to automatically generate diverse creative editing instructions from an input image and a simple prompt representing the users' under-specified editing purpose. To this end, we introduce Creativity-Vision Language Assistant (Creativity-VLA), a multimodal framework designed specifically for edit-instruction generation. We train Creativity-VLA on our edit-instruction dataset specifically curated for IER. We further enhance our model with a novel 'token-for-localization' mechanism, enabling it to support both global and local editing operations. Our experimental results demonstrate the effectiveness of Creativity-VLA in suggesting instructions that not only contain engaging creative elements but also maintain high relevance to both the input image and the user's initial hint.

Paper Structure (17 sections, 6 equations, 7 figures, 3 tables)

Figures (7)

  • Figure 1: In typical editing scenarios, users tend to start with oversimplified text prompts, which we refer to as editing hints, because they are often uncertain about what visual results they desire and which editing instructions would achieve them. Given only such coarse hints, modern image editing methods [geng2023instructdiffusion, Zhang2023MagicBrush] often produce unimpressive results, as shown in part (b). To overcome this, we introduce Creativity-VLA, which jointly leverages the visual understanding and creative reasoning capabilities of Large Vision-Language Models to generate diverse editing instructions, thus achieving the desired visual effect shown in part (a).
  • Figure 2: (a) highlights the need for local editing: InstructDiffusion [geng2023instructdiffusion], which edits globally, is not well suited to tasks such as product design that require localized edits. (b) gives an example of an appropriate region for applying a suggestion. As shown in (c), decoupling the suggestion from the location in the instruction lets a user adjust the location by hand when the suggestion is acceptable but the predicted location is not ideal. Images enclosed by the blue rectangle are global editing results from InstructDiffusion [geng2023instructdiffusion], and those within the green rectangle are local editing results from GLIGEN [Li_2023_CVPR]. The red rectangle marks the recommended location for the suggestion.
  • Figure 3: The pipeline for collecting the instruction dataset for imagination in editing. Visual understanding, imagining hint-related concepts, reasoning to generate instructions, and dataset curation form a chain of reasoning steps that yields high-quality data.
  • Figure 4: The architecture of Creativity-VLA. It converts an input image and an editing hint into an editing suggestion and an editing token; the token is used to recommend the location for the edit.
  • Figure 5: Qualitative comparison among MagicBrush, InstructDiffusion, LLaVA-v1.5, GPT-4V and Creativity-VLA. Due to space limitations, the corresponding editing instructions are provided in the supplementary file.
  • ...and 2 more figures
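
The decoupling of suggestion and location described in the Figure 2 caption can be illustrated with a small data-structure sketch. This is not the paper's implementation and does not model the token-for-localization mechanism itself; the names `EditRecommendation`, `relocate`, and `route`, and the choice of a normalized bounding box, are all assumptions made for illustration only.

```python
from dataclasses import dataclass, replace
from typing import Optional, Tuple

# Hypothetical representation: (x_min, y_min, x_max, y_max), normalized to [0, 1].
Box = Tuple[float, float, float, float]


@dataclass(frozen=True)
class EditRecommendation:
    """One recommended edit, with 'what' and 'where' stored separately."""
    suggestion: str            # free-form editing instruction text
    box: Optional[Box] = None  # recommended region; None means a global edit

    @property
    def is_local(self) -> bool:
        return self.box is not None


def relocate(rec: EditRecommendation, new_box: Box) -> EditRecommendation:
    """Keep the suggestion unchanged and swap in a user-adjusted location."""
    x0, y0, x1, y1 = new_box
    if not (0.0 <= x0 < x1 <= 1.0 and 0.0 <= y0 < y1 <= 1.0):
        raise ValueError("box must satisfy 0 <= min < max <= 1 on both axes")
    return replace(rec, box=new_box)


def route(rec: EditRecommendation) -> str:
    """Pick which kind of editor a recommendation would be sent to."""
    return "local" if rec.is_local else "global"


if __name__ == "__main__":
    rec = EditRecommendation("add a floral pattern", (0.1, 0.1, 0.4, 0.4))
    moved = relocate(rec, (0.5, 0.5, 0.9, 0.9))
    print(route(moved), moved.suggestion, moved.box)
```

Because the suggestion text and the box are separate fields, a user who likes the suggestion but not the predicted region only needs to change the box, which mirrors the adjustment scenario in part (c) of Figure 2.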