T$^3$-S2S: Training-free Triplet Tuning for Sketch to Scene Synthesis in Controllable Concept Art Generation

Zhenhong Sun, Yifu Wang, Yonhon Ng, Yongzhi Xu, Daoyi Dong, Hongdong Li, Pan Ji

TL;DR

This work tackles the challenge of generating detailed multi-instance 2D concept art that aligns with user prompts and sketches and can guide 3D scene layouts. It introduces T$^3$-S2S, a training-free triplet tuning framework for ControlNet consisting of Prompt Balance, Characteristic Priority, and Dense Tuning, complemented by a twin-structure terrain/isometric representation to improve terrain layout consistency. A cross-attention analysis motivates the three modules, which together address prompt energy imbalance and weak instance prominence, achieving substantial gains on CLIP-based metrics and user evaluations for complex scenes. The approach improves small-instance fidelity and terrain coherence, offering a practical, data-free path to controllable concept-art generation for games, film, and VR workflows.

Abstract

2D concept art generation for 3D scenes is a crucial yet challenging task in computer graphics, as creating natural, intuitive environments still demands extensive manual effort in concept design. While generative AI has simplified 2D concept design via text-to-image synthesis, it struggles with complex multi-instance scenes and offers limited support for structured terrain layout. In this paper, we propose Training-free Triplet Tuning for Sketch-to-Scene (T$^3$-S2S) generation after reviewing the entire cross-attention mechanism. This scheme revitalizes the ControlNet model for detailed multi-instance generation via three key modules: Prompt Balance ensures keyword representation and minimizes the risk of missing critical instances; Characteristic Priority emphasizes sketch-based features by highlighting TopK indices in feature channels; and Dense Tuning refines contour details within instance-related regions of the attention map. Leveraging the controllability of T$^3$-S2S, we also introduce a feature-sharing strategy with dual prompt sets to generate layer-aware isometric and terrain-view representations for the terrain layout. Experiments show that our sketch-to-scene workflow consistently produces multi-instance 2D scenes with details aligned with input prompts.
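
As a rough illustration of the "highlighting TopK indices in feature channels" idea mentioned in the abstract, the following minimal sketch amplifies the largest-magnitude entries of each channel in a toy value matrix. The function name `amplify_topk`, the tokens-by-channels layout, the choice to rank by absolute value within each channel, and the default `scale` of 2.0 are all assumptions made for illustration; the paper's actual module may select and weight the indices differently.

```python
def amplify_topk(values, k, scale=2.0):
    """Return a copy of `values` (tokens x channels) in which the k
    largest-magnitude entries of each channel are multiplied by `scale`.

    Hypothetical helper: the axis used for the top-k selection and the
    scaling rule are assumptions, not the paper's definition.
    """
    if not values:
        return []
    n_tokens, n_channels = len(values), len(values[0])
    out = [list(row) for row in values]  # copy so the input is not mutated
    for c in range(n_channels):
        # Rank token positions in this channel by absolute value (the "extrema").
        ranked = sorted(range(n_tokens), key=lambda t: abs(values[t][c]), reverse=True)
        for t in ranked[:k]:
            out[t][c] = values[t][c] * scale
    return out


# Toy value matrix: 3 tokens, 2 channels (made-up numbers).
toy_values = [[1.0, -4.0], [3.0, 0.5], [-2.0, 2.0]]
amplified = amplify_topk(toy_values, k=1, scale=2.0)
```

With `k=1`, only the single strongest entry per channel is scaled and all other entries are left unchanged.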

Paper Structure

This paper contains 21 sections, 11 equations, 20 figures, 3 tables.

Figures (20)

  • Figure 1: The SDXL-base model (podell2023sdxl) and the ControlNet model (xinsir2023controlnet) perform well on common instances such as humans, but they struggle with complex multi-instance scenes involving small instances and fail to follow users' prompts accurately.
  • Figure 2: The SDXL-base (podell2023sdxl) and ControlNet (xinsir2023controlnet) models struggle with complex multi-instance scenes based on sketch images and text prompts, even with improved dense tuning (kim2023dense).
  • Figure 3: Embedding energy comparison between (a) a global prompt ("Isometric view of game scene, a plain, walk path, a river, a high mountain, houses.") and (b) single-word prompts, where each keyword is separated and embedded individually and its energy is higher than within the group. The energy imbalance can lead to attention competition, and low-energy small instances ("path" and "houses") are easily forgotten. (c) Cosine similarity between the embeddings of (a) and (b).
  • Figure 4: Interaction between attention maps and value matrices under dense tuning, using the prompts from Fig. \ref{figs:controlnet}. (a) Attention maps show strong sketch relevance. (b/c) Five-channel value-feature pairs reveal the importance of extrema. Although feature enhancement improves the chance that an instance appears, forgetting still occurs because the extrema are not prominent. Statistics are shown in Fig. \ref{fig:max} and Appendix C.
  • Figure 5: Generations obtained by amplifying the TopK extrema twice in the value matrices, based on the pipeline in Fig. \ref{fig:attn_value}. Most instances appear, but uniform amplification also introduces noise.
  • ...and 15 more figures
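
The energy and cosine-similarity comparison described in the Figure 3 caption can be sketched on toy vectors as follows. This is only an illustrative sketch: "energy" is assumed here to mean the squared L2 norm of an embedding, the helper names `energy` and `cosine_similarity` are hypothetical, and the vectors are made-up numbers rather than real text-encoder outputs.

```python
import math


def energy(vec):
    """Squared L2 norm of an embedding (assumed definition of 'energy')."""
    return sum(x * x for x in vec)


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a, norm_b = math.sqrt(energy(a)), math.sqrt(energy(b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Made-up embeddings for two keywords: once taken from inside a longer
# prompt, once embedded on their own as single-word prompts.
in_prompt = {"river": [0.9, 0.1, 0.2], "houses": [0.1, 0.2, 0.1]}
standalone = {"river": [1.0, 0.2, 0.2], "houses": [0.8, 0.9, 0.5]}

# Per keyword: (energy inside the prompt, energy alone, cosine similarity).
report = {
    word: (
        energy(in_prompt[word]),
        energy(standalone[word]),
        cosine_similarity(in_prompt[word], standalone[word]),
    )
    for word in in_prompt
}
```

In this toy setup "houses" has much lower energy inside the prompt than on its own, which mirrors the kind of imbalance the caption describes for small instances.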