T$^3$-S2S: Training-free Triplet Tuning for Sketch to Scene Synthesis in Controllable Concept Art Generation
Zhenhong Sun, Yifu Wang, Yonhon Ng, Yongzhi Xu, Daoyi Dong, Hongdong Li, Pan Ji
TL;DR
This work tackles the challenge of generating detailed multi-instance 2D concept art that aligns with user prompts and sketches and can guide 3D scene layouts. It introduces T$^3$-S2S, a training-free triplet tuning framework for ControlNet consisting of Prompt Balance, Characteristic Priority, and Dense Tuning, complemented by a twin-structure terrain/isometric representation that improves terrain layout consistency. A cross-attention analysis motivates the three modules, which together address prompt energy imbalance and weak instance prominence, achieving substantial gains on CLIP-based metrics and in user evaluations for complex scenes. The approach enhances small-instance fidelity and terrain coherence, offering a practical, training-free path to controllable concept-art generation for games, film, and VR workflows.
Abstract
2D concept art generation for 3D scenes is a crucial yet challenging task in computer graphics, as creating natural, intuitive environments still demands extensive manual effort in concept design. While generative AI has simplified 2D concept design via text-to-image synthesis, it struggles with complex multi-instance scenes and offers limited support for structured terrain layout. In this paper, after reviewing the entire cross-attention mechanism, we propose Training-free Triplet Tuning for Sketch-to-Scene generation (T$^3$-S2S). This scheme revitalizes the ControlNet model for detailed multi-instance generation via three key modules: Prompt Balance ensures keyword representation and minimizes the risk of missing critical instances; Characteristic Priority emphasizes sketch-based features by highlighting the TopK indices in feature channels; and Dense Tuning refines contour details within instance-related regions of the attention map. Leveraging the controllability of T$^3$-S2S, we also introduce a feature-sharing strategy with dual prompt sets to generate layer-aware isometric and terrain-view representations for the terrain layout. Experiments show that our sketch-to-scene workflow consistently produces multi-instance 2D scenes with details aligned with the input prompts.
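To make the idea of highlighting TopK indices in feature channels more concrete, the following is a minimal sketch of one possible reading, not the authors' implementation. It assumes that channels are ranked by mean absolute activation and that the selected channels are scaled by a constant; the function name `emphasize_topk_channels`, the ranking criterion, and the `gain` parameter are all hypothetical and do not come from the paper.

```python
from typing import List, Tuple


def emphasize_topk_channels(
    features: List[List[float]], k: int, gain: float = 1.5
) -> Tuple[List[List[float]], List[int]]:
    """Scale the k channels with the largest mean absolute activation.

    `features` holds one flat, non-empty list of activations per channel.
    The ranking criterion and the constant gain are illustrative assumptions.
    """
    if not 0 <= k <= len(features):
        raise ValueError("k must be between 0 and the number of channels")
    # Score each channel by its mean absolute activation (assumed criterion).
    scores = [sum(abs(v) for v in ch) / len(ch) for ch in features]
    # Indices of the k highest-scoring channels, strongest first.
    topk = sorted(range(len(features)), key=lambda i: scores[i], reverse=True)[:k]
    selected = set(topk)
    # Boost the selected channels and copy the others unchanged.
    boosted = [
        [v * gain for v in ch] if i in selected else list(ch)
        for i, ch in enumerate(features)
    ]
    return boosted, topk


# Toy example: four channels with two activations each.
toy = [[0.1, -0.2], [2.0, 1.0], [0.0, 0.5], [-3.0, 1.0]]
boosted, topk = emphasize_topk_channels(toy, k=2, gain=2.0)
print(topk)     # indices of the emphasized channels
print(boosted)  # features after emphasis
```

In a real pipeline the same selection would be applied to ControlNet feature tensors rather than Python lists, and the choice of K and of the emphasis rule would follow the paper's method section rather than the placeholders used here.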
