Composing Concepts from Images and Videos via Concept-prompt Binding
Xianghao Kong, Zeyu Zhang, Yuwei Guo, Zhuoran Zhao, Songchun Zhang, Anyi Rao
TL;DR
Bind & Compose (BiCo) composes concepts from images and videos in a one-shot setting: it binds each visual concept to its corresponding prompt tokens, then assembles bound tokens from multiple sources into a final prompt. It introduces a hierarchical binder structure for cross-attention conditioning in a diffusion transformer, a Diversify-and-Absorb Mechanism to improve binding accuracy, and a Temporal Disentanglement Strategy to make image and video concepts compatible. In quantitative and qualitative evaluations, BiCo outperforms baselines in concept consistency, prompt fidelity, and motion quality, and it supports non-object concepts and cross-source composition. The approach broadens creative options for visual content generation and editing by enabling flexible, multi-source concept integration.
Abstract
Visual concept composition, which aims to integrate different elements from images and videos into a single, coherent visual output, still falls short in accurately extracting complex concepts from visual inputs and flexibly combining concepts from both images and videos. We introduce Bind & Compose, a one-shot method that enables flexible visual concept composition by binding visual concepts with corresponding prompt tokens and composing the target prompt with bound tokens from various sources. It adopts a hierarchical binder structure for cross-attention conditioning in Diffusion Transformers to encode visual concepts into corresponding prompt tokens for accurate decomposition of complex visual concepts. To improve concept-token binding accuracy, we design a Diversify-and-Absorb Mechanism that uses an extra absorbent token to eliminate the impact of concept-irrelevant details when training with diversified prompts. To enhance the compatibility between image and video concepts, we present a Temporal Disentanglement Strategy that decouples the training process of video concepts into two stages with a dual-branch binder structure for temporal modeling. Evaluations demonstrate that our method achieves superior concept consistency, prompt fidelity, and motion quality over existing approaches, opening up new possibilities for visual creativity.
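The composition step described above (assembling a target prompt from tokens bound on different sources) can be illustrated with a toy sketch. This is not the authors' implementation: the function `compose_prompt`, the source labels `image_1` and `video_1`, and the two-dimensional vectors are all hypothetical, and the binders are assumed to have already produced one updated embedding per bound token.

```python
from typing import Dict, List, Optional, Tuple

Vector = List[float]


def compose_prompt(
    target: List[Tuple[str, Optional[str]]],
    bound: Dict[str, Dict[str, Vector]],
    base: Dict[str, Vector],
) -> List[Vector]:
    """Build the conditioning sequence for a target prompt.

    target: (token, source) pairs; source is None for unbound tokens.
    bound:  per-source token embeddings (assumed outputs of trained binders).
    base:   plain text-encoder embeddings used as the fallback.
    """
    composed: List[Vector] = []
    for token, source in target:
        if source is not None and token in bound.get(source, {}):
            # Use the embedding bound to this token on the chosen source.
            composed.append(bound[source][token])
        else:
            # No binding requested or available: keep the base embedding.
            composed.append(base[token])
    return composed


# Hypothetical toy data: "dog" bound on an image, "dancing" bound on a video.
base = {"a": [0.0, 0.0], "dog": [1.0, 0.0], "dancing": [0.0, 1.0]}
bound = {
    "image_1": {"dog": [1.2, 0.1]},
    "video_1": {"dancing": [0.1, 1.3]},
}
target = [("a", None), ("dog", "image_1"), ("dancing", "video_1")]
composed = compose_prompt(target, bound, base)
```

In this sketch the appearance concept comes from the image source and the motion concept from the video source, while the unbound token falls back to its base embedding. How the binders produce the bound embeddings is outside what the sketch covers.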
