Composing Concepts from Images and Videos via Concept-prompt Binding

Xianghao Kong, Zeyu Zhang, Yuwei Guo, Zhuoran Zhao, Songchun Zhang, Anyi Rao

TL;DR

BiCo (Bind & Compose) composes concepts from images and videos in a one-shot setting: it binds each visual concept to its corresponding prompt tokens, then assembles bound tokens from multiple sources into the final prompt. It introduces a hierarchical binder structure for cross-attention conditioning in a Diffusion Transformer, a Diversify-and-Absorb Mechanism that improves binding accuracy, and a Temporal Disentanglement Strategy that makes image and video concepts compatible. In quantitative and qualitative evaluations, BiCo outperforms baselines in concept consistency, prompt fidelity, and motion quality, and it supports non-object concepts and cross-source composition. This flexible, multi-source concept integration broadens what is possible in visual content generation and editing.

Abstract

Visual concept composition, which aims to integrate different elements from images and videos into a single, coherent visual output, still falls short in accurately extracting complex concepts from visual inputs and flexibly combining concepts from both images and videos. We introduce Bind & Compose, a one-shot method that enables flexible visual concept composition by binding visual concepts with corresponding prompt tokens and composing the target prompt with bound tokens from various sources. It adopts a hierarchical binder structure for cross-attention conditioning in Diffusion Transformers to encode visual concepts into corresponding prompt tokens for accurate decomposition of complex visual concepts. To improve concept-token binding accuracy, we design a Diversify-and-Absorb Mechanism that uses an extra absorbent token to eliminate the impact of concept-irrelevant details when training with diversified prompts. To enhance the compatibility between image and video concepts, we present a Temporal Disentanglement Strategy that decouples the training process of video concepts into two stages with a dual-branch binder structure for temporal modeling. Evaluations demonstrate that our method achieves superior concept consistency, prompt fidelity, and motion quality over existing approaches, opening up new possibilities for visual creativity.
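The "compose" step described above, where the target prompt is assembled from tokens bound to different sources, can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the names `compose_prompt` and `make_offset_binder` are hypothetical, the toy binder is a simple additive offset rather than a trained network, and the embeddings are two-dimensional placeholders. It is not the authors' implementation.

```python
from typing import Callable, Dict, List, Optional

Vector = List[float]
Binder = Callable[[Vector], Vector]


def make_offset_binder(offset: float) -> Binder:
    """Hypothetical stand-in for a trained binder: adds a fixed offset to each value."""
    return lambda x: [v + offset for v in x]


def compose_prompt(
    token_embeddings: List[Vector],
    token_sources: List[Optional[str]],
    binders: Dict[str, Binder],
) -> List[Vector]:
    """Route each prompt token through the binder of the source it is bound to.

    Tokens whose source is None are not bound to any concept and pass through unchanged.
    """
    if len(token_embeddings) != len(token_sources):
        raise ValueError("one source label is required per token")
    composed = []
    for emb, src in zip(token_embeddings, token_sources):
        composed.append(binders[src](emb) if src is not None else list(emb))
    return composed


# Illustrative target prompt: "a dog playing guitar", with "dog" bound from an
# image and "playing guitar" bound from a video. All values are made up.
embeddings = [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0], [3.0, 3.0]]
sources = [None, "image_A", "video_B", "video_B"]
binders = {"image_A": make_offset_binder(0.5), "video_B": make_offset_binder(-1.0)}

composed = compose_prompt(embeddings, sources, binders)
```

The point of the sketch is the routing: each token carries the identity of the source whose concept it was bound to, so tokens from an image and tokens from a video can coexist in one conditioning prompt.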

Paper Structure

This paper contains 24 sections, 3 equations, 11 figures, 3 tables.

Figures (11)

  • Figure 1: Illustration of BiCo, a one-shot method that enables flexible visual concept composition by binding visual concepts with the corresponding prompt tokens and composing the target prompt with bound tokens from various sources.
  • Figure 2: Overview of BiCo. BiCo first uses a binder structure to bind visual concepts to their corresponding prompt tokens. It then composes different concepts by passing each concept's prompt tokens through the matching binder, and the updated prompt serves as the condition.
  • Figure 3: Hierarchical Binder Structure. It consists of global and per-block binders, where each binder contains an MLP with residual connections. For video inputs, a dual-branch binder structure with spatial and temporal MLPs is incorporated to better address temporal concepts.
  • Figure 4: Prompt Diversification. The VLM extracts key spatial and temporal concepts from the visual input, and then composes them into diverse spatial-only or spatiotemporal prompts.
  • Figure 5: Qualitative Comparisons with Previous Methods. The input visual concepts and composed prompts are on the left.
  • ...and 6 more figures
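
The hierarchical binder described in the Figure 3 caption can be sketched in plain Python as follows. This is a rough illustration under stated assumptions, not the paper's code: the class names are hypothetical, the order (global binder first, then the per-block binder) is assumed, and the way the spatial and temporal branches are combined (a sum weighted by a scalar `gate`) is also an assumption, since the caption does not specify it.

```python
from typing import Callable, List

Vector = List[float]
Matrix = List[List[float]]


def matvec(w: Matrix, x: Vector) -> Vector:
    """Multiply matrix w by vector x."""
    return [sum(wi * xi for wi, xi in zip(row, x)) for row in w]


def relu(x: Vector) -> Vector:
    return [max(0.0, v) for v in x]


class ResidualMLP:
    """Two-layer MLP with a skip connection: x + W_out * relu(W_in * x)."""

    def __init__(self, w_in: Matrix, w_out: Matrix) -> None:
        self.w_in, self.w_out = w_in, w_out

    def delta(self, x: Vector) -> Vector:
        """The update added to x by the residual connection."""
        return matvec(self.w_out, relu(matvec(self.w_in, x)))

    def __call__(self, x: Vector) -> Vector:
        return [a + b for a, b in zip(x, self.delta(x))]


class DualBranchBinder:
    """Video binder with spatial and temporal MLPs (gated sum is an assumption)."""

    def __init__(self, spatial: ResidualMLP, temporal: ResidualMLP, gate: float = 1.0) -> None:
        self.spatial, self.temporal, self.gate = spatial, temporal, gate

    def __call__(self, x: Vector) -> Vector:
        s, t = self.spatial.delta(x), self.temporal.delta(x)
        return [xi + si + self.gate * ti for xi, si, ti in zip(x, s, t)]


class HierarchicalBinder:
    """Applies a shared global binder, then the binder of the given block (assumed order)."""

    def __init__(self, global_binder: Callable[[Vector], Vector],
                 block_binders: List[Callable[[Vector], Vector]]) -> None:
        self.global_binder, self.block_binders = global_binder, block_binders

    def __call__(self, token: Vector, block_index: int) -> Vector:
        return self.block_binders[block_index](self.global_binder(token))


# Toy weights chosen so the arithmetic is easy to follow by hand.
identity = [[1.0, 0.0], [0.0, 1.0]]
zeros = [[0.0, 0.0], [0.0, 0.0]]
binder = HierarchicalBinder(
    global_binder=ResidualMLP(identity, identity),
    block_binders=[
        ResidualMLP(identity, zeros),  # zero output weights: acts as the identity
        DualBranchBinder(
            spatial=ResidualMLP(identity, identity),
            temporal=ResidualMLP(identity, [[0.0, 0.0], [1.0, 0.0]]),
            gate=0.5,
        ),
    ],
)

token = [1.0, -2.0]
out_block0 = binder(token, 0)
out_block1 = binder(token, 1)
```

In this sketch, every block of the diffusion transformer would receive its own version of the same prompt token: the global binder applies one shared update, and each per-block binder refines it for that block's cross-attention.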