Learning Disentangled Identifiers for Action-Customized Text-to-Image Generation

Siteng Huang, Biao Gong, Yutong Feng, Xi Chen, Yuqian Fu, Yu Liu, Donglin Wang

TL;DR

The paper tackles action customization in text-to-image generation by learning action-specific identifiers that disentangle actions from appearance. It introduces Action-Disentangled Identifier (ADI), which expands the semantic conditioning space with layer-wise identifier tokens and applies gradient masking, derived from context-different and action-different sample pairs, to keep action-agnostic features out of the learned action representations. It also proposes ActionBench, a benchmark for evaluating action fidelity and subject consistency across diverse actions and unseen subjects, including animals. Quantitative and qualitative results show that ADI reproduces the target actions more accurately than existing baselines while preserving the appearance of the new subject.

Abstract

This study focuses on a novel task in text-to-image (T2I) generation, namely action customization. The objective of this task is to learn the co-existing action from limited data and generalize it to unseen humans or even animals. Experimental results show that existing subject-driven customization methods fail to learn the representative characteristics of actions and struggle to decouple actions from context features, including appearance. To overcome the preference for low-level features and the entanglement of high-level features, we propose an inversion-based method, Action-Disentangled Identifier (ADI), to learn action-specific identifiers from the exemplar images. ADI first expands the semantic conditioning space by introducing layer-wise identifier tokens, thereby increasing the representational richness while distributing the inversion across different features. Then, to block the inversion of action-agnostic features, ADI extracts the gradient invariance from the constructed sample triples and masks the updates of irrelevant channels. To comprehensively evaluate the task, we present ActionBench, which includes a variety of actions, each accompanied by meticulously selected samples. Both quantitative and qualitative results show that our ADI outperforms existing baselines in action-customized T2I generation. Our project page is at https://adi-t2i.github.io/ADI.
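
The gradient-masking step described in the abstract can be illustrated with a small sketch. It is one plausible reading and not the authors' implementation: the function names (`quantile_threshold`, `action_mask`, `masked_update`), the nearest-rank quantile threshold, the default `q`, and the choice to merge the two pair-wise masks by union are all assumptions made for illustration. The triple is assumed to consist of an anchor sample, a context-different sample (same action, different context), and an action-different sample (same context, different action), each contributing a per-channel gradient for the identifier token.

```python
from typing import List


def quantile_threshold(values: List[float], q: float) -> float:
    """Nearest-rank value at fraction q (0..1) of the sorted list."""
    ordered = sorted(values)
    idx = min(len(ordered) - 1, max(0, int(q * len(ordered))))
    return ordered[idx]


def action_mask(
    g_anchor: List[float],
    g_same_action: List[float],   # gradient from the context-different sample
    g_same_context: List[float],  # gradient from the action-different sample
    q: float = 0.5,               # assumed quantile, not taken from the paper
) -> List[float]:
    """Return a 0/1 mask over channels that look action-related."""
    # Same action, different context: channels whose gradients barely
    # change are treated as action-related (the action is the invariant).
    diff_ctx = [abs(a - b) for a, b in zip(g_anchor, g_same_action)]
    # Same context, different action: channels whose gradients change
    # a lot are treated as action-related.
    diff_act = [abs(a - b) for a, b in zip(g_anchor, g_same_context)]

    t_ctx = quantile_threshold(diff_ctx, q)
    t_act = quantile_threshold(diff_act, q)

    keep_ctx = [d < t_ctx for d in diff_ctx]
    keep_act = [d >= t_act for d in diff_act]

    # Assumption: a channel is kept if either pair marks it as action-related.
    return [1.0 if (c or a) else 0.0 for c, a in zip(keep_ctx, keep_act)]


def masked_update(
    token: List[float], g_anchor: List[float], mask: List[float], lr: float = 0.1
) -> List[float]:
    """One gradient step on the identifier token; masked channels stay fixed."""
    return [t - lr * m * g for t, g, m in zip(token, g_anchor, mask)]
```

In this sketch only the anchor gradient drives the update; the other two gradients are used solely to decide which channels may change, so channels judged action-agnostic keep their current values.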

Paper Structure (23 sections, 9 equations, 13 figures, 1 table)

This paper contains 23 sections, 9 equations, 13 figures, 1 table.

Figures (13)

  • Figure 1: Action customization results of our ADI method. By inverting representative action-related features, the learned identifiers "<A>" can be paired with a variety of characters and animals to contribute to the generation of accurate, diverse and high-quality images.
  • Figure 2: Action customization results of existing subject-driven customization methods. Because these methods tend to search for low-level invariants even when asked to learn high-level action features, some fail to generate the specified actions, while others give the animals human appearances.
  • Figure 3: Overview of our ADI method. ADI learns more efficient action identifiers by extending the semantic conditioning space and masking gradient updates to action-agnostic channels.
  • Figure 4: Visual comparisons of all methods. For each action, we present the generated results showcasing its pairing with a human character and an animal.
  • Figure 5: Ablation study. We remove or revise one implementation at a time to demonstrate the effects of the identifier extension and the gradient masking.
  • ...and 8 more figures