BRAT: Bonus oRthogonAl Token for Architecture Agnostic Textual Inversion

James Baker

TL;DR

BRAT addresses the UNet-centric design of existing textual inversion methods with a token-based, architecture-agnostic approach that learns orthogonal bonus tokens to complement the standard placeholder (subject) token. It combines the baseline LDM objective with a cosine-regularization term that enforces orthogonality between the placeholder token and the bonus token, and it extends to multiple bonus tokens. The method is evaluated on subject and style personalization with both UNet and vision-transformer (ViT) denoisers, using text-encoder adapters to enable cross-architecture operation. Results show that BRAT improves adherence to the source images and that the ViT-based denoiser improves prompt adherence, revealing a Pareto frontier between content fidelity and prompt similarity. The work points to broader applicability of textual inversion across denoisers and to the potential of adapters for using larger text encoders efficiently.

Abstract

Textual inversion remains a popular method for personalizing diffusion models by teaching them new subjects and styles. We note that textual inversion has been underexplored with alternatives to the UNet, and we experiment with textual inversion using a vision transformer. We also seek to optimize textual inversion with a strategy that does not require explicit use of the UNet and its idiosyncratic layers, so we add bonus tokens and enforce orthogonality. We find that the bonus token improves adherence to the source images and that the vision transformer improves adherence to the prompt. Code is available at https://github.com/jamesBaker361/tex_inv_plus.
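
The orthogonality idea summarized above can be illustrated with a small sketch. It is not taken from the paper's repository: the function names, the use of the absolute cosine similarity, the weight `lam`, and the choice to penalize only placeholder-to-bonus pairs are all assumptions made for illustration. The sketch treats the LDM denoising loss as an already-computed number and adds a cosine penalty that is zero when every bonus token embedding is orthogonal to the placeholder embedding.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two non-zero vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def orthogonality_penalty(placeholder, bonus_tokens):
    """Sum of |cos| between the placeholder and each bonus token.

    Assumption: the absolute value is used so that the penalty is
    minimized (at zero) exactly when the vectors are orthogonal.
    """
    return sum(abs(cosine_similarity(placeholder, b)) for b in bonus_tokens)


def total_loss(ldm_loss, placeholder, bonus_tokens, lam=0.1):
    """Baseline loss plus a weighted cosine regularizer.

    `lam` is a hypothetical weighting hyperparameter, not a value
    reported in the source.
    """
    return ldm_loss + lam * orthogonality_penalty(placeholder, bonus_tokens)


if __name__ == "__main__":
    placeholder = [1.0, 0.0]
    bonus = [[0.0, 1.0], [2.0, 0.0]]  # first is orthogonal, second is parallel
    print(total_loss(0.5, placeholder, bonus))
```

In an actual training loop the embeddings would be learnable tensors and the penalty would be differentiated along with the denoising loss; the plain-list version here only shows the arithmetic of the regularizer.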

Paper Structure (26 sections, 3 equations, 9 figures, 8 tables)

Figures (9)

  • Figure 1: Subject Images, generated with caption "a photo of {} wearing sunglasses"
  • Figure 2: Style Images, generated with caption "a person with a city in the background, art by {}"
  • Figure 3: Source Images, labeled by their DeviantArt usernames
  • Figure 4: Images generated with the prompt "a photo of {} eating a burger"
  • Figure 5: Images generated with the prompt "a person with a mountain in the background, art by {}"
  • ...and 4 more figures