BRAT: Bonus oRthogonAl Token for Architecture Agnostic Textual Inversion
James Baker
TL;DR
BRAT addresses the limitation of UNet-centric textual inversion by introducing a token-based, architecture-agnostic approach that learns orthogonal bonus tokens to complement the standard subject token. It combines a baseline LDM objective with a cosine-regularization term that enforces orthogonality between the placeholder token and the bonus token, and extends to multiple bonus tokens. The method is validated on subject and style personalization using both UNet and vision-transformer denoisers, augmented with text-encoder adapters to enable cross-architecture operation. Results show that BRAT improves adherence to the source images and that the ViT-based denoiser improves prompt adherence, revealing a Pareto frontier between content fidelity and prompt similarity. The work points to broader applicability of textual inversion across denoisers and to the potential of adapters for leveraging larger text encoders efficiently.
Abstract
Textual Inversion remains a popular method for personalizing diffusion models by teaching them new subjects and styles. We note that textual inversion has been underexplored with alternatives to the UNet, and we experiment with textual inversion using a vision transformer. We also seek to optimize textual inversion with a strategy that does not require explicit use of the UNet and its idiosyncratic layers, so we add bonus tokens and enforce orthogonality. We find that the bonus token improves adherence to the source images and that the vision transformer improves adherence to the prompt. Code is available at https://github.com/jamesBaker361/tex_inv_plus.
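
To make the orthogonality idea concrete, the following is a minimal sketch of a cosine-based penalty added to a denoising loss. It is an illustration under stated assumptions, not the implementation in the linked repository: the function names, the use of the absolute cosine value, the pairwise treatment of multiple bonus tokens, and the `weight` value are all hypothetical choices made for this example.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def orthogonality_penalty(placeholder, bonus_tokens):
    """Sum of |cosine similarity| over every pair of token embeddings.

    Assumption: bonus tokens are penalized against the placeholder and
    against each other; the penalty is zero when all are orthogonal.
    """
    tokens = [placeholder] + list(bonus_tokens)
    total = 0.0
    for i in range(len(tokens)):
        for j in range(i + 1, len(tokens)):
            total += abs(cosine_similarity(tokens[i], tokens[j]))
    return total

def total_loss(ldm_loss, placeholder, bonus_tokens, weight=0.1):
    """Baseline denoising loss plus the weighted orthogonality penalty.

    `ldm_loss` stands in for the usual latent-diffusion objective and
    `weight` is a hypothetical regularization strength.
    """
    return ldm_loss + weight * orthogonality_penalty(placeholder, bonus_tokens)
```

In this sketch the penalty depends only on the token embeddings, so it makes no reference to the denoiser's layers; that is the sense in which such a regularizer can be applied unchanged to a UNet or a vision transformer.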
