Padding Tone: A Mechanistic Analysis of Padding Tokens in T2I Models

Michael Toker, Ido Galil, Hadas Orgad, Rinon Gal, Yoad Tewel, Gal Chechik, Yonatan Belinkov

TL;DR

Padding tokens in text-to-image diffusion models can influence generation when encoded alongside prompts. The authors develop two causal techniques, ITE and IDP, that intervene on text-encoder and diffusion representations, respectively, to quantify pad-token contributions. They find that in models with a frozen text encoder, padding tokens are largely ignored, whereas in models with a trained text encoder, pads can carry semantic information. In multi-modal self-attention architectures, pads can also act as memory-like registers during diffusion. The work has practical implications for training and deploying T2I systems, suggesting that padding-aware design, training, and data preprocessing may be warranted.

Abstract

Text-to-image (T2I) diffusion models rely on encoded prompts to guide the image generation process. Typically, these prompts are extended to a fixed length by adding padding tokens before text encoding. Despite being a default practice, the influence of padding tokens on the image generation process has not been investigated. In this work, we conduct the first in-depth analysis of the role padding tokens play in T2I models. We develop two causal techniques to analyze how information is encoded in the representation of tokens across different components of the T2I pipeline. Using these techniques, we investigate when and how padding tokens impact the image generation process. Our findings reveal three distinct scenarios: padding tokens may affect the model's output during text encoding, during the diffusion process, or be effectively ignored. Moreover, we identify key relationships between these scenarios and the model's architecture (cross or self-attention) and its training process (frozen or trained text encoder). These insights contribute to a deeper understanding of the mechanisms of padding tokens, potentially informing future model design and training practices in T2I systems.

Paper Structure (26 sections, 5 equations, 12 figures, 5 tables)

This paper contains 26 sections, 5 equations, 12 figures, 5 tables.

Figures (12)

  • Figure 1: Images generated with FLUX from different segments of the input prompt. Description of each column, from left to right: (1) An image generated using the full prompt (both prompt tokens and padding tokens encoded together), (2) An image generated using only the prompt tokens and clean padding tokens, (3) An image generated using only the prompt-contextual pads encoded with the prompt, while the prompt tokens were replaced with clean pad tokens.
  • Figure 2: The scenarios we observe: padding tokens may be effectively ignored (first row; image generated using ITE), affect the model's output during text encoding (second row; image generated using ITE), or be used during the diffusion process (last row; image generated using IDP). Left: baseline. Right: our method.
  • Figure 3: ITE: Interpreting information within pad tokens in the text encoder. We first encode the full prompt and the clean pads separately. Next, we keep the tokens we want to interpret and replace all other tokens with clean pad tokens. We then generate an image conditioned on this mixed representation. In the example shown here, we interpret the pad tokens in LLaMA-UNet, revealing semantic information embedded within the pad tokens.
  • Figure 4: Images generated from different segments of the input prompt using ITE. Description of each column, from left to right: (1) An image generated using the full prompt (both prompt tokens and padding tokens encoded together), (2) An image generated using only the prompt tokens and clean padding tokens, (3) An image generated using only the prompt-contextual pads encoded with the prompt, while the prompt tokens were replaced with clean pad tokens.
  • Figure 5: Average CLIP score over 5,000 images generated from the different representations using ITE: full prompt, only prompt, prompt-contextual pads, and clean pads. LDM and LLaMA-UNet are the only models achieving high CLIP scores for images generated from padding tokens, indicating their use during text encoding. See the standard-deviation table in the Appendix (label app:tab:main_std).
  • ...and 7 more figures
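
The ITE mixing step described in the Figure 3 caption (encode the full prompt and the clean pads separately, keep the tokens of interest, replace all others with clean pad tokens) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `ite_mix`, the toy two-dimensional vectors, and the position lists are all hypothetical stand-ins. In a real pipeline the two encodings would come from the model's text encoder, and the mixed result would be passed to the diffusion model as conditioning.

```python
from typing import List, Sequence

Vector = List[float]


def ite_mix(
    full_encoding: Sequence[Vector],
    clean_pad_encoding: Sequence[Vector],
    keep_positions: Sequence[int],
) -> List[Vector]:
    """Keep the chosen positions from the full-prompt encoding and fill
    every other position with the matching clean-pad representation.

    Hypothetical helper for illustration only.
    """
    if len(full_encoding) != len(clean_pad_encoding):
        raise ValueError("both encodings must have the same sequence length")
    keep = set(keep_positions)
    return [
        list(full_encoding[i]) if i in keep else list(clean_pad_encoding[i])
        for i in range(len(full_encoding))
    ]


# Toy stand-ins for text-encoder outputs (assumed values, not real data):
# 6 positions with 2-dimensional vectors. Positions 0-2 hold prompt tokens,
# positions 3-5 hold padding tokens encoded together with the prompt.
full = [[1.0, 1.0], [2.0, 2.0], [3.0, 3.0], [0.4, 0.1], [0.5, 0.2], [0.6, 0.3]]

# Clean pads: what the encoder would produce for a pads-only input.
clean = [[0.0, 0.0] for _ in range(6)]

prompt_positions = [0, 1, 2]
pad_positions = [3, 4, 5]

# "Only prompt": prompt tokens kept, pads replaced by clean pads.
only_prompt = ite_mix(full, clean, prompt_positions)

# "Prompt-contextual pads": pads kept, prompt tokens replaced by clean pads.
only_pads = ite_mix(full, clean, pad_positions)
```

The two mixed sequences correspond to the second and third columns described in the captions of Figures 1 and 4. If an image generated from `only_pads` still reflects the prompt, that suggests the padding positions absorbed semantic information during text encoding.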