Silent Tokens, Loud Effects: Padding in LLMs
Rom Himelstein, Amit LeVi, Yonatan Belinkov, Avi Mendelson
TL;DR
Padding tokens are intended to be inert in batched LLM inference, but when mishandled they can influence hidden representations, generation quality, bias, and safety. The authors systematically evaluate this effect by prepending pad tokens to inputs across 10 models in the Llama, Gemma, and Qwen families, setting the attention mask explicitly so that the pads are treated as ordinary input tokens. They measure activation drift, generation quality (BLEU and BERTScore), bias (BBQ), and safety (attack success rate, ASR). They find that even small amounts of padding cause activation drift and degrade generation quality in smaller models, induce context-dependent bias shifts, and weaken safety guardrails, with larger amounts of padding enabling more harmful generations. These results argue for strict padding handling in deployment and suggest directions for padding-robust training.
Abstract
Padding tokens are widely used in large language models (LLMs) to equalize sequence lengths during batched inference. While they should be fully masked, implementation errors can cause them to influence computation, and the extent of this influence is not well understood. We systematically study this effect across three open-source model families (Llama, Gemma, Qwen), inserting controlled amounts of padding and evaluating outcomes along four axes: activations, generation quality, bias, and safety. Even small amounts of padding shift hidden representations, degrade quality in smaller models, alter bias in unpredictable ways, and weaken safety guardrails. These findings demonstrate that padding is not a harmless detail but a robustness risk that must be carefully handled in deployment.
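To make the masking failure concrete, here is a minimal sketch of a single attention step in pure Python. It is a toy illustration, not the authors' experimental setup: the two-dimensional embeddings, the `PAD` vector, and the helper names (`attend`, `run`, `cosine_distance`) are all hypothetical. It compares the output for the last real token when prepended pads are masked out versus when they are attended to, using cosine distance as a stand-in for activation drift.

```python
import math

def softmax(scores):
    # Numerically stable softmax; exp(-inf) evaluates to 0.0 for masked slots.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(query, keys, values, mask):
    """Single-query scaled dot-product attention; mask[i] == 0 hides position i."""
    scale = math.sqrt(len(query))
    scores = []
    for key, visible in zip(keys, mask):
        score = sum(q * k for q, k in zip(query, key)) / scale
        scores.append(score if visible else float("-inf"))
    weights = softmax(scores)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]

def cosine_distance(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return 1.0 - dot / (norm_a * norm_b)

# Hypothetical toy embeddings (assumptions for illustration only).
PAD = [0.5, -1.0]
REAL = [[1.0, 0.0], [0.0, 1.0]]

def run(n_pad, mask_pads):
    """Prepend n_pad pad tokens and attend from the last real token."""
    tokens = [PAD] * n_pad + REAL
    pad_visibility = 0 if mask_pads else 1
    mask = [pad_visibility] * n_pad + [1] * len(REAL)
    return attend(REAL[-1], tokens, tokens, mask)

baseline = run(0, mask_pads=True)

# Correct handling: pads are masked, so the output matches the unpadded run.
drift_masked = cosine_distance(baseline, run(4, mask_pads=True))

# Mishandled padding: pads are attended to, so the output shifts.
drift_unmasked_1 = cosine_distance(baseline, run(1, mask_pads=False))
drift_unmasked_4 = cosine_distance(baseline, run(4, mask_pads=False))

print(f"masked pads (4):   drift = {drift_masked:.6f}")
print(f"unmasked pads (1): drift = {drift_unmasked_1:.6f}")
print(f"unmasked pads (4): drift = {drift_unmasked_4:.6f}")
```

In this toy setting, masked pads leave the output unchanged, while unmasked pads shift it, and more pads shift it further. Real models stack many layers and heads, so the magnitudes here say nothing about the effect sizes reported in the paper.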
