Induced Numerical Instability: Hidden Costs in Multimodal Large Language Models

Wai Tuck Wong, Jun Sun, Arunesh Sinha

TL;DR

This work studies a novel mode of failure that causes degradation in performance indirectly by optimizing a loss term that seeks to maximize numerical instability in the inference stage of these models.

Abstract

The use of multimodal large language models has become widespread, and as such the study of these models and their failure points has become of utmost importance. We study a novel mode of failure that causes degradation in performance indirectly by optimizing a loss term that seeks to maximize numerical instability in the inference stage of these models. We apply this loss term as the optimization target to construct images that, when used on multimodal large language models, cause significant degradation in the output. We validate our hypothesis on state-of-the-art large vision language models (LLaVA-v1.5-7B, Idefics3-8B, SmolVLM-2B-Instruct) against standard datasets (Flickr30k, MMVet, TextVQA, VQAv2, POPE, COCO) and show that performance degrades significantly compared to baselines, even with a very small change to the input image. Our results uncover a fundamentally different vector of performance degradation, highlighting a failure mode not captured by adversarial perturbations.

Paper Structure (20 sections, 1 theorem, 20 equations, 9 figures, 4 tables)

Key Result

Lemma 3.1

Let $f:\mathbb{R}^d\to\mathbb{R}^k$ be locally Lipschitz on an open set $U \subset \mathbb{R}^d$ with Lipschitz constant $L$. Assume IEEE 754 style rounding as stated above, generalized to multiple dimensions as $(\operatorname{fl}(x))_j = x_j (1 + \eta_j)$ with $|\eta_j| \le \epsilon$ for $x = (x_1, \dots, x_d)$. [...] where the $O(\epsilon^2)$ term is uniform on $U$ (i.e., does not depend on $x$).
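
The componentwise rounding model assumed in the lemma can be checked numerically. The sketch below is not taken from the paper: it assumes NumPy, takes $\epsilon$ to be the $\mathrm{float16}$ unit roundoff $2^{-11}$, and restricts inputs to the $\mathrm{float16}$ normal range, where the relative-error model is expected to hold. The helper name `relative_rounding_error` is hypothetical.

```python
import numpy as np

# Unit roundoff for IEEE 754 binary16 under round-to-nearest: 2**-11.
# (np.finfo(np.float16).eps is the machine epsilon 2**-10, i.e. twice this.)
UNIT_ROUNDOFF_FP16 = 2.0 ** -11


def relative_rounding_error(x: np.ndarray) -> np.ndarray:
    """Return |eta_j| where fl(x)_j = x_j * (1 + eta_j) and fl casts to float16."""
    x = np.asarray(x, dtype=np.float64)
    fl_x = x.astype(np.float16).astype(np.float64)
    return np.abs(fl_x - x) / np.abs(x)


rng = np.random.default_rng(0)

# Assumption: sample magnitudes log-uniformly inside the float16 normal range
# (smallest normal is 2**-14, largest finite is 65504), so that neither
# underflow to subnormals nor overflow to infinity can occur.
exponents = rng.uniform(-14.0, 15.0, size=10_000)
x = np.exp2(exponents)

eta = relative_rounding_error(x)
print("max |eta|     :", eta.max())
print("unit roundoff :", UNIT_ROUNDOFF_FP16)
```

Outside the normal range (subnormals or values that overflow), the relative-error form $x_j(1+\eta_j)$ no longer applies, which is why the sampling range is restricted here.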

Figures (9)

  • Figure 1: We sample 300 input pairs, drawn uniformly from logarithmic space for multiplication and from linear space for addition, ranging from the minimum to the maximum positive value of a half-precision ($\mathrm{float16}$) number. We compare the absolute differences between performing each operation in full and in half precision. Numerical imprecision generally increases with the input values for multiplication, meaning larger input values yield larger numerical errors.
  • Figure 2: A two-layer $\tanh$ network illustrating how a small input perturbation $\delta$ is amplified at the output due to asymmetric activation saturation. When $x_1 = x_2 = 0$ and $x_2$ is perturbed slightly by $\delta$, the output becomes $y = -2\tanh(4\delta)$. Since $\tanh(z) \approx z$ for small $z$, $y \approx -8\delta$, so the output change is roughly 8x the magnitude of the input perturbation. This local sensitivity is inherent to the network and cannot be resolved by increasing floating-point precision.
  • Figure 3: Qualitative comparison of different perturbation types and their corresponding activation responses. The first row shows input images under four settings—clean, FGSM, PGD, and numerically unstable (ours). The second row presents activation maps extracted from the first vision encoder layer of LLaVA-v1.5-7B. While adversarial perturbations (FGSM, PGD) introduce localized distortions, our numerically unstable input induces diffuse and misaligned attention, indicating degradation through a fundamentally different mechanism.
  • Figure 4: Examples of outputs under clean and numerically unstable inputs ($\varepsilon=16/255$) for the Idefics3-8B model. Each column shows the perturbed input image, the associated question, and corresponding responses. The perturbed responses deviate semantically from both the ground truth and the model’s original answers on the clean image, despite nearly identical inputs. Additional examples for other models can be found in Appendix \ref{sec:additional_attack_outputs}.
  • Figure 5: Implementation-level numerical error on MSCOCO for Idefics3-8B. We plot accumulated absolute differences between $\mathrm{float32}$ and $\mathrm{float16}$ forward passes (summed over operations and samples) versus training epoch (optimizing Eqn. \ref{eq:approx_loss}). Numerical error increases over epochs under our proxy loss.
  • ...and 4 more figures
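
The comparison described in the Figure 1 caption can be approximated with a short NumPy script. This is a sketch under stated assumptions rather than the authors' code: $\mathrm{float32}$ stands in for "full precision", the operand ranges are capped (an assumption made here) so that results cannot overflow $\mathrm{float16}$, and `precision_gap` and `rank_corr` are illustrative helper names.

```python
import numpy as np

N_PAIRS = 300  # number of input pairs, as stated in the caption
fp16 = np.finfo(np.float16)


def precision_gap(op, a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Absolute difference between float32 and float16 evaluation of op(a, b)."""
    full = op(a.astype(np.float32), b.astype(np.float32))
    half = op(a.astype(np.float16), b.astype(np.float16))
    return np.abs(full.astype(np.float64) - half.astype(np.float64))


def rank_corr(u: np.ndarray, v: np.ndarray) -> float:
    """Rank correlation computed with NumPy only (ties broken arbitrarily)."""
    ru, rv = np.argsort(np.argsort(u)), np.argsort(np.argsort(v))
    return float(np.corrcoef(ru, rv)[0, 1])


rng = np.random.default_rng(0)

# Assumption: cap operands so that a * b and a + b stay finite in float16.
lo = float(fp16.tiny)                      # smallest positive normal float16
mul_hi = np.sqrt(float(fp16.max)) / 2.0    # keeps products well below fp16.max
add_hi = float(fp16.max) / 4.0             # keeps sums well below fp16.max

# Multiplication: operands drawn uniformly in log space.
log_a, log_b = rng.uniform(np.log(lo), np.log(mul_hi), size=(2, N_PAIRS))
mul_a, mul_b = np.exp(log_a), np.exp(log_b)

# Addition: operands drawn uniformly in linear space.
add_a, add_b = rng.uniform(lo, add_hi, size=(2, N_PAIRS))

mul_err = precision_gap(np.multiply, mul_a, mul_b)
add_err = precision_gap(np.add, add_a, add_b)

print("mul: rank corr(|a*b|, error) =", rank_corr(mul_a * mul_b, mul_err))
print("add: max abs error           =", add_err.max())
```

A positive rank correlation for multiplication corresponds to the trend the caption reports: the absolute gap between precisions grows with the magnitude of the operands.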

Theorems & Definitions (2)

  • Lemma 3.1: Forward error bound under floating-point input rounding and result rounding
  • proof