Microsaccade-Inspired Probing: Positional Encoding Perturbations Reveal LLM Misbehaviours

Rui Melo, Rui Abreu, Corina S. Pasareanu

TL;DR

This work introduces Microsaccade-Inspired Probing (MIP), a lightweight, model-agnostic technique that perturbs positional encodings to reveal latent signals of LLM misbehaviour without fine-tuning. By analyzing changes in next-token distributions and attention patterns, MIP detects factuality violations, jailbreaks, toxicity, and backdoors across multiple state-of-the-art LLMs, often achieving near-perfect separability and exposing discriminative signals in mid-to-late layers. Visualizations (PCA, LDA) and head-wise attribution demonstrate that misbehaviour signatures are localized and interpretable within the model's internal representations. The approach offers a practical, efficient diagnostic tool with potential for real-time monitoring and future extensions toward active steering and mitigation of undesirable outputs.

Abstract

We draw inspiration from microsaccades, tiny involuntary eye movements that reveal hidden dynamics of human perception, to propose an analogous probing method for large language models (LLMs). Just as microsaccades expose subtle but informative shifts in vision, we show that lightweight position encoding perturbations elicit latent signals that indicate model misbehaviour. Our method requires no fine-tuning or task-specific supervision, yet detects failures across diverse settings including factuality, safety, toxicity, and backdoor attacks. Experiments on multiple state-of-the-art LLMs demonstrate that these perturbation-based probes surface misbehaviours while remaining computationally efficient. These findings suggest that pretrained LLMs already encode the internal evidence needed to flag their own failures, and that microsaccade-inspired interventions provide a pathway for detecting and mitigating undesirable behaviours.
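
The probing idea can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not specify the perturbation scheme or the model interface, so the toy one-head attention model, the sinusoidal encoding, the uniform position jitter, and every name below (`probe_shift`, `make_toy_params`, `jitter`) are assumptions made for illustration. The sketch nudges token positions slightly, recomputes the next-token distribution, and reports the KL divergence from the unperturbed distribution as a candidate signal.

```python
import numpy as np

def sinusoidal_encoding(positions, dim):
    """Sinusoidal positional encoding; accepts fractional positions (dim must be even)."""
    positions = np.asarray(positions, dtype=float)[:, None]
    freqs = 1.0 / (10000.0 ** (np.arange(0, dim, 2) / dim))
    angles = positions * freqs[None, :]
    enc = np.zeros((positions.shape[0], dim))
    enc[:, 0::2] = np.sin(angles)
    enc[:, 1::2] = np.cos(angles)
    return enc

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def make_toy_params(vocab=50, dim=16, seed=0):
    """Random weights for a toy one-layer, one-head model (hypothetical stand-in for an LLM)."""
    rng = np.random.default_rng(seed)
    emb = rng.normal(size=(vocab, dim))
    w_q, w_k, w_v = (rng.normal(size=(dim, dim)) / np.sqrt(dim) for _ in range(3))
    w_out = rng.normal(size=(dim, vocab)) / np.sqrt(dim)
    return emb, w_q, w_k, w_v, w_out

def next_token_distribution(token_ids, positions, params):
    """Next-token distribution at the last position of the toy model."""
    emb, w_q, w_k, w_v, w_out = params
    x = emb[token_ids] + sinusoidal_encoding(positions, emb.shape[1])
    q = x[-1] @ w_q                      # query from the last token only
    k, v = x @ w_k, x @ w_v
    attn = softmax(q @ k.T / np.sqrt(k.shape[1]))
    return softmax((attn @ v) @ w_out)

def kl_divergence(p, q, eps=1e-12):
    return float(np.sum(p * (np.log(p + eps) - np.log(q + eps))))

def probe_shift(token_ids, params, jitter=0.5, seed=0):
    """KL divergence between baseline and position-perturbed next-token distributions."""
    rng = np.random.default_rng(seed)
    base_pos = np.arange(len(token_ids), dtype=float)
    # Assumed perturbation: small uniform jitter added to every position index.
    perturbed_pos = base_pos + rng.uniform(-jitter, jitter, size=len(token_ids))
    p_base = next_token_distribution(token_ids, base_pos, params)
    p_pert = next_token_distribution(token_ids, perturbed_pos, params)
    return kl_divergence(p_base, p_pert)

if __name__ == "__main__":
    params = make_toy_params()
    tokens = np.array([3, 7, 1, 4, 9])
    print("shift under jitter:", probe_shift(tokens, params, jitter=0.5))
```

In a real setting the toy model would be replaced by a pretrained LLM, and the resulting shift scores (or the corresponding attention changes) would be compared between benign and misbehaving inputs.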

Paper Structure

This paper contains 43 sections, 12 equations, 34 figures, 1 table.

Figures (34)

  • Figure 1: Overview of the proposed intervention and probing mechanisms.
  • Figure 2: Comparison of intervention effects visualized with PCA (Llama-3.1-8B-Instruct).
  • Figure 3: Comparison of intervention effects visualized with supervised LDA (Llama-3.1-8B-Instruct).
  • Figure 4: Head-wise attribution analysis across backdoor datasets (VIP, MTBA, Sleeper). Left: Effect size (Cohen’s $d$). Right: Discriminability (AUC). Both reveal localized mid-to-late layer heads as carrying the strongest signals.
  • Figure 5: Cumulative FLOPs over Sleeper Dataset using Llama-3.1-8B-Instruct.
  • ...and 29 more figures
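
The two head-wise statistics named in the Figure 4 caption, Cohen's $d$ and AUC, can be computed per attention head as in the sketch below. This is an illustrative assumption rather than the paper's code: the input layout `(n_samples, n_layers, n_heads)`, the pooled-variance form of Cohen's $d$, and the function names are all hypothetical, and the per-head scores would come from whatever perturbation response is measured on misbehaving versus benign inputs.

```python
import numpy as np

def cohens_d(pos, neg):
    """Effect size: mean difference divided by the pooled standard deviation."""
    pos, neg = np.asarray(pos, dtype=float), np.asarray(neg, dtype=float)
    n1, n2 = len(pos), len(neg)
    pooled_var = ((n1 - 1) * pos.var(ddof=1) + (n2 - 1) * neg.var(ddof=1)) / (n1 + n2 - 2)
    return float((pos.mean() - neg.mean()) / np.sqrt(pooled_var))

def auc(pos, neg):
    """AUC as the probability that a positive score exceeds a negative one (ties count 0.5)."""
    pos, neg = np.asarray(pos, dtype=float), np.asarray(neg, dtype=float)
    diff = pos[:, None] - neg[None, :]
    return float((diff > 0).mean() + 0.5 * (diff == 0).mean())

def headwise_attribution(scores_pos, scores_neg):
    """Per-head Cohen's d and AUC.

    scores_pos, scores_neg: arrays of shape (n_samples, n_layers, n_heads)
    holding one scalar response per sample and head (assumed layout).
    """
    n_layers, n_heads = scores_pos.shape[1:]
    d_map = np.zeros((n_layers, n_heads))
    auc_map = np.zeros((n_layers, n_heads))
    for layer in range(n_layers):
        for head in range(n_heads):
            d_map[layer, head] = cohens_d(scores_pos[:, layer, head], scores_neg[:, layer, head])
            auc_map[layer, head] = auc(scores_pos[:, layer, head], scores_neg[:, layer, head])
    return d_map, auc_map

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    benign = rng.normal(size=(40, 2, 3))
    flagged = rng.normal(size=(40, 2, 3))
    flagged[:, 1, 2] += 5.0              # synthetic signal planted in one head
    d_map, auc_map = headwise_attribution(flagged, benign)
    print("most discriminative head (layer, head):",
          np.unravel_index(auc_map.argmax(), auc_map.shape))
```

Plotting `d_map` and `auc_map` as layer-by-head heatmaps gives the kind of view the caption describes, where a few heads stand out from the rest.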