Inverse Language Modeling towards Robust and Grounded LLMs

Davide Gabrielli, Simone Sestito, Iacopo Masi

TL;DR

Inverse Language Modeling (ILM) introduces a bidirectional training objective that pairs forward next-token prediction with backward prompt reconstruction, guided by Perceptually Aligned Gradients (PAG). By training LLMs to invert outputs back toward their inputs, ILM improves robustness to adversarial prompts and enhances grounding, enabling better detection of potentially unsafe inputs and more controllable behavior. The paper compares several model variants (Baseline, Identity, Bert-like, Inv-First) and reports improved resistance to GCG attacks while preserving forward-mode fluency. The approach aims at more trustworthy, grounded LLMs and provides a framework for red teaming and for analyzing prompt-triggered failures. The code is public, and scaling to larger models is left as future work.

Abstract

The current landscape of defensive mechanisms for LLMs is fragmented and underdeveloped compared with the defenses available for classifiers. To further promote adversarial robustness in LLMs, we propose Inverse Language Modeling (ILM), a unified framework that simultaneously 1) improves the robustness of LLMs to input perturbations and 2) enables native grounding by inverting model outputs to identify potentially toxic or unsafe input triggers. ILM transforms LLMs from static generators into analyzable and robust systems, potentially helping red teaming. ILM can lay the foundation for next-generation LLMs that are not only robust and grounded but also fundamentally more controllable and trustworthy. The code is publicly available at github.com/davegabe/pag-llm.

Paper Structure

This paper contains 14 sections, 7 equations, 4 figures, 17 tables, 2 algorithms.

Figures (4)

  • Figure 1: Illustration of the Inverse Language Modeling (ILM) setup. The forward pass predicts next tokens; the backward pass reconstructs inputs from gradients.
  • Figure 2: Gradient influence diagram in an LLM when only token $e_3$ is changed. In the forward pass, only future hidden states are affected. In the backward pass, the change propagates to every token embedding.
  • Figure 3: Parallelism between last hidden states and embedding gradients: both can be mapped through the LM head to token predictions.
  • Figure 4: GCG success rate as a function of the number of iterations performed.
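
The idea summarized in Figures 1 and 3, that gradients at the input embeddings can be mapped through the LM head to token predictions, can be illustrated with a toy sketch. Everything below is an assumption made for illustration and is not taken from the paper's code: the "model" is the identity (the hidden state equals the input embedding), the LM head is tied to a hypothetical 4-token embedding matrix `E`, and the negative gradient is the vector that gets projected. The matrix is not trained, so the sketch shows only the mechanics of the backward readout, not that it recovers the input.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def lm_head(weights, vector):
    """Tied LM head: logits[k] = <weights[k], vector>."""
    return [sum(w * v for w, v in zip(row, vector)) for row in weights]

def embedding_gradient(weights, hidden, target):
    """Gradient of cross-entropy w.r.t. `hidden` when logits = W @ hidden.

    Closed form: W^T (softmax(W @ hidden) - onehot(target)).
    Because the toy model is the identity, this is also the gradient
    with respect to the input embedding.
    """
    probs = softmax(lm_head(weights, hidden))
    residual = [p - (1.0 if k == target else 0.0) for k, p in enumerate(probs)]
    return [
        sum(residual[k] * weights[k][j] for k in range(len(weights)))
        for j in range(len(hidden))
    ]

def inverse_token_distribution(weights, grad):
    """Map the negative gradient through the LM head to a token distribution.

    Projecting the *negative* gradient is an assumption of this sketch.
    """
    return softmax(lm_head(weights, [-g for g in grad]))

# Hypothetical tied embedding / LM-head matrix: 4 tokens, dimension 3.
E = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
    [0.5, 0.5, 0.0],
]

input_token, next_token = 0, 1
hidden = E[input_token]  # identity "model": hidden state = input embedding

# Forward objective: predict next_token from the input embedding.
forward_loss = -math.log(softmax(lm_head(E, hidden))[next_token])

# Backward readout: the gradient at the input embedding, projected to tokens.
grad = embedding_gradient(E, hidden, next_token)
inverse_probs = inverse_token_distribution(E, grad)

# An inverse loss would score how well the readout recovers the input token.
inverse_loss = -math.log(inverse_probs[input_token])

print("forward loss:", forward_loss)
print("inverse distribution:", inverse_probs)
print("inverse loss:", inverse_loss)
```

In a real LLM the gradient would come from automatic differentiation through the full network rather than from a closed form, and a bidirectional objective of the kind described in the TL;DR would combine the forward and inverse losses during training.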