Teaching Metric Distance to Discrete Autoregressive Language Models

Jiwan Chung, Saejin Kim, Yongrae Jo, Jaewoo Park, Dongjun Min, Youngjae Yu

TL;DR

This work addresses a limitation of standard autoregressive models: they treat tokens as purely discrete one-hot targets and ignore the underlying metric relationships when outputs are numeric, spatial, or embedded. It introduces DIST2Loss, a distance-aware objective that constructs a discretized distance-based target distribution $p_d(v|x,t)$ from a distance function $d$ and trains the model to match it by minimizing a KL divergence, effectively implementing a closed-form analogue of entropy-regularized policy optimization. The approach unifies cross-entropy with a distance-based regularizer, supports high-dimensional distances and vector-quantized representations, and is demonstrated on toy regression, visual grounding, robotic manipulation, reward modeling, and image generation, with the clearest gains in low-data regimes. Practically, DIST2Loss is plug-and-play, data-efficient, and compatible with existing backbones and architectures, enabling more faithful generation of metric-aware outputs without additional data or RL instability.
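
One way to make the objective concrete is written out below. This is a sketch under stated assumptions rather than the paper's exact formulation: the temperature $\tau > 0$, the vocabulary symbol $\mathcal{V}$, and the choice of a softmax over negative distances to the ground-truth token $x_t$ are illustrative, and the paper's own definition may differ in its details.

$$
p_d(v \mid x, t) = \frac{\exp\big(-d(v, x_t)/\tau\big)}{\sum_{v' \in \mathcal{V}} \exp\big(-d(v', x_t)/\tau\big)},
\qquad
\mathcal{L}(\theta) = \sum_t D_{\mathrm{KL}}\big(p_d(\cdot \mid x, t) \,\|\, p_\theta(\cdot \mid x_{<t})\big)
$$

Expanding the KL term shows how cross-entropy and a distance-weighted term can sit in one loss:

$$
D_{\mathrm{KL}}(p_d \,\|\, p_\theta) = -\,p_d(x_t \mid x, t)\,\log p_\theta(x_t \mid x_{<t}) \;-\; \sum_{v \neq x_t} p_d(v \mid x, t)\,\log p_\theta(v \mid x_{<t}) \;-\; H(p_d)
$$

The first term is a scaled cross-entropy on the ground-truth token, the second rewards probability mass on tokens close to it, and the entropy $H(p_d)$ does not depend on $\theta$. Under this assumed form, if $d(v, x_t) = 0$ only for $v = x_t$, then as $\tau \to 0$ the target $p_d$ collapses to a one-hot vector and the loss reduces to ordinary cross-entropy.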

Abstract

As large language models expand beyond natural language to domains such as mathematics, multimodal understanding, and embodied agents, tokens increasingly reflect metric relationships rather than purely linguistic meaning. We introduce DIST2Loss, a distance-aware framework designed to train autoregressive discrete models by leveraging predefined distance relationships among output tokens. At its core, DIST2Loss transforms continuous exponential family distributions derived from inherent distance metrics into discrete, categorical optimization targets compatible with the models' architectures. This approach enables the models to learn and preserve meaningful distance relationships during token generation while maintaining compatibility with existing architectures. Empirical evaluations show consistent performance gains in diverse multimodal applications, including visual grounding, robotic manipulation, generative reward modeling, and image generation using vector-quantized features. These improvements are most notable in low-data regimes, demonstrating DIST2Loss's strength under resource constraints.

Paper Structure

This paper contains 66 sections, 11 equations, 8 figures, 12 tables.

Figures (8)

  • Figure 1: Tasks outside of language often require outputs with metric structure, for example quantities or coordinates, making distance-aware modeling advantageous.
  • Figure 2: DIST2Loss finetunes discrete autoregressive models with a distance-aware target distribution instead of a one-hot target. The procedure is: (a) define a token distance metric $d(x, x')$, (b) convert the metric into a continuous distribution $p(x, x')$, (c) discretize the distribution to obtain $p_d(x, x')$, and (d) compute the KL divergence loss between the target $p_d$ and the model likelihood $p_\theta$ per token.
  • Figure 3: (Left) Experimental results showing MAE and RMSE across varying numbers of training samples. The y-axis is inverted for visualization. (Right) Overview of the task setup in the meta linear regression experiment, where the model learns to perform linear regression based on the data points.
  • Figure 4: Illustration of token distance effects on image semantics. Each row shows VQ-encoded images with four central tokens replaced by: the original, a nearby token (top-10), a random token, and a distant token (bottom-10). Nearby tokens preserve semantics; random or distant ones cause distortions or semantic shifts.
  • Figure 5: Qualitative examples from the generative reward modeling experiment.
  • ...and 3 more figures
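
The four steps in the Figure 2 caption can be sketched in a few lines of plain Python. This is a minimal illustration, not the authors' implementation: the absolute-difference metric, the `temperature` parameter, and the helper names `distance_target`, `log_softmax`, and `kl_loss` are all assumptions made for the example, and the vocabulary is a toy set of ten numeric tokens.

```python
import math


def distance_target(target_value, token_values, temperature=1.0):
    """Steps (a)-(c): build a discrete target distribution from distances."""
    # (a) Distance from every candidate token to the ground-truth token.
    #     Absolute difference is an assumed metric for numeric tokens.
    distances = [abs(v - target_value) for v in token_values]
    # (b) Turn distances into unnormalized exponential weights.
    weights = [math.exp(-dist / temperature) for dist in distances]
    # (c) Normalize over the finite vocabulary to get a categorical target.
    total = sum(weights)
    return [w / total for w in weights]


def log_softmax(logits):
    """Numerically stable log-softmax over a list of logits."""
    peak = max(logits)
    log_z = peak + math.log(sum(math.exp(l - peak) for l in logits))
    return [l - log_z for l in logits]


def kl_loss(target_probs, logits):
    """Step (d): KL(target || model) for one token position."""
    log_q = log_softmax(logits)
    return sum(
        p * (math.log(p) - lq)
        for p, lq in zip(target_probs, log_q)
        if p > 0.0
    )


# Toy vocabulary: tokens stand for the integers 0..9; the ground truth is 3.
token_values = list(range(10))
target = distance_target(3, token_values, temperature=1.0)

# Two hypothetical model outputs: one peaked near 4, one peaked near 9.
near_logits = [-abs(v - 4) for v in token_values]
far_logits = [-abs(v - 9) for v in token_values]

loss_near = kl_loss(target, near_logits)
loss_far = kl_loss(target, far_logits)
print(f"near-miss loss: {loss_near:.3f}, far-miss loss: {loss_far:.3f}")
```

In this sketch a prediction concentrated near the ground truth receives a smaller loss than one concentrated far away, which is the behavior a distance-aware target is meant to encourage. Lowering `temperature` sharpens the target toward a one-hot vector.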