Embedding Trajectory for Out-of-Distribution Detection in Mathematical Reasoning

Yiming Wang, Pei Zhang, Baosong Yang, Derek F. Wong, Zhuosheng Zhang, Rui Wang

TL;DR

The paper proposes TV score, a trajectory-based method that uses trajectory volatility for OOD detection in mathematical reasoning. It outperforms all traditional algorithms on GLMs in mathematical reasoning scenarios and can be extended to other applications with high-density output spaces, such as multiple-choice questions.

Abstract

Real-world data that deviates from the independent and identically distributed (i.i.d.) assumption of in-distribution training data poses security threats to deep networks, which has driven the development of out-of-distribution (OOD) detection algorithms. Detection methods for generative language models (GLMs) mainly focus on uncertainty estimation and embedding distance measurement, with the latter proven to be most effective in traditional linguistic tasks such as summarization and translation. However, mathematical reasoning, another complex generative scenario, poses significant challenges to embedding-based methods because its output space is highly dense. This same feature causes larger discrepancies in the embedding shift trajectory between different samples in latent space. Hence, we propose a trajectory-based method, TV score, which uses trajectory volatility for OOD detection in mathematical reasoning. Experiments show that our method outperforms all traditional algorithms on GLMs in mathematical reasoning scenarios and can be extended to more applications with high-density output spaces, such as multiple-choice questions.
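To make the idea of an "embedding shift trajectory" concrete, the following is a minimal sketch, not the paper's TV score (whose exact definition is not given in this summary). It assumes a sample is represented by one embedding per layer, stacked into an array of shape `(L, d)`, and uses the mean size of the layer-to-layer shifts as an illustrative volatility proxy. The function name `trajectory_volatility` and the choice of the Euclidean norm are assumptions made for illustration.

```python
import numpy as np


def trajectory_volatility(layer_embeddings: np.ndarray) -> float:
    """Illustrative volatility proxy for one sample (hypothetical helper).

    layer_embeddings: array of shape (L, d), one embedding per layer.
    Returns the mean Euclidean norm of the shifts between adjacent layers.
    This is an assumed stand-in, not the authors' TV score.
    """
    layer_embeddings = np.asarray(layer_embeddings, dtype=float)
    if layer_embeddings.ndim != 2 or layer_embeddings.shape[0] < 2:
        raise ValueError("expected an (L, d) array with L >= 2")

    # Shift between consecutive layers: y_{l+1} - y_l, shape (L-1, d).
    shifts = np.diff(layer_embeddings, axis=0)

    # Size of each shift, shape (L-1,).
    step_sizes = np.linalg.norm(shifts, axis=1)

    return float(step_sizes.mean())


# A trajectory that moves once and then stays put.
example = np.array([[0.0, 0.0], [3.0, 4.0], [3.0, 4.0]])
score = trajectory_volatility(example)
```

Under this proxy, a sample whose embedding barely moves across layers gets a score near zero, while a sample whose embedding jumps between layers gets a larger one. Whether higher or lower values indicate OOD inputs depends on the actual scoring rule, which this sketch does not attempt to reproduce.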

Paper Structure

This paper contains 50 sections, 5 theorems, 35 equations, 5 figures, 14 tables, 1 algorithm.

Key Result

Theorem 2.1

We assume that $\{{\boldsymbol y}_l\}_{l=1}^{L}$ are all independent variables sampled from the vector space $\mathbb{R}^d$. For different samples $s_i$ and $s_j$, their embedding sets are $\{[{\boldsymbol y}_i]_l\}_{l=1}^{L}$ and $\{[{\boldsymbol y}_j]_l\}_{l=1}^{L}$, respectively. The likelihood of t

Figures (5)

  • Figure 1: Embedding projection and cases of input and output spaces under mathematical reasoning and text generation scenarios. We select the MATH dataset [cobbe2021training] for mathematical reasoning and OPUS [tiedemann2012parallel] for text generation, each with four diverse domains. Different colors represent different domains, with lighter and darker shades indicating input and output. We use SimCSE [gao2021simcse] for sentence embeddings and UMAP [mcinnes2018umap] for dimensionality reduction. The appendix on empirical details shows detailed settings and examples.
  • Figure 2: The "pattern collapse" phenomenon exists only in mathematical reasoning scenarios: two samples that start far apart converge approximately at the endpoint after undergoing embedding shifts. It does not occur in text generation scenarios. As a result, trajectories are more likely to vary across different samples in mathematical reasoning.
  • Figure 3: Trajectory volatility curve comparisons between one ID dataset and ten OOD datasets from diverse mathematical domains. Each trajectory represents the average over all samples from the corresponding dataset, with the shaded band showing the sample standard deviation. Llama2-7B is used as the backbone.
  • Figure 4: Smoothing order $k$ analysis: $k$ ranges from $0$ to $5$ ($k=0$ corresponds to the original TV Score). The upper part is for the OOD detection scenario and the lower part is for the OOD quality estimation scenario; the left part is for the far-shift OOD datasets and the right part is for the near-shift OOD datasets.
  • Figure 5: AUROC score matrices on the MMLU dataset for different OOD scores. Rows represent ID data, and columns represent OOD data.
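
The row/column layout described for Figure 5 can be sketched as follows. This is a generic illustration under stated assumptions, not the paper's evaluation code: the nested dictionary `scores[id_name][eval_name]` (scores assigned to samples from `eval_name` by a detector fitted on `id_name`) and both function names are hypothetical, and higher scores are assumed to mean "more likely OOD". AUROC is computed with the standard pairwise (Mann-Whitney) formulation.

```python
import numpy as np


def auroc(id_scores, ood_scores) -> float:
    """Probability that a random OOD sample scores higher than a random
    ID sample, counting ties as one half (pairwise AUROC)."""
    id_scores = np.asarray(id_scores, dtype=float)
    ood_scores = np.asarray(ood_scores, dtype=float)
    # Compare every OOD score against every ID score.
    greater = (ood_scores[:, None] > id_scores[None, :]).sum()
    ties = (ood_scores[:, None] == id_scores[None, :]).sum()
    return float((greater + 0.5 * ties) / (ood_scores.size * id_scores.size))


def auroc_matrix(scores):
    """Build a matrix with rows = ID domain, columns = OOD domain.

    scores[id_name][eval_name] is assumed to hold the OOD scores that a
    detector fitted on id_name assigns to samples from eval_name.
    Diagonal entries are left as NaN (a domain is not OOD to itself).
    """
    names = list(scores)
    matrix = np.full((len(names), len(names)), np.nan)
    for i, id_name in enumerate(names):
        for j, ood_name in enumerate(names):
            if i != j:
                matrix[i, j] = auroc(scores[id_name][id_name],
                                     scores[id_name][ood_name])
    return names, matrix


# Toy scores for two hypothetical domains.
toy_scores = {
    "a": {"a": [0.1, 0.2], "b": [0.8, 0.9]},
    "b": {"b": [0.1, 0.3], "a": [0.2, 0.4]},
}
names, matrix = auroc_matrix(toy_scores)
```

In the toy data, the detector fitted on domain "a" separates "b" perfectly, while the one fitted on "b" ranks three of the four ID/OOD pairs correctly, so the two off-diagonal entries differ. The matrix is therefore not symmetric in general.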

Theorems & Definitions (5)

  • Theorem 2.1: Main Theorem
  • Theorem C.1: Main Theorem
  • Proposition C.1: Lagrange Remainder Term
  • Lemma C.1: Error Bound for the Midpoint Rule
  • Lemma C.2: Differential-Integral Error Order Estimation