
SafeMath: Inference-time Safety improves Math Accuracy

Sagnik Basu, Subhrajit Mitra, Aman Juneja, Somnath Banerjee, Rima Hazra, Animesh Mukherjee

Abstract

Recent research points toward LLMs being manipulated through adversarial and seemingly benign inputs, resulting in harmful, biased, or policy-violating outputs. In this paper, we study an underexplored issue concerning harmful and toxic mathematical word problems. We show that math questions, particularly those framed as natural language narratives, can serve as a subtle medium for propagating biased, unethical, or psychologically harmful content, with heightened risks in educational settings involving children. To support a systematic study of this phenomenon, we introduce ToxicGSM, a dataset of 1.9k arithmetic problems in which harmful or sensitive context is embedded while preserving mathematically well-defined reasoning tasks. Using this dataset, we audit the behaviour of existing LLMs and analyse the trade-offs between safety enforcement and mathematical correctness. We further propose SafeMath -- a safety alignment technique that reduces harmful outputs while maintaining, and in some cases improving, mathematical reasoning performance. Our results highlight the importance of disentangling linguistic harm from math reasoning and demonstrate that effective safety alignment need not come at the cost of accuracy. We release the source code and dataset at https://github.com/Swagnick99/SafeMath/tree/main.

Paper Structure

This paper contains 26 sections, 6 equations, 9 figures, 5 tables.

Figures (9)

  • Figure 1: Model responses before and after applying the interventions. Our model gives safe and helpful responses without sacrificing mathematical correctness.
  • Figure 2: Distribution of harm categories in the ToxicGSM dataset.
  • Figure 3: Overview of the SafeMath framework. We learn two ICVs using PCA over hidden representations of contrastively paired examples: $\mathrm{ICV}_\textsc{sf}$ and $\mathrm{ICV}_\textsc{ma}$. These vectors are injected into the base model at inference time after scaling them by coefficients $\alpha$ and $\beta$ (an illustrative sketch of this recipe follows the figure list below).
  • Figure 4: Radar plots for safety evaluations in LlamaMath (left), DeepSeekMath (middle) and Qwen2Math (right) models.
  • Figure 5: Attribution score differences $\theta_\textsc{SF}$ and $\theta_\textsc{M}$ centred around the numerical tokens.
  • ...and 4 more figures
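
The Figure 3 caption summarises the mechanism: two in-context vectors (ICVs), one for safety and one for math, are extracted with PCA over hidden representations of contrastively paired examples and added to the model's activations at inference time, scaled by $\alpha$ and $\beta$. The snippet below is a minimal sketch of one way to realise that recipe with a HuggingFace-style causal LM; the function names, the choice of layer, the last-token pooling, and the PCA-over-differences formulation are our own illustrative assumptions, not the authors' released implementation (see the linked repository for that).

```python
# Illustrative sketch of the ICV idea from the Figure 3 caption.
# NOT the authors' implementation; layer choice, pooling, and injection
# site are hypothetical assumptions for demonstration only.
import torch


def compute_icv(model, tokenizer, pos_texts, neg_texts, layer=-1):
    """Estimate an in-context vector as the top principal component of the
    hidden-state differences between contrastively paired examples."""
    diffs = []
    for pos, neg in zip(pos_texts, neg_texts):
        reps = []
        for text in (pos, neg):
            ids = tokenizer(text, return_tensors="pt").to(model.device)
            with torch.no_grad():
                out = model(**ids, output_hidden_states=True)
            # Last-token hidden state at the chosen layer (an assumption).
            reps.append(out.hidden_states[layer][0, -1])
        diffs.append(reps[0] - reps[1])
    diffs = torch.stack(diffs).float()
    # PCA via SVD on the mean-centred difference matrix: the first right
    # singular vector is the dominant "positive minus negative" direction.
    diffs = diffs - diffs.mean(dim=0, keepdim=True)
    _, _, vh = torch.linalg.svd(diffs, full_matrices=False)
    return vh[0]


def add_icv_hook(layer_module, icv_sf, icv_ma, alpha, beta):
    """Register a forward hook that shifts the layer's hidden states by the
    scaled safety and math vectors at every decoding step."""
    shift = alpha * icv_sf + beta * icv_ma

    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        hidden = hidden + shift.to(hidden.dtype)
        return (hidden, *output[1:]) if isinstance(output, tuple) else hidden

    return layer_module.register_forward_hook(hook)
```

Under these assumptions, the two vectors would be computed once from paired (safe vs. harmful, correct vs. incorrect) examples and then injected without any weight updates, which is what makes the intervention purely inference-time.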