Safety Game: Balancing Safe and Informative Conversations with Blackbox Agentic AI using LP Solvers

Tuan Nguyen, Long Tran-Thanh

TL;DR

The paper tackles the challenge of safety alignment for black-box LLMs by formulating the safety–helpfulness trade-off as a two-player zero-sum game and solving for equilibrium strategies with a linear program at inference time. It avoids retraining and internal access by operating over a finite candidate set and using model-agnostic probes to estimate helpfulness and safety margins, M_i and Δ_i. A bounded-multiplier reformulation with a sigmoid penalty softens the risk cap and yields a two-branch interpretation that balances informative responses with safety constraints. Empirical results on HHH, TruthfulQA, and SafetyBench across multiple open models show that Safety Game (SG) often outperforms state-of-the-art decoding/ranking baselines, particularly on the large SafetyBench dataset, while maintaining or improving helpfulness on smaller benchmarks. The work demonstrates a scalable, accessible pathway for third-party stakeholders to enforce safety in rapidly evolving LLM ecosystems without model modification or retraining.

Abstract

Ensuring that large language models (LLMs) comply with safety requirements is a central challenge in AI deployment. Existing alignment approaches primarily operate during training, such as through fine-tuning or reinforcement learning from human feedback, but these methods are costly and inflexible, requiring retraining whenever new requirements arise. Recent efforts toward inference-time alignment mitigate some of these limitations but still assume access to model internals, which is impractical for third-party stakeholders who have no such access. In this work, we propose a model-independent, black-box framework for safety alignment that does not require retraining or access to the underlying LLM architecture. As a proof of concept, we address the problem of trading off safe but uninformative answers against helpful yet potentially risky ones. We formulate this dilemma as a two-player zero-sum game whose minimax equilibrium captures the optimal balance between safety and helpfulness. LLM agents operationalize this framework by leveraging a linear programming solver at inference time to compute equilibrium strategies. Our results demonstrate the feasibility of black-box safety alignment, offering a scalable and accessible pathway for stakeholders, including smaller organizations and entities in resource-constrained settings, to enforce safety across rapidly evolving LLM ecosystems.

Paper Structure

This paper contains 32 sections, 2 theorems, 15 equations, 1 figure, 9 tables, 2 algorithms.

Key Result

Proposition 3.1

Assume there exists some candidate $j$ with $M_j>0$ and $\Delta_j>0$, and that the unconstrained maximizer of $M(\pi)$ violates the risk cap $R(\pi)\le T$. Then every optimizer $\pi^\star$ of the problem in (eq:lag) satisfies $R(\pi^\star)=T$.
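
The boundary behavior stated in Proposition 3.1 can be checked numerically on a toy instance. The sketch below is not the paper's algorithm: it assumes, for illustration only, that $M(\pi)=\sum_i \pi_i M_i$ and $R(\pi)=\sum_i \pi_i \Delta_i$ are both linear in the mixed strategy $\pi$, and it solves the hard-capped program directly rather than the formulation referenced as eq:lag, which is not reproduced on this page. The function name `solve_capped_lp` and all numbers are hypothetical. Under these assumptions the feasible set has only two constraints besides nonnegativity, so every vertex mixes at most two candidates and a small enumeration is enough.

```python
from itertools import combinations


def solve_capped_lp(M, D, T, eps=1e-12):
    """Maximize sum(pi_i * M_i) subject to sum(pi_i * D_i) <= T, pi on the simplex.

    Assumption (not from the source): helpfulness and risk are linear in pi.
    The optimum of such an LP sits at a vertex, and each vertex is either a
    feasible pure strategy or a two-candidate mixture whose risk equals T.
    """
    n = len(M)
    best_pi, best_value = None, float("-inf")

    def consider(pi):
        nonlocal best_pi, best_value
        value = sum(p * m for p, m in zip(pi, M))
        if value > best_value + eps:
            best_pi, best_value = pi, value

    # Vertices of the first kind: pure strategies that respect the cap.
    for i in range(n):
        if D[i] <= T + eps:
            pi = [0.0] * n
            pi[i] = 1.0
            consider(pi)

    # Vertices of the second kind: mixtures of one candidate below the cap
    # and one above it, weighted so that the mixture's risk is exactly T.
    for i, j in combinations(range(n), 2):
        low, high = (i, j) if D[i] < D[j] else (j, i)
        if D[low] < T < D[high]:
            w = (T - D[low]) / (D[high] - D[low])  # weight on the riskier one
            pi = [0.0] * n
            pi[high] = w
            pi[low] = 1.0 - w
            consider(pi)

    return best_pi, best_value


# Hypothetical candidates: a refusal, a cautious answer, a detailed answer.
M = [0.0, 0.6, 1.0]  # assumed helpfulness scores M_i
D = [0.0, 0.3, 0.8]  # assumed risk contributions Delta_i
T = 0.4              # assumed risk cap

pi, value = solve_capped_lp(M, D, T)
risk = sum(p * d for p, d in zip(pi, D))
print("pi =", pi, "helpfulness =", value, "risk =", risk)
```

In this instance the most helpful candidate alone has risk 0.8, above the cap, so the unconstrained maximizer is infeasible. The enumeration returns a mixture of roughly 0.8 on the cautious answer and 0.2 on the detailed one, with risk equal to $T$ up to floating-point error, which matches the claim $R(\pi^\star)=T$ for this toy case.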

Figures (1)

  • Figure 1: Reward distributions on HHH. SG (Sigmoid) concentrates near the HHH reference mean (dashed line), exhibits a positive skew, and substantially suppresses the negative left tail compared to baselines.

Theorems & Definitions (2)

  • Proposition 3.1: Boundary selection under tradeoff
  • Proposition 3.2: Boundary sensitivity