
Neural Risk-sensitive Satisficing in Contextual Bandits

Shogo Ito, Tatsuji Takahashi, Yu Kono

TL;DR

This work addresses the challenges of contextual bandits in large state-action spaces by extending risk-sensitive satisficing from RegLinRS to NeuralRS. NeuralRS uses neural networks to model non-linear mappings from features to rewards, and it introduces a local reliability mechanism based on latent-space centroids to balance exploration and exploitation. Several candidate definitions of reliability are also evaluated. On an artificial dataset and the real-world Statlog-Shuttle dataset, NeuralRS achieves lower regret than linear baselines and some NN-based baselines, with k-means reliability offering a favorable trade-off between performance and efficiency. The results suggest NeuralRS is well suited to real-time, personalization-oriented applications where exploration is costly and the relationship between features and rewards is non-linear.

Abstract

The contextual bandit problem, a type of reinforcement learning task, provides an effective framework for solving challenges in recommendation systems, such as satisfying real-time requirements, enabling personalization, and addressing cold-start problems. However, contextual bandit algorithms must handle large state-action spaces sequentially, which creates difficulties of its own: the high costs of learning and of balancing exploration and exploitation, and large variations in performance depending on the application domain. To address these challenges, Tsuboya et al. proposed the Regional Linear Risk-sensitive Satisficing (RegLinRS) algorithm. RegLinRS switches between exploration and exploitation based on how well the agent has achieved its target. However, RegLinRS approximates reward expectations as a linear function of the features, which limits its applicability when the relationship between features and reward expectations is non-linear. To handle more complex environments, we propose Neural Risk-sensitive Satisficing (NeuralRS), which incorporates neural networks into RegLinRS, and demonstrate its utility.
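
The switching behavior described above can be illustrated with a minimal, non-contextual sketch. It assumes a satisficing value of the form reliability × (estimated reward − target), with reliability taken as the fraction of trials spent on each action. This form is an assumption for illustration and is not stated in the abstract; the function names rs_values and select_action are hypothetical. RegLinRS and NeuralRS work with context-dependent estimates and a local notion of reliability, which this sketch does not attempt to reproduce.

```python
from typing import List, Sequence


def rs_values(estimates: Sequence[float],
              counts: Sequence[int],
              target: float) -> List[float]:
    """Illustrative satisficing value per action (assumed form, not the paper's).

    value_i = (n_i / N) * (estimate_i - target)
    where n_i is the number of times action i was tried and N is the total.
    """
    total = sum(counts)
    if total == 0:
        # No trials yet: every action is equally (un)reliable.
        return [0.0 for _ in estimates]
    return [(n / total) * (e - target) for e, n in zip(estimates, counts)]


def select_action(estimates: Sequence[float],
                  counts: Sequence[int],
                  target: float) -> int:
    """Pick the action with the largest satisficing value (first index on ties)."""
    values = rs_values(estimates, counts, target)
    return max(range(len(values)), key=values.__getitem__)


# Two actions: action 0 has been tried 90 times, action 1 only 10 times.
estimates = [0.4, 0.5]
counts = [90, 10]

# Target not reached by any action: all values are negative, and the
# less-tried action is penalized least, so the agent explores it.
explore_choice = select_action(estimates, counts, target=0.7)

# Target reached by both actions: all values are positive, and the
# well-tried action is weighted most, so the agent exploits it.
exploit_choice = select_action(estimates, counts, target=0.3)
```

Under this assumed form, the same rule yields exploration while the target is unmet (explore_choice is action 1) and exploitation once it is met (exploit_choice is action 0), without a separate exploration parameter.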

Paper Structure

This paper contains 22 sections, 15 equations, 5 figures, 1 table.

Figures (5)

  • Figure 1: Overview of Centroid Update in k-means.
  • Figure 2: Comparison of NeuralRS and Baseline Algorithms on an Artificial Dataset.
  • Figure 3: Comparison of NeuralRS and Baseline Algorithms on the Statlog-Shuttle Dataset.
  • Figure 4: Comparison of Reliability Candidates for NeuralRS on an Artificial Dataset.
  • Figure 5: Comparison of Reliability Candidates for NeuralRS on the Statlog-Shuttle Dataset.