Anticipating Oblivious Opponents in Stochastic Games

Shadi Tasdighi Kalat, Sriram Sankaranarayanan, Ashutosh Trivedi

TL;DR

This work tackles anticipating actions of oblivious environments in concurrent stochastic games by constructing an information-state machine (ISM) that links automaton states to beliefs about the environment's policy. The ISM is built to satisfy a lambda-consistency guarantee, ensuring the tracked beliefs stay close to those obtained from full histories, and is used to derive an MDP that enables optimal policy synthesis for the observing agent. The framework includes a semi-algorithm to synthesize consistent ISMs, robustness bounds against policy deviations, and a connection to approximate information states for theoretical guarantees on value differences. Empirical evaluations on rock-paper-scissors, IKEA furniture assembly, and cataract surgery datasets demonstrate the approach's ability to anticipate the environment's policies and actions and improve reward and next-action prediction under realistic task conditions.

Abstract

We present an approach for systematically anticipating the actions and policies employed by oblivious environments in concurrent stochastic games, while maximizing a reward function. Our main contribution lies in the synthesis of a finite information state machine whose alphabet ranges over the actions of the environment. Each state of the automaton is mapped to a belief state about the policy used by the environment. We introduce a notion of consistency that guarantees that the belief states tracked by our automaton stay within a fixed distance of the precise belief state obtained by knowledge of the full history. We provide methods for checking the consistency of an automaton and a synthesis approach that, upon successful termination, yields such a machine. We show how the information state machine yields an MDP that serves as the starting point for computing optimal policies for maximizing a reward function defined over plays. We present an experimental evaluation over benchmark examples, including human activity data for tasks such as cataract surgery and furniture assembly, wherein our approach successfully anticipates the policies and actions of the environment in order to maximize the reward.
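
To make the notion of a "belief state about the policy used by the environment" concrete, the following is a minimal sketch of exact belief tracking from the history of environment actions, which is the quantity the information state machine is meant to approximate. It is not the paper's construction: the three policies, the switching matrix, and the names `POLICIES`, `SWITCH`, `update_belief`, and `predict_action` are all illustrative assumptions. Because the environment is oblivious, the update depends only on the environment's own actions.

```python
# Illustrative assumption: the environment chooses among three fixed policies,
# each a distribution over its actions (0 = rock, 1 = paper, 2 = scissors).
POLICIES = [
    [0.8, 0.1, 0.1],  # mostly rock
    [0.1, 0.8, 0.1],  # mostly paper
    [0.1, 0.1, 0.8],  # mostly scissors
]

# Illustrative assumption: SWITCH[i][j] is the probability that the environment
# moves from policy i to policy j after each round.
SWITCH = [
    [0.8, 0.1, 0.1],
    [0.1, 0.8, 0.1],
    [0.1, 0.1, 0.8],
]


def update_belief(belief, action):
    """One Bayesian filtering step after observing an environment action."""
    n = len(belief)
    # Condition on the observed action: weight each policy by its likelihood.
    weighted = [belief[k] * POLICIES[k][action] for k in range(n)]
    total = sum(weighted)
    posterior = [w / total for w in weighted]
    # Propagate the posterior through the policy-switching Markov chain.
    return [sum(posterior[i] * SWITCH[i][j] for i in range(n)) for j in range(n)]


def predict_action(belief):
    """Predicted distribution over the environment's next action."""
    n_actions = len(POLICIES[0])
    return [
        sum(belief[k] * POLICIES[k][a] for k in range(len(belief)))
        for a in range(n_actions)
    ]


if __name__ == "__main__":
    belief = [1.0 / 3.0] * 3   # uniform prior over policies
    for observed in [0, 0, 0]:  # the environment plays rock three times
        belief = update_belief(belief, observed)
    print("belief over policies:", belief)
    print("predicted next action:", predict_action(belief))
```

The exact belief can take infinitely many values as histories grow, which is why a finite automaton whose states carry approximate beliefs is attractive. A consistency check in the spirit of the abstract would compare the belief attached to the automaton state reached by a history against the output of `update_belief` on that same history.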

Paper Structure (3 sections, 2 figures)

This paper contains 3 sections, 2 figures.

Figures (2)

  • Figure 1: Rock-paper-scissors (RPS) game arena. Here actions $r_i$, $p_i$, and $s_i$ correspond to the choices of "rock", "paper" and "scissors" by player ${\cal P}_{i}$; (left) Reward table; (mid) player ${\cal P}_{2}$ policies; and (right) Markov chain modeling policy change for ${\cal P}_{2}$. The dashed red edges have probability $0.15$ whereas the solid edges have probability $0.12$.
  • Figure 2: States of a furniture assembly task and policies for task completion.

Theorems & Definitions (1)

  • Definition: Concurrent Stochastic Game Arena: Syntax