CANDID DAC: Leveraging Coupled Action Dimensions with Importance Differences in DAC
Philipp Bordne, M. Asif Hasan, Eddie Bergman, Noor Awad, André Biedenkapp
TL;DR
This paper addresses dynamic algorithm configuration (DAC) in high-dimensional action spaces where action dimensions are coupled and differ in importance, a setting it calls CANDID. It introduces a white-box Piecewise Linear benchmark within the DACBench suite that instantiates the CANDID properties through a weighted aggregation across action dimensions (weights $w_m = \lambda^{m-1}$) and an exponential reward $r_t = e^{-c \cdot \text{prederror}(a_t^{1:M})}$, with targets defined by a piecewise linear function over time steps and dimension combinations. To tackle the resulting coordination challenge, the authors develop SDQN-inspired sequential policies, SAQL and simSDQN, which learn one policy per action dimension and condition each on the actions already chosen for earlier dimensions. These are compared against a single-agent DDQN baseline and an independent Q-learning baseline. Experiments show that sequential policies perform best in CANDID settings, scale better as the action space grows, and benefit from selecting actions in order of importance, suggesting a viable path for coordinating high-dimensional DAC problems. The code is publicly available, and the work motivates integration with state-of-the-art MARL methods and communication strategies to further improve scalability and coordination.
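To make the reward structure concrete, the following is a minimal sketch of how geometrically decaying weights $w_m = \lambda^{m-1}$ and an exponential reward could interact. It is not the benchmark's implementation: the function names, the normalization of the weighted sum, the use of an absolute difference as the prediction error, and the default values of `lam` and `c` are all assumptions made for illustration.

```python
import math


def importance_weights(num_dims, lam):
    """Weights w_m = lam^(m-1) for m = 1..M (index 0 holds w_1 = 1)."""
    return [lam ** m for m in range(num_dims)]


def aggregate(actions, lam):
    """Weighted mean of the per-dimension actions.

    Assumption: the weighted sum is normalized by the sum of weights;
    the source only states that aggregation is weighted.
    """
    weights = importance_weights(len(actions), lam)
    return sum(w * a for w, a in zip(weights, actions)) / sum(weights)


def reward(actions, target, lam=0.5, c=1.0):
    """Exponential reward r = exp(-c * pred_error).

    Assumption: pred_error is the absolute gap between the aggregated
    action and the target value.
    """
    pred_error = abs(aggregate(actions, lam) - target)
    return math.exp(-c * pred_error)


# A mistake in the first (most important) dimension costs more reward
# than the same mistake in the last (least important) dimension.
wrong_first = reward([0.0, 1.0, 1.0], target=1.0)
wrong_last = reward([1.0, 1.0, 0.0], target=1.0)
```

Under these assumptions, `wrong_first` is smaller than `wrong_last`, which illustrates why the order in which dimensions are handled can matter.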
Abstract
High-dimensional action spaces remain a challenge for dynamic algorithm configuration (DAC). Interdependencies and varying importance between action dimensions are further key characteristics of DAC problems. We argue that these Coupled Action Dimensions with Importance Differences (CANDID) represent aspects of the DAC problem that are not yet fully explored. To address this gap, we introduce a new white-box benchmark within the DACBench suite that simulates the properties of CANDID. Further, we propose sequential policies as an effective strategy for managing these properties. Such policies factorize the action space and mitigate exponential growth by learning a policy per action dimension. At the same time, these policies accommodate the interdependence of action dimensions by fostering implicit coordination. We show this in an experimental study of value-based policies on our new benchmark. This study demonstrates that sequential policies significantly outperform independent learning of factorized policies in CANDID action spaces. In addition, they overcome the scalability limitations associated with learning a single policy across all action dimensions. The code used for our experiments is available at https://github.com/PhilippBordne/candidDAC.
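The idea of factorizing the action space while still coordinating across dimensions can be sketched as follows. This is an illustrative toy and not the authors' SAQL or simSDQN implementation: the `QFunction` signature, the greedy selection rule, and the hand-written toy Q-functions are hypothetical stand-ins for learned networks.

```python
from typing import Callable, List, Sequence

# Hypothetical interface: a per-dimension Q-function receives the state and
# the actions already chosen for earlier dimensions, and returns one value
# per candidate action in its own dimension.
QFunction = Callable[[Sequence[float], Sequence[int]], List[float]]


def select_sequentially(state, q_functions):
    """Greedily pick one action per dimension, in order.

    Each dimension sees the choices made for the dimensions before it,
    which is what allows implicit coordination.
    """
    chosen = []
    for q in q_functions:
        q_values = q(state, chosen)
        best = max(range(len(q_values)), key=q_values.__getitem__)
        chosen.append(best)
    return chosen


def joint_action_count(n_actions, n_dims):
    """Outputs a single policy over the joint action space must rank."""
    return n_actions ** n_dims


def factorized_output_count(n_actions, n_dims):
    """Total outputs when one policy is learned per dimension."""
    return n_actions * n_dims


# Toy Q-functions: the first dimension prefers action 2, the second
# dimension prefers to match whatever the first dimension chose.
def toy_q_first(state, previous):
    return [0.0, 0.1, 0.9]


def toy_q_second(state, previous):
    return [1.0 if a == previous[0] else 0.0 for a in range(3)]


actions = select_sequentially([0.0], [toy_q_first, toy_q_second])
```

In this toy, with 3 actions per dimension and 10 dimensions, a joint policy would face 59049 combinations, whereas the factorized policies together produce only 30 outputs.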
