Materials Discovery using Max K-Armed Bandit
Nobuaki Kikkawa, Hiroshi Ohno
TL;DR
The paper reframes materials discovery as a max K-armed bandit problem, whose goal is to locate record-breaking material properties efficiently rather than to maximize cumulative reward. It introduces a one-parameter algorithm that uses a pseudo-upper confidence bound (pseudo-UCB) on the expected improvement (EI) of the maximum reward as the arm-selection index, enabling time-independent operation and robust performance in late-stage searches. Theoretical contributions include demonstrating how the EI of the best reward can be bounded via survival functions and establishing a pseudo-UCB framework, along with Kikkawa's greedy oracle as a horizon-free benchmark. Empirically, the method outperforms traditional bandit approaches in synthetic extreme-value tasks and demonstrates strong performance in molecular design based on Monte Carlo tree search (MCTS), highlighting its potential for scalable, discovery-focused materials exploration.
Abstract
Search algorithms for bandit problems are applicable to materials discovery. However, the objective of the conventional bandit problem differs from that of materials discovery: the conventional bandit problem aims to maximize the total reward, whereas materials discovery aims to achieve breakthroughs in material properties. The max K-armed bandit (MKB) problem, which aims to acquire the single best reward, matches discovery tasks better than the conventional bandit problem. We therefore propose a search algorithm for materials discovery based on the MKB problem that uses a pseudo-value of the upper confidence bound of the expected improvement of the best reward. This approach is pseudo-guaranteed to be an asymptotic oracle that does not depend on the time horizon. In addition, compared with other MKB algorithms, the proposed algorithm has only one hyperparameter, which is advantageous in materials discovery. We applied the proposed algorithm to synthetic problems and to molecular-design demonstrations using a Monte Carlo tree search. According to the results, the proposed algorithm stably outperformed other bandit algorithms in the late stage of the search process when the optimal arm of the MKB problem could not be determined from its expected reward.
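
The difference between the two objectives can be illustrated with a toy example. The sketch below is not the proposed algorithm; it only compares two hypothetical arms whose reward distributions (a narrow Gaussian and an exponential) are assumptions chosen for illustration. The arm with the higher expected reward wins on total reward, while the arm with the heavier upper tail wins on the single best reward, which is the situation in which the optimal MKB arm cannot be identified from its expected reward.

```python
import random


def pull_steady(rng):
    # Hypothetical arm A: high mean, narrow spread (Gaussian, mean 1.0, sd 0.1).
    return rng.gauss(1.0, 0.1)


def pull_heavy_tail(rng):
    # Hypothetical arm B: lower mean (0.5) but a heavier upper tail
    # (exponential distribution with rate 2.0).
    return rng.expovariate(2.0)


def run(pull, n_pulls, seed):
    """Pull one arm n_pulls times; return (total reward, single best reward)."""
    rng = random.Random(seed)
    rewards = [pull(rng) for _ in range(n_pulls)]
    return sum(rewards), max(rewards)


total_a, best_a = run(pull_steady, 1000, seed=0)
total_b, best_b = run(pull_heavy_tail, 1000, seed=0)

# Conventional bandit objective: compare cumulative rewards.
print("total reward  A:", round(total_a, 2), " B:", round(total_b, 2))
# Max K-armed bandit objective: compare the single best rewards.
print("best reward   A:", round(best_a, 2), " B:", round(best_b, 2))
```

Under these assumed distributions, a policy that ranks arms by mean reward would keep pulling arm A, whereas a discovery-oriented search would prefer arm B.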
