Assessing the Interpretability of Programmatic Policies with Large Language Models
Zahra Bashir, Michael Bowling, Levi H. S. Lelis
TL;DR
The paper addresses the challenge of evaluating the interpretability of programmatic policies encoded as domain-specific language (DSL) programs. It proposes a scalable, inexpensive metric that leverages large language models (LLMs): given a program and a description of its language, one LLM generates a natural-language explanation, and a second LLM then attempts to reconstruct the program from that explanation; interpretability is quantified by the behavioral similarity between the original and reconstructed programs. The approach is validated on synthesized and human-crafted MicroRTS policies and on obfuscated variants of them, with the metric ranking the obfuscated, less interpretable programs lower. The work provides a practical framework for assessing the interpretability of policy-encoding programs and highlights the role of DSL definitions and prompts in shaping interpretability outcomes, with potential impact on policy design and evaluation tooling.
Abstract
Although the synthesis of programs encoding policies often carries the promise of interpretability, systematic evaluations have not been performed to assess the interpretability of these policies, likely because of the complexity of such an evaluation. In this paper, we introduce a novel metric that uses large language models (LLMs) to assess the interpretability of programmatic policies. For our metric, an LLM is given both a program and a description of its associated programming language. The LLM then formulates a natural-language explanation of the program. This explanation is subsequently fed into a second LLM, which tries to reconstruct the program from the natural-language explanation. Our metric then measures the behavioral similarity between the reconstructed program and the original. We validate our approach with synthesized and human-crafted programmatic policies for playing a real-time strategy game, comparing the interpretability scores of these programmatic policies to those of obfuscated versions of the same programs. Our LLM-based interpretability score consistently ranks less interpretable programs lower and more interpretable ones higher. These findings suggest that our metric could serve as a reliable and inexpensive tool for evaluating the interpretability of programmatic policies.
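
The explain, reconstruct, and compare loop described in the abstract can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration and is not taken from the paper: the toy `threshold:K` language, the stub functions that stand in for the two LLM calls, the choice to pass the language description to the reconstructor, and the definition of behavioral similarity as the fraction of sampled states on which two policies pick the same action.

```python
import re
from typing import Callable, Sequence

Policy = Callable[[int], str]  # maps a toy state (distance to enemy) to an action

def compile_program(text: str) -> Policy:
    """Hypothetical toy interpreter: 'threshold:K' attacks when distance <= K."""
    k = int(text.split(":")[1])
    return lambda distance: "attack" if distance <= k else "harvest"

def behavioral_similarity(a: Policy, b: Policy, states: Sequence[int]) -> float:
    """Assumed similarity measure: fraction of states where both policies agree."""
    if not states:
        raise ValueError("need at least one state")
    return sum(a(s) == b(s) for s in states) / len(states)

def interpretability_score(program, dsl, explain, reconstruct, states) -> float:
    """Explain the program, rebuild it from the explanation, compare behavior."""
    explanation = explain(program, dsl)        # stands in for the first LLM
    rebuilt = reconstruct(explanation, dsl)    # stands in for the second LLM
    return behavioral_similarity(
        compile_program(program), compile_program(rebuilt), states
    )

# Deterministic stubs replacing real LLM calls (illustrative only).
def clear_explainer(program: str, dsl: str) -> str:
    k = program.split(":")[1]
    return f"Attack when the enemy is within {k} cells; otherwise harvest."

def vague_explainer(program: str, dsl: str) -> str:
    # Mimics an explanation of a hard-to-read program: the threshold is lost.
    return "Sometimes attack; otherwise harvest."

def stub_reconstructor(explanation: str, dsl: str) -> str:
    match = re.search(r"\d+", explanation)
    return f"threshold:{match.group() if match else 0}"

DSL = "threshold:K means: attack if distance <= K, else harvest"
STATES = list(range(10))

clear_score = interpretability_score(
    "threshold:3", DSL, clear_explainer, stub_reconstructor, STATES
)
vague_score = interpretability_score(
    "threshold:3", DSL, vague_explainer, stub_reconstructor, STATES
)
print(clear_score, vague_score)
```

In this toy setting the explanation that preserves the threshold lets the reconstructor recover an equivalent program, while the vague explanation yields a program that disagrees with the original on some states and therefore scores lower.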
