Are Large Language Models Reliable AI Scientists? Assessing Reverse-Engineering of Black-Box Systems

Jiayi Geng, Howard Chen, Dilip Arumugam, Thomas L. Griffiths

TL;DR

This work formalizes reverse-engineering as a core test for AI scientists by evaluating LLMs on three controlled black-box tasks: Program, Formal Language, and Math Equation. It shows that LLMs under mostly passive observation perform far below Bayesian inference, but active interventions substantially improve hypothesis testing and refinement, mitigating overcomplication and overlooking to some extent. However, even with interventions, LLMs do not consistently reach Bayesian-optimal performance, and benefits from intervention data are often model-specific with limited transfer to other LLMs. The findings provide practical guidance for designing LLM-assisted scientific workflows that emphasize active data collection and careful data sharing to enhance reliability in discovery tasks.

Abstract

Using AI to create autonomous researchers has the potential to accelerate scientific discovery. A prerequisite for this vision is understanding how well an AI model can identify the underlying structure of a black-box system from its behavior. In this paper, we explore how well a large language model (LLM) learns to identify a black-box function from passively observed versus actively collected data. We investigate the reverse-engineering capabilities of LLMs across three distinct types of black-box systems, each chosen to represent different problem domains where future autonomous AI researchers may have considerable impact: Program, Formal Language, and Math Equation. Through extensive experiments, we show that LLMs fail to extract information from observations, reaching a performance plateau that falls short of the ideal of Bayesian inference. However, we demonstrate that prompting LLMs to not only observe but also intervene -- actively querying the black-box with specific inputs to observe the resulting output -- improves performance by allowing LLMs to test edge cases and refine their beliefs. By providing the intervention data from one LLM to another, we show that this improvement is partly a result of engaging in the process of generating effective interventions, paralleling results in the literature on human learning. Further analysis reveals that engaging in intervention can help LLMs escape from two common failure modes: overcomplication, where the LLM falsely assumes prior knowledge about the black-box, and overlooking, where the LLM fails to incorporate observations. These insights provide practical guidance for helping LLMs more effectively reverse-engineer black-box systems, supporting their use in making new discoveries.
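
The contrast between passive observation and active intervention can be illustrated with a toy Bayesian learner. The sketch below is not the paper's setup: the hypothesis space (lines y = a*x + b with small integer coefficients), the noise-free likelihood, the query-selection rule (pick the input whose predicted output has maximum entropy), and all names such as `HYPOTHESES`, `learn`, and `choose_intervention` are assumptions made purely for illustration.

```python
import itertools
import math
from collections import Counter

# Assumed toy hypothesis space: y = a*x + b with a, b in {-3, ..., 3}.
HYPOTHESES = list(itertools.product(range(-3, 4), repeat=2))


def predict(h, x):
    a, b = h
    return a * x + b


def update(posterior, x, y):
    """Bayes' rule with a noise-free likelihood: keep hypotheses that reproduce (x, y)."""
    kept = {h: p for h, p in posterior.items() if predict(h, x) == y}
    total = sum(kept.values())
    return {h: p / total for h, p in kept.items()}


def outcome_entropy(posterior, x):
    """Entropy (bits) of the predicted output at x; equals the expected
    information gain of querying x when the black box is deterministic."""
    mass = Counter()
    for h, p in posterior.items():
        mass[predict(h, x)] += p
    return -sum(p * math.log2(p) for p in mass.values() if p > 0)


def choose_intervention(posterior, candidates):
    """Pick the query whose outcome is most uncertain under the current beliefs."""
    return max(candidates, key=lambda x: outcome_entropy(posterior, x))


def learn(hidden, passive_inputs=(), n_interventions=0, candidates=range(-3, 4)):
    """Start from a uniform prior, absorb passive data, then query actively."""
    posterior = {h: 1 / len(HYPOTHESES) for h in HYPOTHESES}
    for x in passive_inputs:  # inputs the learner did not choose
        posterior = update(posterior, x, predict(hidden, x))
    for _ in range(n_interventions):  # inputs the learner chooses itself
        x = choose_intervention(posterior, candidates)
        posterior = update(posterior, x, predict(hidden, x))
    return posterior


if __name__ == "__main__":
    hidden = (2, -1)  # the black box: y = 2x - 1
    passive = learn(hidden, passive_inputs=[0, 0])
    active = learn(hidden, n_interventions=2)
    print("hypotheses left after 2 redundant passive observations:", len(passive))
    print("hypotheses left after 2 chosen interventions:", len(active))
```

In this toy case, two redundant passive observations at x = 0 pin down only the intercept, while two chosen queries isolate the hidden function. This shows only why the choice of inputs matters for an ideal Bayesian learner; it says nothing about how an LLM would behave on the same data.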

Paper Structure

This paper contains 57 sections, 15 equations, 10 figures, and 5 tables.

Figures (10)

  • Figure 1: Reverse-engineering. Left: Defining the problem. The AI scientist either obtains passive observations from the black box or collects data through active intervention to construct a hypothesis. Right (top): with only passive observations, the LLM cannot make effective use of the data and lags behind Bayesian inference by a large margin; allowing the LLM to intervene improves performance. Right (bottom): effective intervention can mitigate two common failure modes: overcomplication and overlooking.
  • Figure 2: Observation-only results across three black-box types. We compare GPT-4o performance (blue) to Bayesian inference (green). The horizontal axis shows the number of provided $(x, y)$ pairs. We report 1 - RMSE for Math Equation and the descriptive score for Program and Formal Language.
  • Figure 3: Observation-intervention results across three black-box types. Red: observations and interventions by GPT-4o. Yellow: Bayesian inference algorithms given the observation-intervention data collected by GPT-4o as their observations. Dashed lines: observation-only reference for GPT-4o (blue) and Bayesian inference (green).
  • Figure 4: Comparing intervention-yoked results with observation-only and observation-intervention across three black-box types.
  • Figure 5: Descriptive scores for five different complexity levels. Averaged across three seeds for each of the three black-box types.
  • ...and 5 more figures
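
The Figure 2 caption reports 1 - RMSE for the Math Equation task. A minimal sketch of such a score follows, assuming it is computed between the black box's outputs and a hypothesized equation's outputs on a shared set of inputs. The summary does not say what the RMSE is taken over or whether it is normalized, so the function name `one_minus_rmse`, the choice of inputs, and the absence of any scaling or clipping are all assumptions.

```python
import math


def one_minus_rmse(black_box, hypothesis, inputs):
    """Return 1 - RMSE between two functions evaluated on the same inputs.

    Assumption: raw, unnormalized outputs are compared, so the score can
    drop below 0 when the hypothesis is far from the black box.
    """
    squared_errors = [(black_box(x) - hypothesis(x)) ** 2 for x in inputs]
    return 1.0 - math.sqrt(sum(squared_errors) / len(squared_errors))


if __name__ == "__main__":
    def black_box(x):
        # Hypothetical hidden equation.
        return 2 * x + 1

    def guess(x):
        # Hypothetical recovered equation with the intercept off by 0.5.
        return 2 * x + 1.5

    xs = [0, 1, 2, 3]
    print(one_minus_rmse(black_box, black_box, xs))  # perfect recovery scores 1.0
    print(one_minus_rmse(black_box, guess, xs))
```

Under this reading, higher is better and a perfect hypothesis scores exactly 1.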