
MIRAGE: Online LLM Simulation for Microservice Dependency Testing

XinRan Zhang

Abstract

Existing approaches to microservice dependency simulation--record-replay, pattern-mining, and specification-driven stubs--generate static artifacts before test execution. We propose online LLM simulation, a runtime approach where the LLM directly answers each dependency request as it arrives, maintaining cross-request state throughout a test scenario. No mock specification is pre-generated; the model reads the dependency's source code, caller code, and production traces, then simulates dependency behavior on demand. We instantiate this approach in MIRAGE and evaluate it on 110 test scenarios spanning 14 caller-dependency pairs across three microservice systems (Google's Online Boutique, Weaveworks' Sock Shop, and a custom system). In white-box mode (dependency source available), MIRAGE achieves 99% status-code fidelity (109/110) and 99% response-shape fidelity, compared to 62% / 16% for record-replay. End-to-end, caller integration tests produce the same pass/fail outcomes with MIRAGE as with real dependencies (8/8 scenarios). A signal ablation shows dependency source code is often sufficient for high-fidelity runtime simulation (100% alone); without it, the model still infers correct error codes (94%) but loses response-structure accuracy (75%). Constraining LLM output through typed intermediate representations reduces fidelity on complex stateful services (55%) while performing adequately on simple APIs (86%), suggesting that the runtime approach's implicit state tracking matters for behavioral complexity. Results are stable across three LLM families (within 3%) at $0.16 to $0.82 per dependency.

Paper Structure

This paper contains 41 sections, 3 figures, and 9 tables.

Figures (3)

  • Figure 1: Main results. (a) Status-code fidelity: Mirage (WB) achieves 99% across all three benchmarks; replay achieves 62% (also all three); Pattern and IR are Demo-only (marked with *). (b) Response-shape fidelity (subset with body data): replay returns structurally wrong responses (16%) while Mirage (WB) matches real body shapes on 99% of scenarios. In the figure legend, TG denotes Mirage.
  • Figure 2: Structured IR vs. online simulation on Demo by category. IR achieves only 29% on stateful scenarios while Mirage achieves 100%. IR helps on error handling (88%) but fails where implicit state tracking is needed.
  • Figure 3: Signal ablation (OB+SS combined). Status fidelity remains ≥92% across all signal configurations, but body-shape fidelity drops from 100% to 53--69% without dependency source code. The body metric captures a quality gradient invisible to status codes.

Theorems & Definitions (1)

  • Definition 1: caller-adequate simulator
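The abstract describes the core runtime loop: each dependency request is answered on demand by a model that has read the dependency's source code and the scenario's prior exchanges, with cross-request state carried implicitly in the accumulated context. The sketch below illustrates that loop under stated assumptions; the class and method names (`OnlineSimulator`, `handle`) are hypothetical and do not come from the paper, and the LLM is replaced by a deterministic toy function so the example is self-contained.

```python
import json

class OnlineSimulator:
    """Minimal sketch of online dependency simulation (hypothetical
    interface; MIRAGE's real API is not shown in this excerpt)."""

    def __init__(self, model, dependency_source, traces=""):
        self.model = model          # callable: prompt string -> JSON string
        self.history = []           # cross-request state for the scenario
        self.context = f"SOURCE:\n{dependency_source}\nTRACES:\n{traces}"

    def handle(self, request):
        # No pre-generated mock: each request is answered as it arrives,
        # with all prior requests/responses included in the prompt.
        req_line = f"REQUEST: {json.dumps(request)}"
        prompt = "\n".join([self.context, *self.history, req_line])
        reply = self.model(prompt)  # the LLM simulates the dependency here
        self.history.append(req_line)
        self.history.append(f"RESPONSE: {reply}")
        return json.loads(reply)

# Stand-in for the LLM: a toy "cart service" whose answer depends on how
# many add-operations appear earlier in the scenario (this is what makes
# the cross-request statefulness visible).
def toy_model(prompt):
    items = prompt.count('"op": "add"')
    return json.dumps({"status": 200, "body": {"item_count": items}})

sim = OnlineSimulator(toy_model, dependency_source="# cart service source")
sim.handle({"op": "add", "item": "sock"})
resp = sim.handle({"op": "add", "item": "shoe"})
print(resp["body"]["item_count"])   # 2: state accumulated across requests
```

The point of the sketch is the absence of a static artifact: a record-replay mock would return whatever was captured, whereas here the second response differs from the first because the scenario history is part of every prompt.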