The Observability Gap: Why Output-Level Human Feedback Fails for LLM Coding Agents

Yinghao Wang, Cheng Wang

Abstract

Large language model (LLM) multi-agent coding systems typically fix agent capabilities at design time. We study an alternative setting, earned autonomy, in which a coding agent starts with zero pre-defined functions and incrementally builds a reusable function library through lightweight human feedback on visual output alone. We evaluate this setup in a Blender-based 3D scene generation task requiring both spatial reasoning and programmatic geometric control. Although the agent rediscovered core utility functions comparable to a human reference implementation, it achieved 0% full-scene success under output-only feedback across multiple instruction granularities, where success required satisfying object completeness, ground contact, collision avoidance, and scale plausibility simultaneously. Our analysis identifies a structural observability gap: bugs originate in code logic and execution state, while human evaluation occurs only at the output layer, and the many-to-one mapping from internal states to visible outcomes prevents symptom-level feedback from reliably identifying root causes. This mismatch leads to persistent failure mode oscillation rather than convergence. A diagnostic intervention that injected minimal code-level knowledge restored convergence, strongly supporting the interpretation that the main bottleneck lies in feedback observability rather than programming competence. We formalize this phenomenon as a feedback paradox in domains with deep causal chains between internal code logic and perceptual outcomes, and argue that effective human-agent collaboration in such settings requires intermediate observability beyond output-only evaluation.

Figures (3)

  • Figure 1: Overview of the generate–evaluate–evolve loop in the earned autonomy setting. In each cycle, a coding agent generates code, an execution agent runs the script and renders the scene, a human evaluator provides feedback on the rendered output only, and a review agent promotes validated functions into the reusable library for the next cycle. The example illustrates library growth from 0 validated functions in Cycle 1 to 15 in Cycle 2; the first sketch below gives a schematic of this loop.
  • Figure 2: Failure-mode oscillation across three consecutive runs in Group C. The task specifies BVH-based collision detection against “all mesh objects in the scene”. Left: the ground plane is included in the collision list, so every candidate placement intersects at z=0, causing a timeout before any trees or cars are placed; from the rendered output alone, this appears as missing objects. Center: after feedback on the missing objects, the agent restructures its import and placement logic and places the trees and cars, but now excludes the house from collision checks, producing overlap. Right: feedback on the overlap causes the agent to re-include all meshes in collision checks, re-triggering the original timeout failure; the second sketch below illustrates this collision check.
  • Figure 3: Example of a successful scene generated after the diagnostic intervention in Group D. With the ground plane excluded from collision checks, the system can satisfy object completeness, ground contact, collision avoidance, and scale plausibility simultaneously; the third sketch below spells out these four criteria.
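
To make the cycle in Figure 1 concrete, the following is a minimal schematic of the generate–evaluate–evolve loop under output-only feedback. Every class, method, and parameter name here (FunctionLibrary, run_cycle, generate, run_and_render, extract_validated) is an illustrative assumption; the paper does not publish this interface.

```python
# Minimal schematic of the generate-evaluate-evolve loop in Figure 1.
# All names are illustrative assumptions, not the paper's actual code.
from dataclasses import dataclass, field


@dataclass
class FunctionLibrary:
    """Reusable functions promoted by the review agent across cycles."""
    validated: dict = field(default_factory=dict)  # name -> source code

    def promote(self, name, source):
        self.validated[name] = source


def run_cycle(library, coding_agent, executor, reviewer, human_feedback):
    """One generate-evaluate-evolve cycle under output-only feedback."""
    # 1. The coding agent writes a script, reusing the current library.
    script = coding_agent.generate(library.validated)
    # 2. The execution agent runs the script and renders the scene.
    image = executor.run_and_render(script)
    # 3. The human evaluator sees only the rendered image, never the
    #    code or execution state: this is the observability gap.
    feedback = human_feedback(image)
    # 4. The review agent promotes functions it judges reusable, so the
    #    library grows across cycles (e.g., 0 -> 15 validated functions).
    for name, source in reviewer.extract_validated(script, feedback):
        library.promote(name, source)
    return feedback
```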
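The oscillation in Figure 2 hinges on which meshes enter the BVH collision list. Below is a hedged Blender-Python sketch of such a check (it runs only inside Blender, via bpy, bmesh, and mathutils.bvhtree); the object name "GroundPlane" and all helper names are assumptions for illustration, not the system's actual identifiers.

```python
# Hedged sketch of the BVH-based collision check behind Figure 2.
# Runs only inside Blender; names like "GroundPlane" are assumed.
import bpy
import bmesh
from mathutils.bvhtree import BVHTree


def world_space_bvh(obj):
    """Build a BVH tree for a mesh object in world coordinates."""
    bm = bmesh.new()
    bm.from_mesh(obj.data)
    bm.transform(obj.matrix_world)  # move vertices into world space
    tree = BVHTree.FromBMesh(bm)
    bm.free()
    return tree


def collides(candidate, obstacles):
    """True if the candidate's mesh intersects any obstacle's mesh."""
    tree = world_space_bvh(candidate)
    return any(tree.overlap(world_space_bvh(o)) for o in obstacles)


def placement_obstacles(candidate, include_ground):
    """Collect collision targets for a candidate placement.

    include_ground=True reproduces the Figure 2 (left) failure: the
    task's literal reading ("all mesh objects in the scene") keeps the
    ground plane in the list, so anything resting at z = 0 always
    registers a collision and placement retries until timeout.
    include_ground=False is the fix behind Figure 3.
    """
    objs = [o for o in bpy.context.scene.objects
            if o.type == 'MESH' and o is not candidate]
    if not include_ground:
        objs = [o for o in objs if o.name != "GroundPlane"]
    return objs
```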
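Finally, a sketch of the four scene-level success criteria that, per the abstract, must hold simultaneously. The height ranges and the ground-contact tolerance are invented placeholders (the paper does not state its thresholds), and the sketch reuses `collides` from the previous block.

```python
# Sketch of the four success criteria from the abstract and Figure 3.
# Height ranges and tolerances are invented placeholders; `collides`
# is the helper from the previous sketch. Runs only inside Blender.
from mathutils import Vector

# Assumed plausible world-space heights in metres, per object category.
PLAUSIBLE_HEIGHT = {"house": (3.0, 12.0), "tree": (2.0, 15.0), "car": (1.0, 2.5)}


def world_min_z(obj):
    """Lowest world-space z of an object's bounding box."""
    return min((obj.matrix_world @ Vector(corner)).z for corner in obj.bound_box)


def scene_success(objects, expected_names, ground_eps=0.01):
    """All four criteria must hold at once; output-only feedback hit 0%."""
    names = {o.name for o in objects}
    complete = expected_names <= names                  # object completeness
    grounded = all(abs(world_min_z(o)) <= ground_eps    # ground contact at z = 0
                   for o in objects)
    no_overlap = not any(collides(a, [b])               # pairwise collision avoidance
                         for i, a in enumerate(objects)
                         for b in objects[i + 1:])
    plausible = all(lo <= o.dimensions.z <= hi          # scale plausibility
                    for o in objects
                    for category, (lo, hi) in PLAUSIBLE_HEIGHT.items()
                    if category in o.name.lower())
    return complete and grounded and no_overlap and plausible
```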