ReLook: Vision-Grounded RL with a Multimodal LLM Critic for Agentic Web Coding
Yuhang Li, Chenchen Zhang, Ruilin Lv, Ao Liu, Ken Deng, Yuanxing Zhang, Jiaheng Liu, Wiggin Zhou, Bo Zhou
TL;DR
ReLook addresses front-end code generation with a vision-grounded reinforcement learning framework that closes a generate–diagnose–refine loop using a multimodal LLM critic. During training, invalid renders receive zero reward, and a Forced Optimization rule accepts only improving revisions, so trajectories improve monotonically. At test time the critic is dropped in favor of a fast, critic-free self-edit cycle. Across ArtifactsBench, FullStack-Bench-Html, and Web-Bench, ReLook consistently outperforms strong baselines, with ablations confirming the critical role of vision-based rewards and monotonic refinement. The approach highlights the practical value of perception-aware feedback for code that is judged on its rendered output, and shows promise for extending to other visual- and interaction-focused coding tasks.
Abstract
While Large Language Models (LLMs) excel at algorithmic code generation, they struggle with front-end development, where correctness is judged on rendered pixels and interaction. We present ReLook, an agentic, vision-grounded reinforcement learning framework that empowers an agent to close a robust generate–diagnose–refine loop by invoking a multimodal LLM (MLLM) as a tool. During training, the agent uses the MLLM-in-the-loop both as a visual critic (scoring code with screenshots) and as a source of actionable, vision-grounded feedback; a strict zero-reward rule for invalid renders anchors renderability and prevents reward hacking. To prevent behavioral collapse, we introduce Forced Optimization, a strict acceptance rule that admits only improving revisions, yielding monotonically better trajectories. At inference, we decouple the critic and run a lightweight, critic-free self-edit cycle, keeping latency comparable to base decoding while retaining most of the gains. Across three widely used benchmarks, ReLook consistently outperforms strong baselines in vision-grounded front-end code generation, highlighting the benefits of agentic perception, visual rewards, and training-inference decoupling.
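The zero-reward rule and the Forced Optimization acceptance rule described above can be illustrated with a minimal sketch. This is not the authors' implementation. The callables `render`, `critic_score`, and `revise`, the `Attempt` record, and the `max_rounds` and `max_retries` budgets are all hypothetical stand-ins, and the abstract does not say how a rejected revision is handled; resampling up to a retry budget is an assumption made here.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Attempt:
    code: str
    reward: float


def visual_reward(code: str,
                  render: Callable[[str], Optional[bytes]],
                  critic_score: Callable[[str, bytes], float]) -> float:
    """Zero-reward rule: a page that fails to render scores exactly 0."""
    screenshot = render(code)  # hypothetical renderer; None means an invalid render
    if screenshot is None:
        return 0.0
    return critic_score(code, screenshot)  # hypothetical MLLM critic score


def forced_optimization(initial_code: str,
                        revise: Callable[[str], str],
                        reward_fn: Callable[[str], float],
                        max_rounds: int = 3,
                        max_retries: int = 4) -> List[Attempt]:
    """Admit only revisions whose reward strictly exceeds the last accepted one."""
    trajectory = [Attempt(initial_code, reward_fn(initial_code))]
    for _ in range(max_rounds):
        best = trajectory[-1]
        for _ in range(max_retries):  # assumed: resample after a rejection
            candidate = revise(best.code)
            reward = reward_fn(candidate)
            if reward > best.reward:  # strict acceptance rule
                trajectory.append(Attempt(candidate, reward))
                break
        else:
            break  # no improving revision within the budget; stop refining
    return trajectory
```

By construction, the rewards along the returned trajectory are strictly increasing, which is the monotonicity property the abstract attributes to Forced Optimization. At inference the critic would be absent, so a loop like this applies only to training in this sketch.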
