ReLook: Vision-Grounded RL with a Multimodal LLM Critic for Agentic Web Coding

Yuhang Li, Chenchen Zhang, Ruilin Lv, Ao Liu, Ken Deng, Yuanxing Zhang, Jiaheng Liu, Wiggin Zhou, Bo Zhou

TL;DR

ReLook addresses the challenge of front-end code generation by introducing a vision-grounded reinforcement learning framework that closes a generate–diagnose–refine loop via a multimodal LLM critic. It uses zero rewards for invalid renders and a Forced Optimization strategy to ensure monotonically improving trajectories, enabling critic-free fast inference at test time. Across ArtifactsBench, FullStack-Bench-Html, and Web-Bench, ReLook significantly outperforms strong baselines, with ablations confirming the critical role of vision-based rewards and monotonic refinement. The approach highlights the practical value of perception-aware feedback in perceptual programming and shows promise for extending to other visual- and interaction-focused coding tasks.

Abstract

While Large Language Models (LLMs) excel at algorithmic code generation, they struggle with front-end development, where correctness is judged on rendered pixels and interaction. We present ReLook, an agentic, vision-grounded reinforcement learning framework that empowers an agent to close a robust generate–diagnose–refine loop by invoking a multimodal LLM (MLLM) as a tool. During training, the agent uses the MLLM-in-the-loop both as a visual critic (scoring code with screenshots) and as a source of actionable, vision-grounded feedback; a strict zero-reward rule for invalid renders anchors renderability and prevents reward hacking. To prevent behavioral collapse, we introduce Forced Optimization, a strict acceptance rule that admits only improving revisions, yielding monotonically better trajectories. At inference, we decouple the critic and run a lightweight, critic-free self-edit cycle, keeping latency comparable to base decoding while retaining most of the gains. Across three widely used benchmarks, ReLook consistently outperforms strong baselines in vision-grounded front-end code generation, highlighting the benefits of agentic perception, visual rewards, and training-inference decoupling.
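
The two training-time rules named in the abstract, the zero reward for invalid renders and the Forced Optimization acceptance rule, can be illustrated with a small sketch. This is not the authors' implementation: the names `Attempt`, `visual_reward`, and `forced_optimization`, and the `revise`, `render`, and `critic` callables, are hypothetical stand-ins for the policy, the renderer, and the MLLM critic. The sketch also assumes that a rejected revision is simply discarded, and it leaves out the format constraints that the paper's reward additionally includes.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Attempt:
    code: str
    score: float


def visual_reward(code: str,
                  render: Callable[[str], bool],
                  critic: Callable[[str], float]) -> float:
    """Zero-reward rule: a page that fails to render earns 0.0.

    The critic is only consulted when the render is valid, so no score
    can be earned for code that does not produce a page.
    """
    if not render(code):
        return 0.0
    return critic(code)


def forced_optimization(initial_code: str,
                        revise: Callable[[str], str],
                        render: Callable[[str], bool],
                        critic: Callable[[str], float],
                        max_rounds: int = 3) -> List[Attempt]:
    """Keep only strictly improving revisions (assumed acceptance rule).

    Returns the accepted trajectory; its scores are strictly increasing
    by construction.
    """
    best = Attempt(initial_code, visual_reward(initial_code, render, critic))
    trajectory = [best]
    for _ in range(max_rounds):
        candidate = revise(best.code)
        score = visual_reward(candidate, render, critic)
        if score > best.score:  # reject equal or worse revisions
            best = Attempt(candidate, score)
            trajectory.append(best)
    return trajectory
```

Because every accepted revision must beat the current best score, the returned trajectory is monotonically improving, which is the property the abstract attributes to Forced Optimization. At inference the abstract describes a critic-free self-edit cycle, so a loop like this would apply only during training.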

Paper Structure

This paper contains 54 sections, 8 equations, 7 figures, 2 tables, 1 algorithm.

Figures (7)

  • Figure 1: Overview of ReLook. Left: training closes a generate–diagnose–refine cycle in which the policy LLM generates code, pages are rendered to temporal screenshots, and a vision-aware critic (MLLM) provides scores and feedback. Rewards combine visual scoring and format constraints; the policy is optimized with GRPO. Right: at inference the model runs a lightweight Re-Look cycle; the external critic may be omitted for latency or used for higher accuracy.
  • Figure 2: Radar plot showing ReLook's consistent improvements across all ArtifactsBench subsets for both Qwen2.5-7B and Llama-3.1-8B backbones (averaged over 3 seeds).
  • Figure 3: Performance on ArtifactsBench-Lite showing consistent ordering: ReLook $>$ Web-RL $>$ Base Model. Results averaged over 3 seeds.
  • Figure 4: Behavioral collapse mitigation. Base model (Qwen2.5-7B-Instruct) degrades after initial attempts despite MLLM feedback, while ReLook exhibits monotonic improvement across eight forced reflection rounds on ArtifactsBench-Lite. Scores from training-time judge (Qwen2.5-VL-72B).
  • Figure 5: Intermediate results of RL training. The figure shows the average reward score on the validation set and the number of optimization steps during inference when training ReLook with Qwen2.5-7B-Instruct as the base model.
  • ...and 2 more figures