GameUIAgent: An LLM-Powered Framework for Automated Game UI Design with Structured Intermediate Representation

Wei Zeng, Fengwei An, Zhen Liu, Jian Zhao

Abstract

Game UI design requires consistent visual assets across rarity tiers yet remains a predominantly manual process. We present GameUIAgent, an LLM-powered agentic framework that translates natural language descriptions into editable Figma designs via a Design Spec JSON intermediate representation. A six-stage neuro-symbolic pipeline combines LLM generation, deterministic post-processing, and a Vision-Language Model (VLM)-guided Reflection Controller (RC) for iterative self-correction with guaranteed non-regressive quality. We evaluate the framework across 110 test cases, three LLMs, and three UI templates; the cross-model analysis establishes a game-domain failure taxonomy (rarity-dependent degradation; visual emptiness) and uncovers two key empirical findings. A Quality Ceiling Effect (Pearson r=-0.96, p<0.01) suggests that RC improvement is bounded by headroom below a quality threshold, a visual-domain counterpart to test-time compute scaling laws. A Rendering-Evaluation Fidelity Principle reveals that partial rendering enhancements paradoxically degrade VLM evaluation by amplifying structural defects. Together, these results establish foundational principles for LLM-driven visual generation agents in game production.
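The abstract's Reflection Controller with non-regressive quality can be illustrated with a minimal Python sketch. This is not the authors' implementation. It assumes the guarantee is obtained by keeping the best-scoring Design Spec seen so far, which the abstract does not confirm. The names `reflect`, `generate`, `evaluate`, `theta`, and `max_rounds`, the default values, and the repair-prompt wording are all hypothetical.

```python
from statistics import fmean
from typing import Callable, List, Tuple


def reflect(
    generate: Callable[[str], dict],         # hypothetical: prompt -> Design Spec JSON (as a dict)
    evaluate: Callable[[dict], List[float]], # hypothetical: Design Spec -> per-criterion VLM scores
    prompt: str,
    theta: float = 8.0,                      # assumed quality threshold, not a value from the paper
    max_rounds: int = 3,                     # assumed retry budget
) -> Tuple[dict, float]:
    """Regenerate until the average score reaches theta; never return a worse design."""
    best_spec = generate(prompt)
    best_score = fmean(evaluate(best_spec))

    for _ in range(max_rounds):
        if best_score >= theta:
            break  # quality threshold reached, stop reflecting

        # Turn the current score into a repair prompt (wording is illustrative only).
        repair_prompt = (
            f"{prompt}\nPrevious average score: {best_score:.2f}. "
            "Fix the weakest aspects of the design."
        )
        candidate = generate(repair_prompt)
        candidate_score = fmean(evaluate(candidate))

        # Keep-best rule: a candidate replaces the incumbent only if it scores higher,
        # so the returned score can never fall below the initial score.
        if candidate_score > best_score:
            best_spec, best_score = candidate, candidate_score

    return best_spec, best_score
```

Under this keep-best assumption a failed repair attempt is simply discarded, so the loop costs extra model calls but cannot lower the final score.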

Paper Structure (16 sections, 6 figures, 4 tables, 1 algorithm)

Figures (6)

  • Figure 1: Overall architecture of GameUIAgent. Stages 1--5 constitute the forward generation pass; Stage 6 (Reflection Controller, green) closes the agentic loop by converting VLM quality scores into targeted repair prompts and re-invoking Stage 2 if $S_{\text{avg}}\!<\!\theta$. The Design Spec JSON intermediate representation (dashed box) separates creative generation from deterministic enhancement.
  • Figure 2: Rarity progression (N to UR) generated by GameUIAgent with DeepSeek V3. Progressively richer visual treatments are injected automatically: simple gray borders (N) $\to$ colored theme borders (R) $\to$ gradient fills with glow (SR) $\to$ multi-layered borders with star badges (SSR) $\to$ golden ornate frames (UR). All five cards share the Fire element theme, demonstrating consistent thematic adaptation across the rarity hierarchy.
  • Figure 3: Cross-model analysis. (A) Failure cases (left panel): (a) Gemini UR-tier degradation ($V\!=\!60\%$); (b--c) GPT-4o-mini visual emptiness on Item Thumbnail and Skill Panel; (d) DeepSeek V3 UR reference. (B) R-tier comparison (right panel, enlarged for clarity): DeepSeek V3 (left, $S\!=\!8.0$) produces coherent themed borders, readable color-coded stats, and a well-structured portrait region; GPT-4o-mini (right, $S\!=\!6.7$) yields a syntactically valid design but with low-contrast text, blank portrait region, and element overlap.
  • Figure 4: Rendering-Evaluation Fidelity: the Gradient renderer decreases VLM score by $-1.00$ ($p\!=\!0.0017$, $d\!=\!-0.81$) relative to the Flat baseline; the Layout-aware renderer recovers and surpasses it ($\Delta\!=\!+2.52$, $p\!<\!0.001$, $d\!=\!2.74$). $n\!=\!27$ Skill Panel designs evaluated across three renderer configurations with identical Design Spec JSONs.
  • Figure 5: Quality Ceiling Effect: Pearson $r\!=\!-0.96$ ($p\!<\!0.01$) between mean initial VLM score and RC improvement ($\Delta$) across five independent conditions ($n_{\text{total}}\!=\!93$; DeepSeek V3 and GPT-4o-mini; five tier mixes). RC gain is universally bounded by headroom below $\theta$, not by model capability or rarity complexity.
  • ...and 1 more figure
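The Figure 5 caption reports a Pearson correlation between mean initial VLM score and RC improvement. The sketch below shows how such a statistic could be computed and why a headroom-bounded gain gives a strongly negative value. The numbers are synthetic placeholders, not the paper's data. The threshold `theta = 8.0`, the rule that gain is half the headroom, and the function names are assumptions made for illustration.

```python
from math import sqrt
from typing import Sequence


def pearson_r(x: Sequence[float], y: Sequence[float]) -> float:
    """Sample Pearson correlation coefficient of two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    if norm_x == 0 or norm_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / (norm_x * norm_y)


def headroom(initial_score: float, theta: float) -> float:
    """Room left below the threshold; zero once the score meets or exceeds theta."""
    return max(0.0, theta - initial_score)


# Synthetic illustration only: five made-up conditions and an assumed threshold.
theta = 8.0
initial_scores = [5.0, 6.0, 7.0, 8.0, 9.0]
# Assumed toy model: each condition recovers half of its available headroom.
gains = [0.5 * headroom(s, theta) for s in initial_scores]

r = pearson_r(initial_scores, gains)
print(f"Pearson r between initial score and gain: {r:.2f}")
```

In this toy model, conditions that start at or above the threshold have no headroom and gain nothing, which is the pattern the caption describes as RC gain being bounded by headroom below $\theta$.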