ComboBench: Can LLMs Manipulate Physical Devices to Play Virtual Reality Games?
Shuqing Li, Jiayi Yan, Chenyu Niu, Jen-tse Huang, Yun Peng, Wenxuan Wang, Yepang Liu, Michael R. Lyu
TL;DR
This work presents ComboBench, a benchmark that evaluates whether large language models can translate high-level semantic goals in VR games into fine-grained, device-level manipulations. The authors annotate 262 semantic actions across four VR titles, apply a six-dimension cognitive capability taxonomy, and assess multiple LLMs (including GPT-4o, GPT-4-turbo, GPT-3.5-turbo, Gemini-1.5-Pro, LLaMA-3-8B, and Mixtral-8x7B) against ground-truth sequences and human performance. Models show strong task decomposition but sizable gaps in motor action mapping and procedural reasoning. Few-shot demonstrations markedly improve temporal sequencing (SOP), while exact step matching (SSM) remains relatively limited. The findings indicate that multimodal, embodied training and finer-grained evaluation are needed to approach human-level VR manipulation, and they suggest directions for future embodied AI research in virtual environments. Overall, ComboBench offers a multi-metric framework for diagnosing where current LLMs succeed or fail in grounded VR interaction, and it motivates integrating spatial, visual, and proprioceptive information into model training.
Abstract
Virtual Reality (VR) games require players to translate high-level semantic actions into precise device manipulations using controllers and head-mounted displays (HMDs). While humans perform this translation intuitively, drawing on common sense and embodied understanding, whether Large Language Models (LLMs) can replicate this ability remains underexplored. This paper introduces ComboBench, a benchmark that evaluates LLMs' ability to translate semantic actions into VR device manipulation sequences across 262 scenarios from four popular VR games: Half-Life: Alyx, Into the Radius, Moss: Book II, and Vivecraft. We evaluate seven LLMs (GPT-3.5, GPT-4, GPT-4o, Gemini-1.5-Pro, LLaMA-3-8B, Mixtral-8x7B, and GLM-4-Flash) against annotated ground truth and human performance. Our results reveal that while top-performing models such as Gemini-1.5-Pro demonstrate strong task decomposition capabilities, they still lag behind humans in procedural reasoning and spatial understanding. Performance varies significantly across games, suggesting sensitivity to interaction complexity. Few-shot examples substantially improve performance, indicating potential for targeted enhancement of LLMs' VR manipulation capabilities. We release all materials at https://sites.google.com/view/combobench.
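To make concrete how a predicted manipulation sequence can be compared against an annotated ground truth, the following is a minimal sketch and not the benchmark's actual scoring code. The two functions are hypothetical stand-ins (a position-wise exact match and a longest-common-subsequence order score); they are not the paper's definitions of SSM or SOP, and the step names are invented for illustration.

```python
from typing import List


def exact_step_match(pred: List[str], truth: List[str]) -> float:
    """Fraction of positions (over the longer sequence) where the steps are identical.

    Hypothetical stand-in for a strict, position-wise comparison.
    """
    if not pred and not truth:
        return 1.0
    hits = sum(p == t for p, t in zip(pred, truth))
    return hits / max(len(pred), len(truth))


def order_preservation(pred: List[str], truth: List[str]) -> float:
    """Longest common subsequence length divided by the ground-truth length.

    Hypothetical stand-in for an order-aware comparison that tolerates
    extra or missing steps.
    """
    if not truth:
        return 1.0 if not pred else 0.0
    m, n = len(pred), len(truth)
    # dp[i][j] = LCS length of pred[:i] and truth[:j]
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if pred[i - 1] == truth[j - 1]:
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    return dp[m][n] / n


# Invented device-level steps for a semantic action such as "pick up an object".
truth = [
    "move_right_controller_to_object",
    "press_right_grip",
    "raise_right_controller",
    "release_right_grip",
]
# The prediction adds one extra leading step but keeps the correct order.
pred = ["look_at_object"] + truth

print(exact_step_match(pred, truth))    # 0.0: every position is shifted by one
print(order_preservation(pred, truth))  # 1.0: all ground-truth steps appear in order
```

Under these assumed definitions, a single extra step drives the position-wise score to zero while the order-aware score stays at its maximum, which illustrates why a sequencing metric and an exact-matching metric can respond very differently to the same prediction.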
