Spot The Ball: A Benchmark for Visual Social Inference
Neha Balamurugan, Sarah Wu, Adam Chun, Gabe Gaw, Cristobal Eyzaguirre, Tobias Gerstenberg
TL;DR
Spot the Ball introduces a visually grounded social inference benchmark to evaluate vision-language models on inferring hidden objects from cues like gaze and pose in sports images. The authors assemble a 150-image evaluation set and a scalable pipeline that generates thousands of additional items, enabling broad multi-model evaluation across three ball sports. Four VLMs are tested under Base, Cue-Directed, and Chain-of-Thought prompts against human baselines, revealing a substantial human–model gap and showing that models rely on simple cues such as center bias while humans leverage richer social signals. The findings highlight the need for architectures that explicitly encode social-behavior priors and relational dynamics for robust, human-like visual reasoning in real-world scenes.
Abstract
Humans excel at visual social inference, the ability to infer hidden elements of a scene from subtle behavioral cues such as other people's gaze, pose, and orientation. This ability drives everyday social reasoning in humans and is critical for developing more human-like AI agents. We introduce Spot The Ball, a challenging benchmark for evaluating visual social inference in vision-language models (VLMs) using sports as a test domain. The task is to localize a removed sports ball from soccer, basketball, and volleyball images. We present a curated evaluation set with human baselines and a scalable pipeline for generating additional test items. We evaluate four state-of-the-art VLMs (Gemini, GPT, LLaMA, Qwen) using three prompting strategies, finding that humans are consistently two to three times more accurate (20-34%) than models ($\leq$ 17%) across all sports. Our analyses show that models rely on superficial spatial heuristics, such as guessing near the image center or nearby players, while humans leverage social cues like gaze direction and body pose. These findings reveal a persistent human-model gap in visual social reasoning and underscore the need for architectures that explicitly encode structured behavioral cues to achieve robust, human-like inference.
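To make the two quantities discussed above concrete, the following is a minimal sketch of how localization accuracy and a center-bias measure could be computed. It assumes, purely for illustration, that each guess and each ground-truth ball location is reduced to a `(row, col)` cell on a grid laid over the image; the abstract does not specify the response format or scoring rule, and the function names `cell_accuracy` and `mean_distance_to_center` are hypothetical.

```python
from math import hypot


def cell_accuracy(predictions, targets):
    """Fraction of items whose predicted grid cell equals the true cell.

    Assumption: guesses and ground truth are (row, col) tuples on a shared
    grid. This is an illustrative scoring rule, not the benchmark's own.
    """
    if len(predictions) != len(targets):
        raise ValueError("predictions and targets must have the same length")
    if not targets:
        raise ValueError("at least one item is required")
    hits = sum(p == t for p, t in zip(predictions, targets))
    return hits / len(targets)


def mean_distance_to_center(predictions, n_rows, n_cols):
    """Average Euclidean distance (in cell units) from guesses to grid center.

    Smaller values indicate guesses clustered near the image center, which is
    one simple way to quantify a center-bias heuristic.
    """
    if not predictions:
        raise ValueError("at least one prediction is required")
    center_row = (n_rows - 1) / 2
    center_col = (n_cols - 1) / 2
    total = sum(hypot(r - center_row, c - center_col) for r, c in predictions)
    return total / len(predictions)
```

Under these assumptions, comparing `cell_accuracy` for human and model responses gives the kind of accuracy gap reported above, and a markedly lower `mean_distance_to_center` for models than for humans would be consistent with a reliance on center guessing.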
