Spot The Ball: A Benchmark for Visual Social Inference

Neha Balamurugan, Sarah Wu, Adam Chun, Gabe Gaw, Cristobal Eyzaguirre, Tobias Gerstenberg

TL;DR

Spot the Ball introduces a visually grounded social inference benchmark to evaluate vision-language models on inferring hidden objects from cues like gaze and pose in sports images. The authors assemble a 150-image evaluation set and a scalable pipeline that generates thousands of additional items, enabling broad multi-model evaluation across three ball sports. Four VLMs are tested under Base, Cue-Directed, and Chain-of-Thought prompts against human baselines, revealing a substantial human–model gap and showing that models rely on simple cues such as center bias while humans leverage richer social signals. The findings highlight the need for architectures that explicitly encode social-behavior priors and relational dynamics for robust, human-like visual reasoning in real-world scenes.

Abstract

Humans excel at visual social inference, the ability to infer hidden elements of a scene from subtle behavioral cues such as other people's gaze, pose, and orientation. This ability drives everyday social reasoning in humans and is critical for developing more human-like AI agents. We introduce Spot The Ball, a challenging benchmark for evaluating visual social inference in vision-language models (VLMs) using sports as a test domain. The task is to localize a removed sports ball from soccer, basketball, and volleyball images. We present a curated evaluation set with human baselines and a scalable pipeline for generating additional test items. We evaluate four state-of-the-art VLMs (Gemini, GPT, LLaMA, Qwen) using three prompting strategies, finding that humans are consistently two to three times more accurate (20-34%) than models ($\leq$ 17%) across all sports. Our analyses show that models rely on superficial spatial heuristics--such as guessing near the image center or nearby players--while humans leverage social cues like gaze direction and body pose. These findings reveal a persistent human-model gap in visual social reasoning and underscore the need for architectures that explicitly encode structured behavioral cues to achieve robust, human-like inference.
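
The accuracy figures above can be put in context with a short sketch. It assumes, hypothetically, that a response counts as correct only when the guessed grid cell equals the true cell, and it uses the 6×10 grid mentioned in the Figure 2 caption to derive a uniform-guessing chance level. The function names, the exact-match scoring rule, and the percentile bootstrap settings are illustrative assumptions, not the authors' evaluation code.

```python
import random

# Assumption: 60 cells in total, taken from the "6×10 grid" in the Figure 2 caption.
N_CELLS = 6 * 10
CHANCE = 1 / N_CELLS  # uniform random guessing, about 1.7%


def accuracy(guesses, truths):
    """Fraction of items whose guessed cell equals the true cell (assumed metric)."""
    if len(guesses) != len(truths) or not truths:
        raise ValueError("guesses and truths must be non-empty and equally long")
    return sum(g == t for g, t in zip(guesses, truths)) / len(truths)


def bootstrap_ci(guesses, truths, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for accuracy, resampling items."""
    rng = random.Random(seed)
    hits = [g == t for g, t in zip(guesses, truths)]
    n = len(hits)
    stats = []
    for _ in range(n_resamples):
        # Draw n items with replacement and recompute accuracy on the resample.
        resample = [hits[rng.randrange(n)] for _ in range(n)]
        stats.append(sum(resample) / n)
    stats.sort()
    lo = stats[int((alpha / 2) * n_resamples)]
    hi = stats[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi


# Toy data: cells are (row, col) tuples; two of four guesses are correct.
toy_guesses = [(0, 0), (1, 1), (2, 2), (3, 3)]
toy_truths = [(0, 0), (1, 1), (5, 5), (0, 9)]
toy_acc = accuracy(toy_guesses, toy_truths)
toy_lo, toy_hi = bootstrap_ci(toy_guesses, toy_truths)
```

Under these assumptions, the reported model ceiling of 17% and the human range of 20-34% both sit well above the roughly 1.7% chance level, so the gap reflects differences in above-chance performance and not a floor effect.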

Paper Structure

This paper contains 39 sections, 11 equations, 11 figures, and 9 tables.

Figures (11)

  • Figure 1: Overview of the Spot the Ball task. Given an image with the ball removed, humans and models infer the likely location by reasoning about player pose and gaze. Models are prompted under three conditions, whereas humans receive only the base prompt.
  • Figure 2: Pipeline for constructing the Spot the Ball dataset. We retrieve and filter sports footage from YouTube by alignment to the prompts, detect players and balls using an object detector, and inpaint the ball region with Stable Diffusion before overlaying a 6×10 grid for location annotation.
  • Figure 3: Player density and coverage across sports in the Spot the Ball dataset. In soccer frames (A), player count and player area fall between those of the other two sports. Volleyball frames (B) feature the most players, but each occupies a smaller visual area, providing weaker pose and gaze cues. Basketball frames (C) have fewer yet larger players, offering clearer postural and gaze information.
  • Figure 4: Accuracy. Model accuracy in each sport under different prompting strategies (blue $=$ base prompt, green $=$ cue-directed prompt, red $=$ chain-of-thought prompt). The dashed line shows human accuracy using the base prompt in the given sport. Error bars and gray ribbon show 95% bootstrapped confidence intervals.
  • Figure 5: Player Proximity Analysis. Each point corresponds to a model–sport combination. The $x$-axis shows the fraction of guesses within a fixed distance threshold of any player (Near Player Rate), while the $y$-axis shows the fraction of guesses whose predicted cell overlaps a player bounding box (Near Overlap Rate). For reference, 52.2% of ground-truth balls are near players by the distance threshold, and 20.9% are near players by overlap.
  • ...and 6 more figures
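
The grid annotation in Figure 2 and the two proximity rates in Figure 5 can be sketched as follows. This is a minimal illustration under stated assumptions: the grid is read as 6 rows by 10 columns, distance is measured in pixels from the center of the guessed cell to the nearest edge of a player box, and boxes are `(x_min, y_min, x_max, y_max)` tuples. The captions fix none of these details, and all function names are hypothetical.

```python
import math

GRID_ROWS, GRID_COLS = 6, 10  # assumed orientation of the 6×10 grid


def cell_of(x, y, width, height, rows=GRID_ROWS, cols=GRID_COLS):
    """Map a pixel coordinate to its (row, col) grid cell."""
    col = min(int(x / width * cols), cols - 1)   # clamp points on the right edge
    row = min(int(y / height * rows), rows - 1)  # clamp points on the bottom edge
    return row, col


def cell_box(row, col, width, height, rows=GRID_ROWS, cols=GRID_COLS):
    """Pixel bounding box (x_min, y_min, x_max, y_max) of a grid cell."""
    cw, ch = width / cols, height / rows
    return (col * cw, row * ch, (col + 1) * cw, (row + 1) * ch)


def boxes_overlap(a, b):
    """True if two axis-aligned boxes share a region of positive area."""
    return a[0] < b[2] and b[0] < a[2] and a[1] < b[3] and b[1] < a[3]


def point_to_box_distance(px, py, box):
    """Euclidean distance from a point to a box (0 if the point is inside)."""
    dx = max(box[0] - px, 0.0, px - box[2])
    dy = max(box[1] - py, 0.0, py - box[3])
    return math.hypot(dx, dy)


def proximity_rates(guesses, players, width, height, threshold):
    """Return (near-player rate, near-overlap rate) over a set of images.

    guesses: one (row, col) cell per image.
    players: one list of player boxes per image.
    """
    near = overlap = 0
    for (row, col), boxes in zip(guesses, players):
        box = cell_box(row, col, width, height)
        cx, cy = (box[0] + box[2]) / 2, (box[1] + box[3]) / 2
        if any(point_to_box_distance(cx, cy, b) <= threshold for b in boxes):
            near += 1
        if any(boxes_overlap(box, b) for b in boxes):
            overlap += 1
    n = len(guesses)
    return near / n, overlap / n


# Toy example on a 1000×600 image, where each cell is 100×100 pixels.
example_cell = cell_of(250, 130, 1000, 600)
example_rates = proximity_rates(
    guesses=[(1, 2), (5, 9)],
    players=[[(220, 120, 260, 180)], [(0, 0, 50, 50)]],
    width=1000, height=600, threshold=50,
)
```

Applying the same two functions to the ground-truth ball cells would give reference rates of the kind quoted in the Figure 5 caption, against which each model's guesses can be compared.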