RotBench: Evaluating Multimodal Large Language Models on Identifying Image Rotation
Tianyi Niu, Jaemin Cho, Elias Stengel-Eskin, Mohit Bansal
TL;DR
RotBench probes whether Multimodal LLMs can identify image rotation (0°, 90°, 180°, or 270°) using a 350-image, human-filtered benchmark derived from Spatial-MM. Most models reliably recognize upright (0°) images and some recognize upside-down (180°) images, but none reliably distinguishes 90° from 270°. Auxiliary information and CoT prompting offer only small, inconsistent gains. Showing a grid of the image in several rotations moderately helps reasoning models, and a voting variant helps weaker models, though both cost extra compute and require knowing the candidate rotations in advance. Fine-tuning improves 180° accuracy but induces oscillation between 90° and 270°, suggesting two local optima and fundamental limitations in current visual encoders for rotational reasoning. Overall, the results highlight a notable gap between MLLMs and human perception of spatial orientation, motivating rotation-aware training and evaluation.
Abstract
We investigate to what extent Multimodal Large Language Models (MLLMs) can accurately identify the orientation of input images rotated 0°, 90°, 180°, and 270°. This task demands robust visual reasoning capabilities to detect rotational cues and contextualize spatial relationships within images, regardless of their orientation. To evaluate MLLMs on these abilities, we introduce RotBench, a 350-image manually filtered benchmark comprising lifestyle, portrait, and landscape images. Despite the relatively simple nature of this task, we show that several state-of-the-art open and proprietary MLLMs, including GPT-5, o3, and Gemini-2.5-Pro, do not reliably identify rotation in input images. Providing models with auxiliary information (including captions, depth maps, and more) or using chain-of-thought prompting offers only small and inconsistent improvements. Our results indicate that most models reliably identify right-side-up (0°) images, while certain models can identify upside-down (180°) images. None can reliably distinguish between 90° and 270° rotated images. Simultaneously showing the image rotated in different orientations leads to moderate performance gains for reasoning models, while a modified setup using voting improves the performance of weaker models. We further show that fine-tuning does not improve models' ability to distinguish 90° and 270° rotations, despite substantially improving the identification of 180° images. Together, these results reveal a significant gap between MLLMs' spatial reasoning capabilities and human perception in identifying rotation.
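To make the voting setup concrete, the sketch below shows one plausible reading of it, not the paper's implementation: apply each candidate rotation on top of the input, query the model on every view, map each answer back to the original frame, and take the majority. Everything in it is an assumption made for illustration: the counter-clockwise convention, the names `rotate_ccw`, `vote_rotation`, and `confused_predictor`, and the toy grid that stands in for an image. The stand-in predictor mimics the reported failure mode by answering 90° for both 90° and 270° inputs.

```python
from collections import Counter
from typing import Callable, List

ANGLES = (0, 90, 180, 270)
Grid = List[List[int]]  # toy stand-in for an image (hypothetical)

def rotate_ccw(grid: Grid, angle: int) -> Grid:
    """Rotate a 2D grid counter-clockwise by a multiple of 90 degrees."""
    if angle % 90 != 0:
        raise ValueError("angle must be a multiple of 90")
    out = [row[:] for row in grid]
    for _ in range((angle // 90) % 4):
        # One 90-degree CCW turn: transpose, then reverse the row order.
        out = [list(col) for col in zip(*out)][::-1]
    return out

def vote_rotation(image: Grid, predict: Callable[[Grid], int]) -> int:
    """Majority vote over predictions made on extra-rotated views.

    Each view adds a known extra rotation; subtracting it maps the
    prediction back to the frame of the original input.
    Ties go to the answer that was seen first.
    """
    votes = Counter()
    for extra in ANGLES:
        pred = predict(rotate_ccw(image, extra))
        votes[(pred - extra) % 360] += 1
    return votes.most_common(1)[0][0]

# Hypothetical upright image: a single marker in the top-left corner.
UPRIGHT: Grid = [[1, 0, 0], [0, 0, 0]]

def confused_predictor(grid: Grid) -> int:
    """Toy model: correct on 0 and 180, but answers 90 for 90 and 270."""
    for angle in ANGLES:
        if rotate_ccw(UPRIGHT, angle) == grid:
            return 90 if angle == 270 else angle
    raise ValueError("grid is not a rotation of UPRIGHT")

if __name__ == "__main__":
    for true_angle in ANGLES:
        image = rotate_ccw(UPRIGHT, true_angle)
        print(true_angle, confused_predictor(image),
              vote_rotation(image, confused_predictor))
```

In this toy case the single-query predictor is wrong on 270° inputs, while the vote recovers the right answer because three of the four views land on orientations the predictor handles correctly. The scheme uses four model calls per image and assumes the set of candidate rotations is known beforehand.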
