RotBench: Evaluating Multimodal Large Language Models on Identifying Image Rotation

Tianyi Niu, Jaemin Cho, Elias Stengel-Eskin, Mohit Bansal

TL;DR

RotBench systematically probes multimodal LLMs (MLLMs) for their ability to identify image rotation across 0°, 90°, 180°, and 270°, using a 350-image, human-filtered benchmark derived from Spatial-MM. The study finds that most models reliably identify upright (0°) images and that some identify upside-down (180°) images, but none reliably separates 90° from 270°. Auxiliary information and CoT prompting offer only small, inconsistent gains. Showing the image in several orientations at once (a rotation grid) yields moderate gains for reasoning models, and a voting variant helps weaker models, though at the cost of extra compute and of requiring prior knowledge of the candidate rotations. Fine-tuning improves 180° accuracy but induces oscillations between 90° and 270°, suggesting two local optima and fundamental limitations in current visual encoders for rotational reasoning. Overall, the results highlight a notable gap between MLLMs and human perception in spatial orientation, motivating future rotation-aware training and evaluation paradigms.

Abstract

We investigate to what extent Multimodal Large Language Models (MLLMs) can accurately identify the orientation of input images rotated 0°, 90°, 180°, and 270°. This task demands robust visual reasoning capabilities to detect rotational cues and contextualize spatial relationships within images, regardless of their orientation. To evaluate MLLMs on these abilities, we introduce RotBench, a 350-image manually-filtered benchmark comprising lifestyle, portrait, and landscape images. Despite the relatively simple nature of this task, we show that several state-of-the-art open and proprietary MLLMs, including GPT-5, o3, and Gemini-2.5-Pro, do not reliably identify rotation in input images. Providing models with auxiliary information -- including captions, depth maps, and more -- or using chain-of-thought prompting offers only small and inconsistent improvements. Our results indicate that most models are able to reliably identify right-side-up (0°) images, while certain models are able to identify upside-down (180°) images. None can reliably distinguish between 90° and 270° rotated images. Simultaneously showing the image rotated in different orientations leads to moderate performance gains for reasoning models, while a modified setup using voting improves the performance of weaker models. We further show that fine-tuning does not improve models' ability to distinguish 90° and 270° rotations, despite substantially improving the identification of 180° images. Together, these results reveal a significant gap between MLLMs' spatial reasoning capabilities and human perception in identifying rotation.
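
The voting setup mentioned in the abstract is not specified here, so the following is only one plausible reading, written as a minimal sketch: query the model on the input after each additional rotation, map every answer back to the frame of the original input, and take the most common result. The names `vote_rotation`, `toy_rotate`, and `confused_predict` are hypothetical, and the "image" in the demo is reduced to its true orientation so that the example runs without a model.

```python
from collections import Counter

ANGLES = (0, 90, 180, 270)  # candidate rotations, assumed known in advance


def vote_rotation(image, predict, rotate):
    """Majority vote over predictions made on four re-rotated views.

    predict(view) returns one of ANGLES; rotate(image, extra) returns the
    image rotated by `extra` more degrees. Both are supplied by the caller.
    """
    votes = Counter()
    for extra in ANGLES:
        guess = predict(rotate(image, extra))
        # Undo the extra rotation so every vote refers to the original input.
        votes[(guess - extra) % 360] += 1
    return votes.most_common(1)[0][0]


# Toy stand-ins (assumptions, not the paper's models or data):
def toy_rotate(orientation, extra):
    """Here an 'image' is just its true orientation in degrees."""
    return (orientation + extra) % 360


def confused_predict(orientation):
    """A predictor that is always right except that it reports 270 as 90."""
    return 90 if orientation == 270 else orientation


voted = [vote_rotation(a, confused_predict, toy_rotate) for a in ANGLES]
```

In this toy case the single systematic error is outvoted by the three views the predictor gets right. The sketch also makes the costs visible: four model calls per image, and a fixed list of candidate rotations that must be known beforehand.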

Paper Structure

This paper contains 43 sections, 1 equation, 17 figures, 14 tables, 1 algorithm.

Figures (17)

  • Figure 1: We present two RotBench images: one (left) to Gemini-2.5-Pro, the other (right) to GPT-5. Humans can easily identify the correct rotation of the two images, but both models fail to do so.
  • Figure 2: RotBench evaluation pipeline: for each image in RotBench, we rotate the image 0°, 90°, 180°, and 270° counter-clockwise. We represent the rotation estimation problem as a multiple-choice question answering problem (see the prompts appendix), and separately measure accuracy on each image orientation. We optionally provide different forms of auxiliary information to aid the model in identifying image rotation. We emphasize that all forms of auxiliary information are separately extracted for each rotation; the ground truth rotation is not marked.
  • Figure 3: Confusion matrix of true vs. predicted rotations for GPT-4o using CoT prompting, summed across three runs on RotBench-large. Rows represent ground-truth labels, columns represent predicted labels. The matrix highlights a significant confusion specifically between 90° and 270° rotations.
  • Figure 4: GPT-4o answers incorrectly when asked to identify whether the image has been rotated 90° clockwise or counter-clockwise.
  • Figure 5: Qwen-2.5-VL-7B-Instruct's accuracy on different degrees of rotation as training progresses.
  • ...and 12 more figures
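
The evaluation loop described in the Figure 2 caption, together with the true-vs-predicted tally of Figure 3, can be sketched roughly as follows. This is an illustration under stated assumptions, not the authors' code: images are modeled as small 2D grids so the example needs no imaging library, and `rotate_ccw`, `evaluate`, `TOY_IMAGE`, and `toy_predict` are hypothetical names. The toy predictor deliberately mistakes 270° for 90° to mimic the confusion pattern the captions report.

```python
from collections import Counter

ANGLES = (0, 90, 180, 270)  # counter-clockwise rotations, as in Figure 2


def rotate_ccw(grid, angle):
    """Rotate a 2D grid (list of rows) counter-clockwise by a multiple of 90."""
    if angle % 90 != 0:
        raise ValueError("angle must be a multiple of 90")
    for _ in range((angle // 90) % 4):
        # Transpose, then reverse the row order: one 90-degree CCW turn.
        grid = [list(row) for row in zip(*grid)][::-1]
    return [list(row) for row in grid]


def evaluate(images, predict):
    """Return per-orientation accuracy and a true-vs-predicted tally."""
    confusion = {true: Counter() for true in ANGLES}
    for image in images:
        for true in ANGLES:
            confusion[true][predict(rotate_ccw(image, true))] += 1
    accuracy = {}
    for true in ANGLES:
        total = sum(confusion[true].values())
        accuracy[true] = confusion[true][true] / total if total else 0.0
    return accuracy, confusion


# Toy data (assumption): a 2x2 grid whose marker "T" starts at the top-left.
TOY_IMAGE = [["T", "."], [".", "."]]
# Where the marker lands after each counter-clockwise rotation: (row, col).
CORNER_TO_ANGLE = {(0, 0): 0, (1, 0): 90, (1, 1): 180, (0, 1): 270}


def toy_predict(grid):
    """Stand-in for a model: reads the marker, but reports 270 as 90."""
    for r, row in enumerate(grid):
        for c, cell in enumerate(row):
            if cell == "T":
                angle = CORNER_TO_ANGLE[(r, c)]
                return 90 if angle == 270 else angle
    raise ValueError("marker not found")


accuracy, confusion = evaluate([TOY_IMAGE], toy_predict)
```

Reporting accuracy per orientation rather than as a single average keeps a failure on one rotation from being hidden by strong results on the others, and the `confusion` tally shows which wrong answer the predictor gives instead.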