Is 'Right' Right? Enhancing Object Orientation Understanding in Multimodal Large Language Models through Egocentric Instruction Tuning

Ji Hyeok Jung, Eun Tae Kim, Seoyeon Kim, Joo Ho Lee, Bumsoo Kim, Buru Chang

TL;DR

This paper tackles the problem that multimodal LLMs (MLLMs) struggle to interpret object orientation due to inconsistent training annotations. It introduces Egocentric Instruction Tuning, which aligns orientation understanding with the user’s egocentric perspective by creating a consistent eight-class annotation scheme and generating LLaVA-style instruction data with three complementary response types. Complementing this method, EgoOrientBench provides a large-scale, cross-domain benchmark across three tasks to evaluate orientation understanding. Experimental results show that egocentric instruction tuning significantly improves orientation comprehension while preserving overall MLLM performance, and ablation studies reveal the contribution and synergy of each data type. The work demonstrates practical benefits for real-world applications such as pedestrian direction prediction and spatial reasoning, advancing safer and more user-aligned multimodal AI systems.

Abstract

Multimodal large language models (MLLMs) act as essential interfaces, connecting humans with AI technologies in multimodal applications. However, current MLLMs face challenges in accurately interpreting object orientation in images due to inconsistent orientation annotations in training data, hindering the development of a coherent orientation understanding. To overcome this, we propose egocentric instruction tuning, which aligns MLLMs' orientation understanding with the user's perspective, based on a consistent annotation standard derived from the user's egocentric viewpoint. We first generate egocentric instruction data that leverages MLLMs' ability to recognize object details and applies prior knowledge for orientation understanding. Using this data, we perform instruction tuning to enhance the model's capability for accurate orientation interpretation. In addition, we introduce EgoOrientBench, a benchmark that evaluates MLLMs' orientation understanding across three tasks using images collected from diverse domains. Experimental results on this benchmark show that egocentric instruction tuning significantly improves orientation understanding without compromising overall MLLM performance. The instruction data and benchmark dataset are available on our project page at https://github.com/jhCOR/EgoOrientBench.

Paper Structure

This paper contains 28 sections, 8 figures, 11 tables.

Figures (8)

  • Figure 1: Examples of LLaVA [liu2024visual] responses to prompts asking about the orientation of objects in an image. Current MLLMs show a significant lack of understanding when interpreting the orientation of given objects.
  • Figure 2: Examples of inconsistent annotations for object orientation. In image-text pairs used for training MLLMs, such as those in MSCOCO [lin2014microsoft] and LAION-5B [schuhmann2022laion], annotations for object orientation lack consistency. For instance, (a) objects facing different orientations may be annotated as facing the same orientation, or conversely, (b) objects facing the same orientation may be annotated differently. This variation arises because, without a standardized guideline for object orientation, annotations can vary depending on individual perspectives or cultural backgrounds [levinson1996frames]. Our study aims to improve MLLMs' understanding of object orientation by aligning it with the user's egocentric perspective through instruction tuning.
  • Figure 3: An example of egocentric instruction data designed to enhance MLLMs' understanding of object orientation. This data leverages the model's intrinsic ability to recognize object details (Response Type 1) and the LLM's prior knowledge to link these details to specific orientations (Response Type 2). Furthermore, by engaging in object alignment tasks that require understanding the relationships between different orientations (Response Type 3), MLLMs’ comprehension of object orientation is further improved.
  • Figure 4: Our benchmark data examples. Our benchmark consists of data collected from various image domains to assess the applicability of MLLMs in terms of orientation understanding. The collected data is annotated across eight orientation classes.
  • Figure 5: Confusion matrix for the Choose task with LLaVA and mPLUG-Owl2. Zero-shot MLLMs show extreme bias toward the Front class (F) (or the Front Right (FR) class). Our proposed egocentric instruction tuning relieves this bias and enhances the understanding of object orientation.
  • ...and 3 more figures
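
The class bias described in the Figure 5 caption can be made concrete with a small confusion-matrix sketch. This is only an illustration, not the authors' evaluation code. The text names just two of the eight classes, Front (F) and Front Right (FR), so the other six labels below are assumptions, and the predictions are invented toy data rather than results from the paper.

```python
# Assumed label set: only "F" (Front) and "FR" (Front Right) appear in the
# text; the remaining six abbreviations are hypothetical placeholders.
CLASSES = ["F", "FR", "R", "BR", "B", "BL", "L", "FL"]


def confusion_matrix(y_true, y_pred, classes=CLASSES):
    """Return a matrix where rows are true classes and columns are predictions."""
    index = {c: i for i, c in enumerate(classes)}
    matrix = [[0] * len(classes) for _ in classes]
    for t, p in zip(y_true, y_pred):
        matrix[index[t]][index[p]] += 1
    return matrix


def predicted_share(matrix, classes=CLASSES):
    """Fraction of all predictions that fall into each class (column sums)."""
    total = sum(sum(row) for row in matrix)
    return {c: sum(row[j] for row in matrix) / total for j, c in enumerate(classes)}


# Toy data (invented): one image per class, with a model that mostly answers "F".
y_true = ["F", "R", "B", "L", "FR", "BL", "BR", "FL"]
y_pred = ["F", "F", "F", "L", "FR", "F", "F", "F"]

cm = confusion_matrix(y_true, y_pred)
share = predicted_share(cm)
accuracy = sum(cm[i][i] for i in range(len(CLASSES))) / len(y_true)

print(f"share of predictions that are 'F': {share['F']:.2f}")
print(f"accuracy: {accuracy:.3f}")
```

With a balanced test set, an unbiased model would put roughly one eighth of its predictions in each column. A column share far above that, as for "F" in this toy data, is the kind of bias the caption refers to.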