Detailed Object Description with Controllable Dimensions

Xinran Wang, Haiwen Zhang, Baoteng Li, Kongming Liang, Hao Sun, Zhongjiang He, Zhanyu Ma, Jun Guo

TL;DR

This work tackles the problem of aligning object descriptions generated by multimodal LLMs with user-specified object dimensions. It introduces Dimension Tailor (DT), a training-free pipeline consisting of dimension extracting, erasing, and supplementing that refines descriptions to emphasize targeted dimensions while removing irrelevant details. To evaluate controllability and validity, the authors propose three metrics: mean Dimensional Recall ($mDR$), mean Dimensional Precision ($mDP$), and mean Dimensional F1 ($mDF1$). They also adopt a GPT-4-based evaluation against expert references. Experiments on open-source MLLMs and a commercial baseline (GPT-4o) show that DT consistently improves dimensional controllability, often bridging the gap to commercial systems, and reveal dimensional biases linked to training data. The approach provides a low-cost, scalable method to produce concise, user-aligned object descriptions, with broad implications for accessibility and human–AI interaction.

Abstract

Object descriptions play an important role in helping visually impaired individuals understand and compare objects. Recent multimodal large language models (MLLMs) exhibit powerful perceptual abilities and show impressive potential for generating object-centric descriptions. However, the descriptions these models generate often contain content that is irrelevant to the user's intent or omit important details about object dimensions. In some scenarios, users may need details about only certain dimensions of an object. In this paper, we propose Dimension Tailor, a training-free object description refinement pipeline designed to enhance user-specified details in object descriptions. The pipeline comprises three steps: dimension extracting, erasing, and supplementing, which decompose the description into user-specified dimensions. Dimension Tailor not only improves the quality of object details but also offers flexibility in including or excluding specific dimensions based on user preferences. We conduct extensive experiments to demonstrate the effectiveness of Dimension Tailor on controllable object description. Notably, the proposed pipeline consistently improves the performance of recent MLLMs. The code is available at https://github.com/xin-ran-w/ControllableObjectDescription.

Paper Structure

This paper contains 14 sections, 5 equations, 8 figures, 4 tables, 1 algorithm.

Figures (8)

  • Figure 1: Dimensional controllable object description. In real-world scenarios, users need object descriptions that focus on specific dimensions of interest. However, existing multimodal large models often overlook the dimension information users need or include irrelevant information, resulting in descriptions that do not align with user preferences. Applying Dimension Tailor to detailed object descriptions generated by MLLMs yields refined descriptions that are better aligned with the user-specified dimensions, reducing redundancy and focusing on the desired object dimensions.
  • Figure 2: Diagram of our description refinement pipeline, Dimension Tailor. The top half shows the full flow of Dimension Tailor. The bottom half shows detailed diagrams of its three key steps. $\mathcal{U}$ denotes the user-required dimensions and $\mathcal{U}^*$ denotes the dimensions contained in the MLLM-generated description.
  • Figure 3: Controllability evaluation results of all open-source MLLMs.
  • Figure 4: The DR of several common object-dimension combinations and their frequency in the LLaVAv1.5 instruction-tuning dataset. DR and frequency are positively correlated across object-dimension combinations.
  • Figure 5: Visualization of controllable object descriptions generated by multiple MLLMs. $|\tilde{\mathcal{U}}|$ represents the number of user-specified dimensions covered, where $\tilde{\mathcal{U}} = \mathcal{U}^* \cap \mathcal{U}$. The underlined text highlights unintended dimensions in the descriptions.
  • ...and 3 more figures
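
The set notation in the captions of Figures 2 and 5 suggests how the dimensional metrics might be computed. The sketch below is illustrative only: this section does not give the paper's exact formulas, so it assumes the standard set-based definitions, with recall as $|\tilde{\mathcal{U}}| / |\mathcal{U}|$, precision as $|\tilde{\mathcal{U}}| / |\mathcal{U}^*|$, and the "mean" variants averaged per sample. The function names `dimensional_scores` and `mean_scores` are hypothetical and do not come from the authors' repository.

```python
from statistics import mean


def dimensional_scores(user_dims, described_dims):
    """Return (recall, precision, f1) for a single description.

    user_dims      -- dimensions the user asked for (U in the captions)
    described_dims -- dimensions found in the description (U*)
    Assumption: standard set-based precision/recall, not confirmed
    by the text of this section.
    """
    user, described = set(user_dims), set(described_dims)
    covered = user & described  # U-tilde = U* ∩ U
    recall = len(covered) / len(user) if user else 0.0
    precision = len(covered) / len(described) if described else 0.0
    # F1 is the harmonic mean; it is zero when nothing is covered.
    f1 = 2 * precision * recall / (precision + recall) if covered else 0.0
    return recall, precision, f1


def mean_scores(samples):
    """Average each score over (user_dims, described_dims) pairs.

    Assumption: the mean F1 is the mean of per-sample F1 values.
    """
    scores = [dimensional_scores(u, d) for u, d in samples]
    return tuple(mean(column) for column in zip(*scores))


if __name__ == "__main__":
    samples = [
        ({"color", "material"}, {"color", "shape"}),  # one unintended dimension
        ({"color"}, {"color"}),                       # perfectly aligned
    ]
    m_dr, m_dp, m_df1 = mean_scores(samples)
    print(f"mDR={m_dr:.2f} mDP={m_dp:.2f} mDF1={m_df1:.2f}")
```

Under these assumptions, a description that mentions unrequested dimensions lowers precision, while one that omits requested dimensions lowers recall, which matches the erasing and supplementing steps the pipeline is described as performing.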