Detailed Object Description with Controllable Dimensions
Xinran Wang, Haiwen Zhang, Baoteng Li, Kongming Liang, Hao Sun, Zhongjiang He, Zhanyu Ma, Jun Guo
TL;DR
This work tackles the problem of aligning object descriptions generated by multimodal LLMs with user-specified object dimensions. It introduces Dimension Tailor (DT), a training-free pipeline with three steps (dimension extracting, erasing, and supplementing) that refines descriptions to emphasize targeted dimensions while removing irrelevant details. To evaluate controllability and validity, the authors propose three metrics: mean Dimensional Recall ($mDR$), mean Dimensional Precision ($mDP$), and mean Dimensional F1 ($mDF1$). They also adopt a GPT-4-based evaluation against expert references. Experiments on open-source MLLMs and a commercial baseline (GPT-4o) show that DT consistently improves dimensional controllability, often bridging the gap to commercial systems, and reveal dimensional biases linked to training data. The approach provides a low-cost, scalable method to produce concise, user-aligned object descriptions, with broad implications for accessibility and human–AI interaction.
Abstract
Object descriptions play an important role in helping visually impaired individuals understand objects and compare the differences between them. Recent multimodal large language models (MLLMs) exhibit powerful perceptual abilities and show strong potential for generating object-centric descriptions. However, the descriptions these models generate often contain content that is irrelevant to the user's intent or omit important details about some object dimensions. In some scenarios, users may need details on only certain dimensions of an object. In this paper, we propose Dimension Tailor, a training-free object description refinement pipeline designed to enhance user-specified details in object descriptions. The pipeline consists of three steps: dimension extracting, erasing, and supplementing, which decompose the description into user-specified dimensions. Dimension Tailor not only improves the quality of object details but also offers the flexibility to include or exclude specific dimensions based on user preferences. We conduct extensive experiments to demonstrate the effectiveness of Dimension Tailor for controllable object description. Notably, the proposed pipeline consistently improves the performance of recent MLLMs. The code is available at https://github.com/xin-ran-w/ControllableObjectDescription.
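To make the extract, erase, supplement flow concrete, the following is a minimal sketch of how the three steps could be chained. It is not taken from the released code: the function `dimension_tailor`, its parameters, and the two injected callables (`extract`, `supplement`) are hypothetical names, and the stub implementations stand in for whatever model calls the actual pipeline makes.

```python
from typing import Callable, Dict, List

# Hypothetical interfaces (assumptions, not the authors' API):
#   extract(description)         -> {dimension: detail} found in the text
#   supplement(image, dimension) -> detail for a dimension missing from the text
Extractor = Callable[[str], Dict[str, str]]
Supplementer = Callable[[str, str], str]


def dimension_tailor(
    description: str,
    image: str,
    wanted: List[str],
    extract: Extractor,
    supplement: Supplementer,
) -> Dict[str, str]:
    """Refine a description so it covers exactly the requested dimensions."""
    # Step 1, extracting: decompose the description into per-dimension details.
    found = extract(description)
    # Step 2, erasing: drop every dimension the user did not ask for.
    kept = {dim: detail for dim, detail in found.items() if dim in wanted}
    # Step 3, supplementing: fill in requested dimensions that were missing.
    for dim in wanted:
        if dim not in kept:
            kept[dim] = supplement(image, dim)
    # Return details in the order the user listed the dimensions.
    return {dim: kept[dim] for dim in wanted}


# Toy stand-ins for the model calls, used only to exercise the control flow.
def toy_extract(description: str) -> Dict[str, str]:
    return {"color": "red", "material": "ceramic"}


def toy_supplement(image: str, dimension: str) -> str:
    return f"<{dimension} of {image}>"


result = dimension_tailor(
    "A red ceramic mug.", "mug.jpg", ["color", "shape"], toy_extract, toy_supplement
)
print(result)
```

With these stubs, the unrequested `material` detail is erased and the missing `shape` detail is supplemented. In practice the two callables would wrap model queries, which this sketch deliberately leaves unspecified.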
