Chain-of-Caption: Training-free improvement of multimodal large language model on referring expression comprehension

Yik Lung Pang, Changjae Oh

TL;DR

This work tackles referring expression comprehension (REC) with multimodal LLMs by systematically studying textual and visual contexts provided through tool use. It introduces Chain-of-Caption, a training-free framework that grounds textual concepts with bounding boxes, crops image regions, and leverages VQA and captioning to iteratively refine predictions, without fine-tuning. Experiments on RefCOCO, RefCOCO+, RefCOCOg, and Ref-L4 show that grounding descriptions yields the strongest gains at high IoU thresholds, and that combining textual grounding with visual refinement substantially enhances localization, achieving competitive results across model sizes. The approach demonstrates practical improvements in REC and highlights the value of in-context, training-free reasoning for grounding tasks in MLLMs.

Abstract

Given a textual description, the task of referring expression comprehension (REC) involves the localisation of the referred object in an image. Multimodal large language models (MLLMs) have achieved high accuracy on REC benchmarks by scaling up model size and training data. Moreover, the performance of MLLMs can be further improved using techniques such as Chain-of-Thought and tool use, which provide additional visual or textual context to the model. In this paper, we analyse various techniques for providing additional visual and textual context to the MLLM via tool use, and their effect on the REC task. Furthermore, we propose a training-free framework named Chain-of-Caption to improve the REC performance of MLLMs. We perform experiments on the RefCOCO, RefCOCOg, RefCOCO+ and Ref-L4 datasets and show that individual textual or visual contexts can improve REC performance without any fine-tuning. By combining multiple contexts, our training-free framework achieves a performance gain of between 5% and 30% over the baseline model on accuracy at various Intersection over Union (IoU) thresholds.
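The metric named in the abstract, accuracy at an IoU threshold, can be sketched as follows. This is a minimal illustration and not the authors' evaluation code: the function names `iou` and `accuracy_at_iou` are hypothetical, boxes are assumed to be in `[x1, y1, x2, y2]` format, and a prediction is assumed to count as correct when its IoU with the ground truth is greater than or equal to the threshold (some benchmarks use a strict inequality instead).

```python
from typing import Sequence

Box = Sequence[float]  # assumed format: [x1, y1, x2, y2]


def iou(a: Box, b: Box) -> float:
    """Intersection over Union of two axis-aligned boxes."""
    # Corners of the overlapping rectangle (may be empty).
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)

    area_a = max(0.0, a[2] - a[0]) * max(0.0, a[3] - a[1])
    area_b = max(0.0, b[2] - b[0]) * max(0.0, b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def accuracy_at_iou(preds: Sequence[Box], gts: Sequence[Box], threshold: float) -> float:
    """Fraction of predictions whose IoU with the ground truth meets the threshold.

    Assumption: a hit is IoU >= threshold.
    """
    if len(preds) != len(gts):
        raise ValueError("preds and gts must have the same length")
    if not preds:
        return 0.0
    hits = sum(iou(p, g) >= threshold for p, g in zip(preds, gts))
    return hits / len(preds)
```

Evaluating the same predictions at several thresholds (for example 0.5, 0.7 and 0.9) shows how localisation quality changes as the overlap requirement tightens, which is why gains at high thresholds indicate tighter boxes rather than merely correct object selection.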

Paper Structure

This paper contains 10 sections, 6 equations, 5 figures, and 2 tables.

Figures (5)

  • Figure 1: We leverage the multi-task capability of MLLMs, e.g. grounded description generation, referring expression comprehension (REC), visual question answering (VQA), and captioning, to propose a training-free framework for improving REC performance.
  • Figure 2: Example textual and visual contexts for referring expression comprehension. Bounding boxes are in normalised coordinates in the format [top-left x, top-left y, bottom-right x, bottom-right y].
  • Figure 3: Our proposed training-free framework for the task of referring expression comprehension. We first initialise the grounded description using the MLLM model. We then refine the predicted bounding box using the multitask capabilities of the MLLM.
  • Figure 4: Accuracy improves on RefCOCO when including additional objects in the grounded description as context. Legend: Acc0.7, Acc0.9, w/o grounded description, w/ grounded description.
  • Figure 5: Chain-of-Caption refines the predicted bounding box of the base model. Legend: predicted bounding box, ground-truth bounding box.
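
The box format from Figure 2 and the region-cropping step mentioned in the TL;DR can be connected with a small helper. The sketch below is illustrative only and is not taken from the paper: `to_pixel_box` is a hypothetical name, and the code assumes that "normalised" means coordinates in the range [0, 1] relative to image width and height (some models use other ranges, such as 0 to 1000).

```python
import math
from typing import Sequence, Tuple


def to_pixel_box(
    box: Sequence[float], width: int, height: int
) -> Tuple[int, int, int, int]:
    """Convert a normalised [x1, y1, x2, y2] box to integer pixel coordinates.

    Assumption: coordinates are fractions of image width/height in [0, 1].
    Values outside that range are clamped to the image border.
    """
    x1, y1, x2, y2 = (min(max(float(v), 0.0), 1.0) for v in box)
    if x2 <= x1 or y2 <= y1:
        raise ValueError("box has no area after clamping")

    # Round outwards so the crop never cuts into the referred region.
    left = math.floor(x1 * width)
    top = math.floor(y1 * height)
    right = math.ceil(x2 * width)
    bottom = math.ceil(y2 * height)
    return left, top, right, bottom
```

The returned tuple follows the (left, upper, right, lower) order that Pillow's `Image.crop` accepts, so a cropped region could be passed back to a model as additional visual context.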