Chain-of-Look Spatial Reasoning for Dense Surgical Instrument Counting

Rishikesh Bhyri, Brian R Quaranto, Philip J Seger, Kaity Tung, Brendan Fox, Gene Yang, Steven D. Schwaitzberg, Junsong Yuan, Nan Xi, Peter C W Kim

TL;DR

This work tackles the challenge of counting densely packed surgical instruments by introducing Chain-of-Look Spatial Reasoning (CoLSR), which imposes a structured visual counting chain and spatial constraints to mimic human sequential counting. A Visual Chain Generator, augmented with class-specific prompts, and a neighboring loss enforce coherent spatial ordering, improving robustness in high-density scenes. The authors introduce SurgCount-HD, a 1,464-image dataset of densely arranged instrument handles, and demonstrate that CoLSR surpasses state-of-the-art counting methods and multimodal LLMs in both accuracy (MAE≈0.88, RMSE≈1.27) and speed (real-time mobile inference). Combined ablations, analysis, and extended evaluations show the approach benefits from CSL prompts, visual exemplars, and the neighboring loss, with potential applicability to other dense-object counting tasks in medical and industrial settings.
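The two accuracy figures quoted above, MAE and RMSE, are the standard per-image counting error metrics. The sketch below shows how they are conventionally computed from predicted and ground-truth counts. The function names and the sample counts are illustrative assumptions, not values or code from the paper.

```python
import math

def mae(predicted, ground_truth):
    """Mean absolute error between per-image predicted and true counts."""
    if len(predicted) != len(ground_truth) or not predicted:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(p - t) for p, t in zip(predicted, ground_truth)) / len(predicted)

def rmse(predicted, ground_truth):
    """Root mean squared error; penalizes large miscounts more than MAE does."""
    if len(predicted) != len(ground_truth) or not predicted:
        raise ValueError("inputs must be non-empty and of equal length")
    squared = sum((p - t) ** 2 for p, t in zip(predicted, ground_truth))
    return math.sqrt(squared / len(predicted))

# Hypothetical counts for four images (not taken from SurgCount-HD).
predicted = [57, 42, 30, 61]
ground_truth = [57, 40, 31, 60]

print(f"MAE:  {mae(predicted, ground_truth):.2f}")
print(f"RMSE: {rmse(predicted, ground_truth):.2f}")
```

Because RMSE squares each error before averaging, a single badly miscounted tray raises it much more than it raises MAE, which is why both metrics are usually reported together.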

Abstract

Accurate counting of surgical instruments in operating rooms (ORs) is a critical prerequisite for ensuring patient safety during surgery. Despite recent progress in large vision-language models and agentic AI, accurately counting such instruments remains highly challenging, particularly in dense scenarios where instruments are tightly clustered. To address this problem, we introduce Chain-of-Look, a novel visual reasoning framework that mimics the sequential human counting process by enforcing a structured visual chain, rather than relying on classic object detection, which is unordered. This visual chain guides the model to count along a coherent spatial trajectory, improving accuracy in complex scenes. To further enforce the physical plausibility of the visual chain, we introduce the neighboring loss function, which explicitly models the spatial constraints inherent to densely packed surgical instruments. We also present SurgCount-HD, a new dataset comprising 1,464 high-density surgical instrument images. Extensive experiments demonstrate that our method outperforms state-of-the-art counting approaches (e.g., CountGD, REC) as well as Multimodal Large Language Models (e.g., Qwen, ChatGPT) on the challenging task of dense surgical instrument counting.

Paper Structure (41 sections, 13 equations, 16 figures, 8 tables, 1 algorithm)

Figures (16)

  • Figure 1: High-density surgical instrument counting. Counting surgical instruments reliably in high-density scenarios is challenging due to severe visual clutter and tight spatial packing of objects. To improve robustness, we propose Chain-of-Look Spatial Reasoning, which introduces visual chains into the counting process and explicitly models the sequential nature of human visual counting. In the figure, the first column shows the original high-density surgical instrument images, the second column presents the visual chains, and the third column shows the predicted counting results, where detected surgical instrument handles are highlighted with laser points.
  • Figure 2: (A) Representative images from the SurgCount-HD dataset. Sample images from the dataset, showing typical variations and an example annotation. (B) Test result from GPT-5. We evaluate GPT-5 on an example from our SurgCount-HD dataset, where detected surgical instruments are highlighted with red dots. GPT-5 predicts a count of 84, whereas the ground truth is 57.
  • Figure 3: Architecture of the Chain-of-Look Spatial Reasoning framework. High-density surgical instrument images are first fed into the Visual Chain Generator to produce visual chains. The neighboring loss is then applied to guide the counting process along the visual chain.
  • Figure 4: Visual Chain Generator and neighboring loss function. (a) Detailed architecture of the Visual Chain Generator; (b) Neighboring loss and distance loss. Detailed illustrations of the architecture can be found in the architecture section (sec:arch).
  • Figure 5: Qualitative results. We present qualitative results from our CoLSR. The predicted and ground-truth instrument counts are listed on each image. Detected surgical instrument handles are highlighted with laser points and red bounding boxes.
  • ...and 11 more figures