Language-in-the-Loop Culvert Inspection on the Erie Canal

Yash Turkar, Yashom Dighe, Karthik Dantu

TL;DR

This work addresses the safety and feasibility of inspecting aging culverts by introducing VISION, an onboard language-in-the-loop autonomous system that combines a web-scale vision-language model with constrained viewpoint planning. The approach uses open-vocabulary ROI proposals enriched with stereo depth to estimate 3D ROI positions and applies a two-stage viewpoint optimization under culvert geometry, implemented on a quadruped platform with a pan-tilt camera system. Field experiments beneath the Erie Canal show that VISION achieves reasonable alignment with expert judgments (initial ROI agreement 61.4% and final ground-truth alignment 80%), while dramatically reducing inspection time to under 90 minutes for a 66 m culvert compared to typical human cycles. Limitations include reliance on networked VLM access and a small sample, motivating future work toward on-board, fully autonomous VLMs, larger-scale studies, and automated inspection reporting.

Abstract

Culverts on canals such as the Erie Canal, originally built in 1825, require frequent inspections to ensure safe operation. Human inspection of culverts is challenging due to age, geometry, poor illumination, weather, and lack of easy access. We introduce VISION, an end-to-end, language-in-the-loop autonomy system that couples a web-scale vision-language model (VLM) with constrained viewpoint planning for autonomous inspection of culverts. Brief prompts to the VLM solicit open-vocabulary ROI proposals with rationales and confidences, stereo depth is fused to recover scale, and a planner aware of culvert constraints commands repositioning moves to capture targeted close-ups. Deployed on a quadruped in a culvert under the Erie Canal, VISION closes the see, decide, move, re-image loop on-board and produces high-resolution images for detailed reporting without domain-specific fine-tuning. In an external evaluation by New York Canal Corporation personnel, initial ROI proposals achieved 61.4% agreement with subject-matter experts, and final post-re-imaging assessments reached 80%, indicating that VISION converts tentative hypotheses into grounded, expert-aligned findings.
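
The step in which stereo depth is fused with a 2D ROI proposal to recover scale can be illustrated with a standard pinhole back-projection. The sketch below is not the authors' implementation: the function name `roi_to_3d`, the box format, the use of the median depth inside the box, and the intrinsics (`fx`, `fy`, `cx`, `cy`) are all assumptions made for illustration.

```python
import numpy as np

def roi_to_3d(depth_m, box, fx, fy, cx, cy):
    """Back-project the center of an ROI box to a 3D point in the camera frame.

    Hypothetical helper (not from the paper). Assumes a pinhole camera,
    a depth image in meters aligned with the color image, and a box given
    as (u_min, v_min, u_max, v_max) in pixel coordinates.
    """
    u0, v0, u1, v1 = box
    patch = depth_m[v0:v1, u0:u1]
    # Keep only usable depth readings; stereo depth often has holes.
    valid = patch[np.isfinite(patch) & (patch > 0)]
    if valid.size == 0:
        return None  # no reliable depth inside this ROI
    # Assumption: the median is used as a robust depth for the whole box.
    z = float(np.median(valid))
    # Box center in pixels.
    u = 0.5 * (u0 + u1)
    v = 0.5 * (v0 + v1)
    # Pinhole model: x = (u - cx) * z / fx, y = (v - cy) * z / fy.
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    return np.array([x, y, z])

# Example with synthetic data: a flat surface 2 m in front of the camera.
depth = np.full((480, 640), 2.0)
point = roi_to_3d(depth, (400, 220, 440, 260), fx=500.0, fy=500.0, cx=320.0, cy=240.0)
print(point)
```

The resulting point is expressed in the camera frame; a real system would still need to transform it into the frame in which viewpoints are planned.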

Paper Structure

This paper contains 15 sections, 13 equations, 5 figures, 1 table.

Figures (5)

  • Figure 1: Sample from an official NYCC inspection report
  • Figure 2: Qualitative comparison across degradations and baselines. Left: input culvert image and our VLM with a sparse prompt producing ROI proposals (red boxes with reasons) and calibrated follow-up probabilities (shown in parentheses; sum to 1.00). Right: results for three degradation modes—rust, ice/debris, and scaling—under three open-vocabulary baselines: Lang-SAM, Grounding DINO, and Grounding SAM. Lang-SAM often over-segments large portions of the barrel; Grounding DINO yields coarse boxes; Grounding SAM converts those boxes to masks but still spreads labels broadly. Our method localizes discrete issues (e.g., rust staining at corrugations, ice/debris on the invert, mineral scaling near the crown) to guide targeted re-imaging.
  • Figure 3: Overview of the VISION inspection pipeline. At each global waypoint, a query image is analyzed to extract ROIs and solve for next-best viewpoints. Callout: for each ROI, the planner executes a local waypoint, commands the pan–tilt gimbal angles, and captures the inspection image. The robot then moves on to the next global waypoint.
  • Figure 4: Coordinate frames used in our system: a culvert-fixed world frame $\{x_\mathcal{S},y_\mathcal{S},z_\mathcal{S}\}$, and the gimbal $\{x_\mathcal{G},y_\mathcal{G},z_\mathcal{G}\}$ and camera $\{x_\mathcal{C},y_\mathcal{C},z_\mathcal{C}\}$ frames mounted on the legged robot; axes are colored x=red, y=green, z=blue. The gimbal axes are aligned with the world axes only in the neutral position.
  • Figure 5: VLM ROI proposals on a query image frame (ROIs 1, 2, 3) with rationales and descriptions