Language-in-the-Loop Culvert Inspection on the Erie Canal
Yash Turkar, Yashom Dighe, Karthik Dantu
TL;DR
This work addresses the safety and feasibility of inspecting aging culverts by introducing VISION, an onboard language-in-the-loop autonomy system that combines a web-scale vision-language model (VLM) with constrained viewpoint planning. The approach enriches open-vocabulary region-of-interest (ROI) proposals with stereo depth to estimate 3D ROI positions and applies a two-stage viewpoint optimization under culvert geometry constraints, implemented on a quadruped platform with a pan-tilt camera. Field experiments beneath the Erie Canal show 61.4% agreement with expert judgments for initial ROI proposals and 80% alignment with ground truth for final assessments, while reducing inspection time to under 90 minutes for a 66 m culvert, compared with typical human inspection cycles. Limitations include reliance on networked VLM access and a small sample, motivating future work toward on-board VLMs for fully autonomous operation, larger-scale studies, and automated inspection reporting.
Abstract
Culverts on canals such as the Erie Canal, originally built in 1825, require frequent inspections to ensure safe operation. Human inspection of culverts is challenging due to age, geometry, poor illumination, weather, and lack of easy access. We introduce VISION, an end-to-end, language-in-the-loop autonomy system that couples a web-scale vision-language model (VLM) with constrained viewpoint planning for autonomous inspection of culverts. Brief prompts to the VLM solicit open-vocabulary region-of-interest (ROI) proposals with rationales and confidences, stereo depth is fused to recover scale, and a planner aware of culvert constraints commands repositioning moves to capture targeted close-ups. Deployed on a quadruped in a culvert under the Erie Canal, VISION closes the see, decide, move, re-image loop on-board and produces high-resolution images for detailed reporting without domain-specific fine-tuning. In an external evaluation by New York Canal Corporation personnel, initial ROI proposals achieved 61.4% agreement with subject-matter experts, and final post-re-imaging assessments reached 80%, indicating that VISION converts tentative hypotheses into grounded, expert-aligned findings.
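To make the depth-fusion step concrete, the following is a minimal sketch of how a 2D ROI proposal might be lifted to a 3D position using a stereo depth map and the standard pinhole camera model. It is an illustration under stated assumptions, not the authors' implementation: the function name `roi_to_3d`, the pixel-space box format, the intrinsics `fx`, `fy`, `cx`, `cy`, and the use of a median over valid depths are all hypothetical choices.

```python
import numpy as np

def roi_to_3d(depth_m, box, fx, fy, cx, cy):
    """Back-project the centre of a pixel-space ROI to a 3D point (camera frame).

    depth_m : HxW array of metric depth from stereo (assumed format).
    box     : (u0, v0, u1, v1) pixel bounds of the ROI (assumed format).
    fx, fy, cx, cy : pinhole intrinsics in pixels (assumed known).
    Returns a length-3 array [x, y, z] in metres, or None if no valid depth.
    """
    u0, v0, u1, v1 = box
    patch = depth_m[v0:v1, u0:u1]
    # Keep only finite, positive depths; stereo maps often contain holes.
    valid = patch[np.isfinite(patch) & (patch > 0)]
    if valid.size == 0:
        return None
    # Median is an assumed robust summary of the ROI's depth.
    z = float(np.median(valid))
    # Pinhole back-projection of the box centre.
    u = 0.5 * (u0 + u1)
    v = 0.5 * (v0 + v1)
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    return np.array([x, y, z])

# Usage with synthetic data: a flat surface 2 m from the camera.
depth = np.full((480, 640), 2.0)
point = roi_to_3d(depth, (400, 240, 440, 280), fx=500.0, fy=500.0, cx=320.0, cy=240.0)
```

A 3D point of this kind is the sort of quantity a viewpoint planner could use to choose a close-up pose; how VISION actually aggregates depth and parameterizes viewpoints is not specified in this summary.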
