Assessing Vision-Language Models for Perception in Autonomous Underwater Robotic Software

Muhammad Yousaf; Aitor Arrieta; Shaukat Ali; Paolo Arcaini; Shuai Wang

TL;DR

This work evaluates four open-source Vision-Language Models (VLMs) for perception in autonomous underwater robots, focusing on underwater trash detection under challenging visibility. It uses a zero-shot, single-prompt approach to assess multi-label classification and token-level uncertainty on two datasets, TrashCan1.0 and SeaClear. It then investigates the relationship between performance and uncertainty using the metrics $F1$, $Jaccard$, $MSP$, $PCS$, $ENT$, $DG$, $ECE$, and $MCE$. The key finding is that BLIP offers the best calibration with competitive accuracy, whereas LLaVA is overconfident yet underperforms, and the Vegetation and Animal classes remain difficult for all models. Overall, VLMs should be used as supporting components in AUR software rather than as standalone perception systems. The study emphasizes uncertainty and calibration as critical factors for safety, trustworthiness, and regulatory compliance in maritime applications, and it suggests leveraging VLMs within digital twins and uncertainty-aware testing to improve reliability in real-world deployments.
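
The metric abbreviations are not expanded in this summary. As a rough illustration, the Python sketch below assumes the common readings: MSP as maximum softmax probability, PCS as prediction confidence score (gap between the two highest probabilities), ENT as Shannon entropy, DG as DeepGini, and ECE/MCE as expected and maximum calibration error over equal-width confidence bins. The function names, the bin count, and the binning scheme are assumptions made for illustration and are not taken from the paper.

```python
import math

def msp(probs):
    """Maximum softmax probability: probability of the most likely outcome."""
    return max(probs)

def pcs(probs):
    """Prediction confidence score: gap between the two largest probabilities."""
    top = sorted(probs, reverse=True)
    return top[0] - top[1]

def entropy(probs):
    """Shannon entropy in nats; zero-probability entries contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def deep_gini(probs):
    """DeepGini impurity: 0 for a one-hot distribution, larger when spread out."""
    return 1.0 - sum(p * p for p in probs)

def calibration_errors(confidences, correct, n_bins=10):
    """Return (ECE, MCE) using equal-width confidence bins (assumed scheme)."""
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)  # conf == 1.0 goes in last bin
        bins[idx].append((conf, ok))
    n = len(confidences)
    ece, mce = 0.0, 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        gap = abs(avg_conf - accuracy)
        ece += len(members) / n * gap  # gap weighted by bin population
        mce = max(mce, gap)            # worst bin
    return ece, mce
```

Under these assumptions, MSP and PCS grow with confidence while ENT and DG grow with uncertainty, and a well-calibrated model has ECE and MCE close to zero.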

Abstract

Autonomous Underwater Robots (AURs) operate in challenging underwater environments, including low visibility and harsh water conditions. Such conditions present challenges for software engineers developing perception modules for the AUR software. To carry out perception tasks under these conditions, deep learning has been incorporated into the AUR software. However, underwater environments pose difficulties for deep learning models, which often rely on labeled data that, in this domain, is scarce and noisy. This may undermine the trustworthiness of AUR software that relies on perception modules. Vision-Language Models (VLMs) offer a promising solution for AUR software because they generalize to unseen objects and remain robust in noisy conditions by inferring information from contextual cues. Despite this potential, their performance and uncertainty in underwater environments remain understudied from a software engineering perspective. Motivated by the needs of an industrial partner in assurance and risk management for maritime systems, who wants to assess the potential use of VLMs in this context, we present an empirical evaluation of VLM-based perception modules within the AUR software. We assess their ability to detect underwater trash by computing their performance, their uncertainty, and the relationship between the two, to enable software engineers to select appropriate VLMs for their AUR software.
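
For the performance side of such an evaluation, multi-label predictions can be scored with micro- and macro-averaged F1 and a Jaccard index. The following is a minimal Python sketch under stated assumptions: labels are represented as sets, Jaccard is averaged per sample, and the class names in the usage note are illustrative. The paper's exact averaging choices are not given in this summary.

```python
def f1_scores(y_true, y_pred, labels):
    """Return (micro_f1, macro_f1) for multi-label predictions given as sets."""
    tp = {c: 0 for c in labels}
    fp = {c: 0 for c in labels}
    fn = {c: 0 for c in labels}
    for truth, pred in zip(y_true, y_pred):
        for c in labels:
            if c in pred and c in truth:
                tp[c] += 1
            elif c in pred:
                fp[c] += 1
            elif c in truth:
                fn[c] += 1

    def f1(t, p, n):
        denom = 2 * t + p + n
        return 2 * t / denom if denom else 0.0  # class absent everywhere: 0

    # Micro: pool counts over classes. Macro: average the per-class F1 values.
    micro = f1(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    macro = sum(f1(tp[c], fp[c], fn[c]) for c in labels) / len(labels)
    return micro, macro

def jaccard(y_true, y_pred):
    """Sample-averaged Jaccard index: |truth & pred| / |truth | pred|."""
    scores = []
    for truth, pred in zip(y_true, y_pred):
        union = truth | pred
        scores.append(len(truth & pred) / len(union) if union else 1.0)
    return sum(scores) / len(scores)
```

For example, calling `f1_scores([{"Trash"}, {"Animal", "Vegetation"}], [{"Trash"}, {"Animal"}], ["Trash", "Animal", "Vegetation"])` scores two images against three hypothetical classes. Macro F1 weights every class equally, so a class that is consistently missed pulls it down more than it pulls down micro F1.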

Paper Structure (28 sections, 10 equations, 7 figures, 7 tables)

Introduction
Industrial Context
Experiment Design
Overall Empirical Evaluation Setup
Research Questions
Benchmark Datasets
Experimental Setting
Subject VLMs
Prompt Design
Execution Environment
Evaluation Metrics
RQ1 -- Evaluation Metrics for Performance
RQ2 -- Evaluation Metrics for Uncertainty Quantification
Confidence Metrics
Uncertainty Metrics
...and 13 more sections

Figures (7)

Figure 1: Overview of the Study
Figure 2: RQ1 -- Comparison of overall performance metrics for the VLMs across datasets
Figure 3: RQ2 -- Uncertainty of the VLMs across datasets for each class. Values are normalized and scaled to the same range: higher values indicate higher confidence (MSP and PCS) and lower uncertainty (DG and ENT).
Figure 4: RQ3 -- Comparison of F1 (Micro) vs Uncertainty Metrics for four VLMs across Aggregated Dataset
Figure 5: RQ3 -- Comparison of F1 (Macro) vs Uncertainty Metrics for four VLMs across Aggregated Dataset
...and 2 more figures
