Composing Open-domain Vision with RAG for Ocean Monitoring and Conservation

Sepand Dyanatkar, Angran Li, Alexander Dungate

TL;DR

This work tackles the difficulty of applying vision-based monitoring to diverse, real-world marine environments where traditional models struggle with generalization and unseen species. It introduces a bottom-up, open-domain framework that fuses vision-language models (VLMs) with retrieval-augmented generation (RAG) and grounds predictions using an image-embedding vector store, demonstrated on on-board fish classification with the Fishnet dataset. The key contributions include a minimal visual RAG architecture leveraging image-embedding keys, empirical evidence that retrieval improves classification accuracy without task-specific training, and a roadmap for deploying scalable ocean monitoring to aid climate adaptation and conservation. The approach holds promise for rapid adaptation to new species and contexts, enabling more informed management of fisheries, invasive species, and ecosystem health under climate change.

Abstract

Climate change's destruction of marine biodiversity is threatening communities and economies around the world that rely on healthy oceans for their livelihoods. The challenge of applying computer vision to niche, real-world domains such as ocean conservation lies in the dynamic and diverse environments where traditional top-down learning struggles with long-tailed distributions, generalization, and domain transfer. Scalable species identification for ocean monitoring is particularly difficult due to the need to adapt models to new environments and identify rare or unseen species. To overcome these limitations, we propose leveraging bottom-up, open-domain learning frameworks as a resilient, scalable solution for image and video analysis in marine applications. Our preliminary demonstration uses pretrained vision-language models (VLMs) combined with retrieval-augmented generation (RAG) as grounding, leaving the door open for numerous architectural, training, and engineering optimizations. We validate this approach through a preliminary application in classifying fish from video onboard fishing vessels, demonstrating impressive emergent retrieval and prediction capabilities without domain-specific training or knowledge of the task itself.
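The retrieval step described above (an image-embedding vector store whose retrieved entries ground the language model's prediction) might look roughly like the sketch below. It is an illustration under stated assumptions, not the authors' implementation: the `Entry` record, the three-dimensional embeddings, the species labels, and the prompt wording are all hypothetical, and the image encoder and the language model call are left out entirely.

```python
import math
from dataclasses import dataclass


@dataclass
class Entry:
    """One vector-store record: an image-embedding key and a text value."""
    key: list[float]   # embedding of a reference image (toy values, assumed)
    label: str         # species label of the reference image (assumed)
    description: str   # text passed to the language model as grounding


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve(store: list[Entry], query: list[float], k: int = 2) -> list[Entry]:
    """Return the k entries whose keys are most similar to the query embedding."""
    ranked = sorted(store, key=lambda e: cosine(e.key, query), reverse=True)
    return ranked[:k]


def build_prompt(retrieved: list[Entry], question: str) -> str:
    """Concatenate retrieved descriptions with the question as grounding context."""
    context = "\n".join(f"- {e.label}: {e.description}" for e in retrieved)
    return f"Reference descriptions:\n{context}\n\nQuestion: {question}"


# Toy vector store; real keys would come from a pretrained image encoder.
store = [
    Entry([1.0, 0.0, 0.0], "tuna", "streamlined body, dark blue back"),
    Entry([0.0, 1.0, 0.0], "shark", "grey body, prominent dorsal fin"),
    Entry([0.9, 0.1, 0.0], "tuna", "silver belly, crescent-shaped tail"),
]

query = [0.95, 0.05, 0.0]  # embedding of the image to classify (assumed)
top = retrieve(store, query, k=2)
prompt = build_prompt(top, "Which species is shown in the image?")
print(prompt)
```

In this sketch only text is concatenated into the prompt. A fuller version would also pass the image tokens to the VLM, and adding a new species would only require inserting new entries into `store`, with no retraining.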

Paper Structure

This paper contains 12 sections, 6 figures, 1 table.

Figures (6)

  • Figure 1: Architecture of visual RAG. The small pentagons with different colours represent tokens. They are concatenated as input into a language model to generate the final prediction.
  • Figure 2: Example input image and QA with RAG retrieved description (not shown in figure) and without RAG (category list provided but not shown). Images are often low resolution and partly occluded.
  • Figure 3: Precision and recall by category in different experiment settings.
  • Figure 4: Top-k accuracy for the RAG retrieval process.
  • Figure 5: Image embedding 2D visualization of vector store and test set.
  • ...and 1 more figure
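
For readers unfamiliar with the metric named in Figure 4, top-k retrieval accuracy is commonly computed as the fraction of queries whose true label appears among the first k retrieved labels. The sketch below assumes that common definition, since the excerpt here does not spell out the paper's exact procedure. The function name and the toy labels are hypothetical.

```python
def top_k_accuracy(ranked_labels: list[list[str]], truths: list[str], k: int) -> float:
    """Fraction of queries whose true label is among the first k retrieved labels."""
    if not truths:
        return 0.0
    hits = sum(truth in ranked[:k] for ranked, truth in zip(ranked_labels, truths))
    return hits / len(truths)


# Toy data (assumed): retrieved labels per query, ordered most similar first.
ranked = [
    ["tuna", "shark", "marlin"],
    ["shark", "tuna", "marlin"],
    ["marlin", "shark", "tuna"],
]
truths = ["tuna", "tuna", "tuna"]

for k in (1, 2, 3):
    print(k, top_k_accuracy(ranked, truths, k))
```

Under this definition accuracy can only stay equal or increase as k grows, because a larger k gives each query more chances to include the true label.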