FathomGPT: A Natural Language Interface for Interactively Exploring Ocean Science Data
Nabin Khanal, Chun Meng Yu, Jui-Cheng Chiu, Anav Chaudhary, Ziyue Zhang, Kakani Katija, Angus G. Forbes
TL;DR
FathomGPT is a natural language interface for free-form exploration of ocean science data in FathomNet. It combines a prompt evaluator, knowledge-graph-based name resolution, specialized text-to-SQL models, and Plotly visualizations. The system emphasizes low latency, robust error handling, and session memory to support interactive data exploration, including search by image or by a specified pattern within an image. Through ablation studies and name resolution benchmarks, the authors demonstrate improvements from fine-tuning and context-aware prompting, and show that the knowledge-graph-based approach outperforms alternatives such as GPT-4o and vector embeddings for name resolution. The work also presents usage scenarios and workshop feedback, arguing that FathomGPT can accelerate marine data analysis and could generalize to other scientific databases and domains.
Abstract
We introduce FathomGPT, an open-source system for the interactive investigation of ocean science data via a natural language interface. FathomGPT was developed in close collaboration with marine scientists to enable researchers to explore and analyze the FathomNet image database. FathomGPT provides a custom information retrieval pipeline that leverages OpenAI's large language models to enable: creating complex queries to retrieve images, taxonomic information, and scientific measurements; mapping common names and morphological features to scientific names; generating interactive charts on demand; and searching by image or by specified patterns within an image. In designing FathomGPT, particular emphasis was placed on enhancing the user's experience by facilitating free-form exploration and optimizing response times. We present an architectural overview and implementation details of FathomGPT, along with a series of ablation studies that demonstrate the effectiveness of our approach to name resolution, fine-tuning, and prompt modification. We also present usage scenarios of interactive data exploration sessions and document feedback from ocean scientists and machine learning experts.
