A Natural Language Interface for Efficient Data Retrieval in SDSS
Prathamesh Tamhane
TL;DR
The paper tackles the barrier to querying SDSS data by fine-tuning a compact transformer, Phi-2, on a domain-specific NL-SQL corpus to serve as a natural language interface for SkyServer. It builds a diverse NL-SQL dataset from tutorials, paraphrase augmentation, and synthetic generation, then applies LoRA adapters to enable efficient, offline deployment. Results show strong syntactic reliability (≈94% SQL validity) and moderate semantic accuracy (60–70%) that leaves room for improvement, highlighting the value of domain grounding and dataset quality. The work demonstrates the feasibility and practical impact of lightweight natural language interfaces to databases (NLIDBs) in astronomy, with clear pathways to scale to future, larger surveys such as LSST, DESI, and SKA.
Abstract
Modern astronomical surveys such as the Sloan Digital Sky Survey (SDSS) provide extensive databases that give researchers access to vast amounts of diverse data. However, retrieving data from these archives requires knowledge of query languages and familiarity with the database schema, which presents a barrier for non-experts. This work investigates the use of Microsoft Phi-2, a compact yet powerful transformer-based language model, fine-tuned on natural language–SQL pairs constructed from SDSS query examples. We develop an interface that translates user queries in natural language into SQL commands compatible with SDSS SkyServer. Preliminary evaluation shows that the fine-tuned model produces syntactically valid and largely semantically correct queries across a variety of astronomy-related requests. Our results show that even small-scale models, when carefully fine-tuned, can provide effective domain-specific natural language interfaces for large scientific databases.
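To make the notion of a natural language–SQL pair concrete, the following minimal Python sketch shows one way such a pair could be represented and serialized into fine-tuning text. The `NLSQLPair` class, the `### Question:` / `### SQL:` prompt template, and the example question are illustrative assumptions, not the format used in this work; the example query is written in the SkyServer SQL dialect against the SDSS `Galaxy` view.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NLSQLPair:
    """One training example: a natural language request and its SQL translation."""
    question: str
    sql: str


# Hypothetical prompt template; the actual format used for fine-tuning is not
# specified in the text above.
PROMPT_TEMPLATE = "### Question:\n{question}\n### SQL:\n"


def build_prompt(question: str) -> str:
    """Format a user question as the prompt the model would complete at inference time."""
    return PROMPT_TEMPLATE.format(question=question.strip())


def build_training_text(pair: NLSQLPair) -> str:
    """Concatenate the prompt and the target SQL into a single training string."""
    return build_prompt(pair.question) + pair.sql.strip()


# Illustrative pair (an assumption, not taken from the paper's dataset).
example = NLSQLPair(
    question="List the coordinates of 10 galaxies with r-band magnitude brighter than 18.",
    sql="SELECT TOP 10 objID, ra, dec FROM Galaxy WHERE r < 18",
)

if __name__ == "__main__":
    print(build_training_text(example))
```

At inference time, only `build_prompt` would be applied to the user's request, and the text generated by the model after the `### SQL:` marker would be taken as the candidate query.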
