StreamLink: Large-Language-Model Driven Distributed Data Engineering System
Dawei Feng, Di Mei, Huiri Tan, Lei Ren, Xianying Lou, Zhangxi Tan
TL;DR
The paper addresses how to provide a user-friendly natural-language interface to billions of distributed data records while preserving user privacy. It presents StreamLink, an architecture that combines Spark/Hadoop-based storage with local, domain-adapted LLMs for NL-to-SQL translation and a Llama-based syntax/security checker, delivering secure, NL-driven data access. The NL-to-SQL component is trained on context-target pairs $Z=\{(x_i,y_i)\}_{i=1}^N$ and optimizes $\max_{\Theta} \sum_{(x,y)\in Z} \sum_{t=1}^{|y|} \log\left(P_{\Phi_0+\Delta\Phi(\Theta)}(y_t \mid x, y_{<t})\right)$, in which only the parameters $\Theta$ defining the update $\Delta\Phi$ to the initial weights $\Phi_0$ are optimized, with an alternative $\max_{\Phi} \sum_{(x,y)\in Z} \sum_{t=1}^{|y|} \log\left(P_{\Phi}(y_t \mid x, y_{<t})\right)$ that optimizes all weights $\Phi$ directly. It also applies bi-directional data augmentation to adapt templates to domain schemas. Empirical results on a 180M-patent dataset and the Spider NL-to-SQL benchmark show gains of more than 10 percentage points in exact-match and execution accuracy; malicious-SQL interception with SSQLC3-8B achieves high recall and practical throughput, indicating strong, safe NL-driven querying capabilities. Collectively, these findings highlight StreamLink’s potential to transform data engineering by delivering secure, scalable, and accessible natural-language interfaces for large-scale data systems.
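To make the training objective concrete, the following minimal Python sketch evaluates the double sum of token log-probabilities over a toy dataset. It is illustrative only: the function names and the per-token probabilities are hypothetical, and no model or fine-tuning code from the paper is reproduced here.

```python
import math
from typing import Sequence


def sequence_log_likelihood(token_probs: Sequence[float]) -> float:
    """Inner sum: log P(y_t | x, y_<t) added up over the tokens of one target y."""
    return sum(math.log(p) for p in token_probs)


def objective(dataset_token_probs: Sequence[Sequence[float]]) -> float:
    """Outer sum over all (x, y) pairs in Z; training would maximize this value."""
    return sum(sequence_log_likelihood(tp) for tp in dataset_token_probs)


# Hypothetical probabilities a model assigns to the correct SQL tokens
# for two context-target pairs (assumed values, not results from the paper).
toy_probs = [[0.9, 0.8, 0.95], [0.7, 0.6]]
value = objective(toy_probs)
print(f"log-likelihood objective: {value:.4f}")
```

The value is always at most zero and reaches zero only when every correct token receives probability 1, so maximizing it pushes the model toward the reference SQL.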
Abstract
Large Language Models (LLMs) have shown remarkable proficiency in natural language understanding (NLU), opening doors for innovative applications. We introduce StreamLink, an LLM-driven distributed data system designed to improve the efficiency and accessibility of data engineering tasks. We build StreamLink on top of distributed frameworks such as Apache Spark and Hadoop to handle large data at scale. A core design principle of StreamLink is to respect user data privacy by using locally fine-tuned LLMs instead of a public AI service such as ChatGPT. With domain-adapted LLMs, the system better understands users' natural-language queries in various scenarios and simplifies the generation of database queries, such as Structured Query Language (SQL) statements, for information processing. We also incorporate LLM-based syntax and security checkers to guarantee the reliability and safety of each generated query. StreamLink illustrates the potential of merging generative LLMs with distributed data processing for comprehensive and user-centric data engineering. With this architecture, users can interact with complex database systems at different scales in a user-friendly and secure manner: SQL generation improves execution accuracy by over 10% compared to baseline methods, and users can find the most relevant item among hundreds of millions of items within a few seconds using natural language.
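The flow described in the abstract (natural-language question, generated SQL, syntax and security checks, execution) can be sketched as follows. This is a minimal illustration under stated assumptions, not StreamLink's implementation: `generate_sql` is a hypothetical canned lookup standing in for the fine-tuned NL-to-SQL model, `is_safe` is a simple rule-based stand-in for the LLM-based security checker, and an in-memory SQLite database replaces the Spark/Hadoop backend. The `patents` table and its rows are invented for the example.

```python
import re
import sqlite3

# Keywords that a read-only query should never contain (assumed policy).
FORBIDDEN = re.compile(r"\b(drop|delete|update|insert|alter|truncate)\b", re.IGNORECASE)


def generate_sql(question: str) -> str:
    """Hypothetical stand-in for the fine-tuned NL-to-SQL model."""
    canned = {
        "how many patents mention batteries?":
            "SELECT COUNT(*) FROM patents WHERE title LIKE '%battery%'",
    }
    return canned[question.lower()]


def is_safe(sql: str) -> bool:
    """Rule-based stand-in for the security checker: one read-only SELECT."""
    stripped = sql.strip().rstrip(";")
    return (
        stripped.lower().startswith("select")
        and ";" not in stripped              # reject stacked statements
        and not FORBIDDEN.search(stripped)   # reject write/DDL keywords
    )


def is_valid(conn: sqlite3.Connection, sql: str) -> bool:
    """Syntax check: ask the engine to plan the query without running it."""
    try:
        conn.execute("EXPLAIN " + sql)
        return True
    except sqlite3.Error:
        return False


def answer(conn: sqlite3.Connection, question: str) -> list:
    """Generate SQL, run both checks, and execute only if they pass."""
    sql = generate_sql(question)
    if not is_safe(sql) or not is_valid(conn, sql):
        raise ValueError(f"query rejected: {sql}")
    return conn.execute(sql).fetchall()


# Toy data standing in for the distributed store (invented rows).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE patents (id INTEGER PRIMARY KEY, title TEXT)")
conn.executemany(
    "INSERT INTO patents (title) VALUES (?)",
    [("Solid-state battery cell",), ("Wind turbine blade",), ("Battery cooling plate",)],
)
result = answer(conn, "How many patents mention batteries?")
print(result)
```

Running the checks before execution means a rejected query never reaches the data store, which is the property the abstract attributes to the syntax and security checkers.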
