Model Lakes
Koyena Pal, David Bau, Renée J. Miller
TL;DR
The paper introduces Model Lakes, a paradigm inspired by data lakes, for managing the growing set of pre-trained models and their often incomplete documentation. It formalizes three model perspectives (history, intrinsic, and extrinsic) and defines the core tasks of attribution, versioning, search, and benchmarking to support transparent model comparison. It surveys the current state of model management and proposes a roadmap spanning indexing, embeddings, weight-space modeling, interpretability, and data provenance integration, along with practical inference and benchmarking needs. It concludes by calling on the database community to build Agora-like, end-to-end platforms that unify provenance, discovery, and lifecycle management for heterogeneous models.
Abstract
Given a set of deep learning models, it can be hard to find models appropriate to a task, understand the models, and characterize how models differ from one another. Currently, practitioners rely on manually written documentation to understand and choose models. However, not all models have complete and reliable documentation. As the number of models increases, the challenges of finding, differentiating, and understanding models become increasingly crucial. Inspired by research on data lakes, we introduce the concept of model lakes. We formalize key model lake tasks, including model attribution, versioning, search, and benchmarking, and discuss fundamental research challenges in the management of large models. We also explore what data management techniques can be brought to bear on the study of large model management.
