Flow with FlorDB: Incremental Context Maintenance for the Machine Learning Lifecycle

Rolando Garcia; Pragya Kallanagoudar; Chithra Anand; Sarah E. Chasins; Joseph M. Hellerstein; Erin Michelle Turner Kerrison; Aditya G. Parameswaran

TL;DR

This paper presents FlorDB, a system that harvests metadata from machine learning pipelines incrementally through log statements, hindsight logging, and queryable relational views. It shows how these techniques together resolve the classic software engineering trade-off between agility and discipline, and how the data context framework covers both ad-hoc metadata and the special cases handled today by bespoke feature stores and model repositories.

Abstract

In this paper we present techniques to incrementally harvest and query arbitrary metadata from machine learning pipelines, without disrupting agile practices. We center our approach on the developer-favored technique for generating metadata -- log statements -- leveraging the fact that logging creates context. We show how hindsight logging allows such statements to be added and executed post-hoc, without requiring developer foresight. Relational views of incomplete metadata can be queried to dynamically materialize new metadata in bulk and on demand across multiple versions of workflows. This is done in a "metadata later" style, off the critical path of agile development. We realize these ideas in a system called FlorDB and demonstrate how the data context framework covers both ad-hoc metadata and special cases treated today by bespoke feature stores and model repositories. Through a usage scenario -- including both ML and human feedback -- we illustrate how the component techniques come together to resolve classic software engineering trade-offs between agility and discipline.
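The "metadata later" idea in the abstract can be illustrated with a small toy model. The sketch below is not FlorDB's API: the class `ToyContextStore`, its methods `run`, `view`, and `backfill`, and the `accuracy` helper are all hypothetical names introduced here. It also assumes each version's run state is kept in memory, as a stand-in for whatever mechanism a real system uses to re-execute past versions. It shows a relational view with a gap for an older version, and a log statement added after the fact filling that gap across versions.

```python
class ToyContextStore:
    """Toy stand-in for a metadata store; NOT FlorDB's actual API."""

    def __init__(self):
        self.versions = {}  # version id -> saved run state (assumed stand-in for replay)
        self.records = {}   # (version id, log name) -> logged value

    def run(self, version, state, statements):
        """Execute one pipeline version, recording each log statement's value."""
        self.versions[version] = state
        for name, compute in statements.items():
            self.records[(version, name)] = compute(state)

    def view(self, *names):
        """One row per version, one column per name; None marks missing metadata."""
        return {
            version: {name: self.records.get((version, name)) for name in names}
            for version in self.versions
        }

    def backfill(self, name, compute):
        """Hindsight step: evaluate a new log statement on every version lacking it."""
        for version, state in self.versions.items():
            if (version, name) not in self.records:
                self.records[(version, name)] = compute(state)


def accuracy(state):
    """Fraction of predictions that match the labels."""
    hits = sum(p == y for p, y in zip(state["preds"], state["labels"]))
    return hits / len(state["labels"])


def count(state):
    """Number of labeled examples in the run."""
    return len(state["labels"])


store = ToyContextStore()
# Version v1 was written before anyone thought to log accuracy.
store.run("v1", {"preds": [1, 0, 1, 1], "labels": [1, 0, 0, 1]}, {"n": count})
# Version v2 logs accuracy from the start.
store.run("v2", {"preds": [1, 0, 0, 1], "labels": [1, 0, 0, 1]},
          {"n": count, "accuracy": accuracy})

before = store.view("n", "accuracy")  # v1 has a gap in the accuracy column
store.backfill("accuracy", accuracy)  # add the log statement in hindsight
after = store.view("n", "accuracy")   # the gap is now filled
```

In this toy, `before["v1"]["accuracy"]` is `None` while `after["v1"]["accuracy"]` holds a computed value, which mirrors querying a view of incomplete metadata and then materializing the missing entries on demand.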

Paper Structure (17 sections, 6 figures)

Introduction
A Crisis of Metadata Management
Goals and Contributions
Multiversion Hindsight Logging
FlorDB Extended API
Incremental Context Maintenance
Application Context
Behavioral & Change Context
PDF Parser Demo
PDF Extraction & Text Featurization
Inference Pipeline
Training Pipeline
Closing the Loop: Feedback via UI
Discussion
Implications for Social Justice Research
...and 2 more sections

Figures (6)

Figure 1: Extended FlorDB data model in Crow's Foot notation. Basic tables denoted in white; virtual tables in gray.
Figure 2: ML Pipeline with Feedback: Makefile, Dataflow Diagram, and Flor Dataframe.
Figure 3: Data featurization with FlorDB
Figure 4: Screenshot of the PDF Parser (left) and its respective Makefile (right).
Figure 5: Training on labeled data managed by FlorDB
...and 1 more figure
