Laboratory-Scale AI: Open-Weight Models are Competitive with ChatGPT Even in Low-Resource Settings

Robert Wolfe; Isaac Slaughter; Bin Han; Bingbing Wen; Yiwei Yang; Lucas Rosenblatt; Bernease Herman; Eva Brown; Zening Qu; Nic Weber; Bill Howe

TL;DR

This paper evaluates whether small open-weight language models, fine-tuned in low-data regimes on commodity hardware, can match or exceed the performance of large closed models such as GPT-4-Turbo in zero-shot, few-shot, and fine-tuned settings. Using three open 7B models and two OpenAI baselines, the authors demonstrate that modest domain-specific fine-tuning yields competitive results on entity resolution, climate fact-checking, and clinical dialogue summarization, with substantial cost savings. They also investigate responsible-use aspects, finding privacy advantages for privately fine-tuned open models but mixed outcomes for bias and abstention, depending on the task and the amount of fine-tuning. The work advocates laboratory-scale AI as a feasible, transparent, and cost-effective alternative to proprietary APIs, particularly in risk-sensitive, resource-constrained contexts, while noting limitations around pretraining data transparency and resource availability.

Abstract

The rapid proliferation of generative AI has raised questions about the competitiveness of lower-parameter, locally tunable, open-weight models relative to high-parameter, API-guarded, closed-weight models in terms of performance, domain adaptation, cost, and generalization. Centering under-resourced yet risk-intolerant settings in government, research, and healthcare, we see for-profit closed-weight models as incompatible with requirements for transparency, privacy, adaptability, and standards of evidence. Yet the performance penalty in using open-weight models, especially in low-data and low-resource settings, is unclear. We assess the feasibility of using smaller, open-weight models to replace GPT-4-Turbo in zero-shot, few-shot, and fine-tuned regimes, assuming access to only a single, low-cost GPU. We assess value-sensitive issues around bias, privacy, and abstention on three additional tasks relevant to those topics. We find that with relatively low effort, very low absolute monetary cost, and relatively little data for fine-tuning, small open-weight models can achieve competitive performance in domain-adapted tasks without sacrificing generality. We then run experiments considering practical issues in bias, privacy, and hallucination risk, finding that open models offer several benefits over closed models. We intend this work as a case study in understanding the opportunity cost of reproducibility and transparency over for-profit state-of-the-art zero-shot performance, finding this cost to be marginal under realistic settings.

Paper Structure (25 sections, 4 figures, 7 tables)

Introduction
Related Work
Approach
Models
Defining Closed vs. Open Models
Closed Models
Open Models
Model Evaluation
Hyperparameters
Cloud Infrastructure
Multifaceted Evaluation of Open vs. Closed Models
Representative General Tasks
Performance: Fine-tuned Open Models Can Outperform Closed Models
Cost Analysis: Open Models Are More Affordable
Data Responsiveness: Modest Fine-tuning Can Make Open Models Competitive
...and 10 more sections

Figures (4)

Figure 1: We compared domain-specific performance, general-purpose usability, and amenability to responsible use of three open language models with two dominant closed models. We found that fine-tuning open models renders them competitive with few-shot closed models at low cost.
Figure 2: Left: Fine-tuning improvements emerge within the first 50% of the training data, which amounts to only a few hundred training samples for Medical Summarization and Entity Resolution. Right: Fine-tuned open models are competitive with fine-tuned GPT-3.5-Turbo with little data (1,000 fact-checking samples).
Figure 3: Models fine-tuned on a task using qLoRA offer strong zero-shot performance on other tasks, often stronger than the base model.
Figure 4: Increasing privacy (by decreasing $\epsilon$) leads to noisier gradients, delaying convergence; but privately trained open models do learn.
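
The relationship in Figure 4 between a smaller $\epsilon$ and noisier gradients can be illustrated with a minimal sketch. It assumes the classic Gaussian mechanism applied to a single clipped gradient, which is valid only for 0 < epsilon < 1 and is not necessarily the mechanism or privacy accounting the authors used. The names `gaussian_sigma`, `clip_gradient`, and `privatize_gradient`, and the `delta` and `clip_norm` values, are hypothetical and chosen for illustration.

```python
import math
import random


def gaussian_sigma(epsilon, delta, clip_norm):
    """Noise std for the classic Gaussian mechanism (assumption: 0 < epsilon < 1).

    The L2 sensitivity is taken to be clip_norm, i.e. one example's clipped
    gradient is added to or removed from the sum.
    """
    if not 0.0 < epsilon < 1.0:
        raise ValueError("this bound only holds for 0 < epsilon < 1")
    return clip_norm * math.sqrt(2.0 * math.log(1.25 / delta)) / epsilon


def clip_gradient(grad, clip_norm):
    """Rescale grad so that its L2 norm is at most clip_norm."""
    norm = math.sqrt(sum(g * g for g in grad))
    scale = min(1.0, clip_norm / norm) if norm > 0.0 else 1.0
    return [g * scale for g in grad]


def privatize_gradient(grad, epsilon, delta, clip_norm, rng):
    """Clip the gradient, then add independent Gaussian noise per coordinate."""
    sigma = gaussian_sigma(epsilon, delta, clip_norm)
    return [g + rng.gauss(0.0, sigma) for g in clip_gradient(grad, clip_norm)]


if __name__ == "__main__":
    rng = random.Random(0)
    grad = [3.0, -4.0]  # toy gradient with L2 norm 5
    for eps in (0.9, 0.5, 0.1):
        sigma = gaussian_sigma(eps, delta=1e-5, clip_norm=1.0)
        noisy = privatize_gradient(grad, eps, 1e-5, 1.0, rng)
        print(f"epsilon={eps}: sigma={sigma:.2f}, noisy gradient={noisy}")
```

Because the noise scale is proportional to 1/epsilon in this sketch, halving epsilon doubles the standard deviation of the noise added to each gradient coordinate, which is consistent with the delayed convergence the caption describes. Practical private training composes many such steps and tracks the total privacy budget with a dedicated accountant, which this sketch omits.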
