Is Open Source the Future of AI? A Data-Driven Approach

Domen Vake, Bogdan Šinik, Jernej Vičič, Aleksandar Tošić

TL;DR

This paper investigates whether open-source AI can advance LLM development while addressing privacy, transparency, and misuse concerns. The authors build a data-driven assessment from Hugging Face leaderboards and HFCommunity data, collecting model metadata, contributions, and benchmark performance across ARC, HellaSwag, MMLU, TruthfulQA, Winograd, and GSM8K. They show that open-source contributions can improve model performance, with a trend toward smaller models at a manageable loss in accuracy; they also note a rise in fine-tuned and chat models and a concentration of activity around a few base models such as Llama and Mistral. They argue for a balanced view of openness, highlighting how community-driven refinements and incentives interact with base-model providers, and discuss policy and future research directions for putting openness in AI into practice.

Abstract

Large Language Models (LLMs) have become central in academia and industry, raising concerns about privacy, transparency, and misuse. A key issue is the trustworthiness of proprietary models, with open-sourcing often proposed as a solution. However, open-sourcing presents challenges, including potential misuse, financial disincentives, and intellectual property concerns. Proprietary models, backed by private sector resources, are better positioned for return on investment. Other approaches lie on the spectrum between completely open-source and proprietary. These can largely be categorised into three groups: open-source models with usage limitations protected by licensing; partially open-source (open weights) models; and hybrid approaches in which obsolete model versions are open-sourced while competitive versions with market value remain proprietary. Currently, discussion of where on this spectrum future models should fall remains largely opinion-based and unsupported by data, with industry leaders weighing in. In this paper, we present a data-driven approach by compiling data on the open-source development of LLMs and the resulting contributions in terms of improvements, modifications, and methods. Our goal is not to support either extreme but to present data that will inform future discussions among both industry experts and policy makers. Our findings indicate that open-source contributions can enhance model performance, with trends such as reduced model size and manageable accuracy loss. We also identify positive community engagement patterns and architectures that benefit most from open contributions.

Paper Structure

This paper contains 5 sections and 12 figures.

Figures (12)

  • Figure 1: Total authors and new authors on the leaderboard over time
  • Figure 2: Distribution of number of repositories per author
  • Figure 3: Distribution of number of authors per repository
  • Figure 4: Number of new models per week by type
  • Figure 5: Total models per week based on architecture
  • ...and 7 more figures