Language Models as a Service: Overview of a New Paradigm and its Challenges

Emanuele La Malfa, Aleksandar Petrov, Simon Frieder, Christoph Weinhuber, Ryan Burnell, Raza Nazar, Anthony G. Cohn, Nigel Shadbolt, Michael Wooldridge

TL;DR

The paper analyzes the Language-Models-as-a-Service (LMaaS) paradigm, identifying four core challenges—accessibility, replicability, reliability, and trustworthiness—that arise from centralized, pay-per-use interfaces and limited model transparency. It surveys licensing landscapes, deployment practices, and evaluation issues, highlighting data and user contamination, non-determinism, and emergent behavior as key reliability and benchmarking obstacles. Through a synthesis of current knowledge and case studies, it offers a tentative, community-driven agenda with concrete recommendations for accessibility, legacy access, benchmarking, data provenance, and robust explainability. The work aims to guide researchers and providers toward LMaaS ecosystems that are more open to audit, reproducible under evolving deployments, and trustworthy in decision-making, including safety-critical contexts.

Abstract

Some of the most powerful language models currently are proprietary systems, accessible only via (typically restrictive) web or software programming interfaces. This is the Language-Models-as-a-Service (LMaaS) paradigm. In contrast with scenarios where full model access is available, as in the case of open-source models, such closed-off language models present specific challenges for evaluating, benchmarking, and testing them. This paper has two goals: on the one hand, we delineate how the aforementioned challenges act as impediments to the accessibility, replicability, reliability, and trustworthiness of LMaaS. We systematically examine the issues that arise from a lack of information about language models for each of these four aspects. We conduct a detailed analysis of existing solutions and put forth a number of considered recommendations, and highlight the directions for future advancements. On the other hand, it serves as a comprehensive resource for existing knowledge on current, major LMaaS, offering a synthesized overview of the licences and capabilities their interfaces offer.

Paper Structure

This paper contains 23 sections, 5 figures, 1 table.

Figures (5)

  • Figure 1: Illustration of the difference between interacting with LMaaS and user-controlled LMs. Most LMs that are offered in full provide access to a model's internals (e.g., its weights) and list details on the training procedure and instructions on executing the models locally (the so-called model card; Wolf et al., 2019). In most cases, this allows users to run (or change) these models on the hardware of their choice. On the other hand, LMaaS are accessible through a web interface or an API (for illustrative purposes, in the diagram above, APIs resemble those of OpenAI). They are powered by LMs that run behind the scenes, typically on computational infrastructures controlled by third parties, and little information about the model is exposed to the user.
  • Figure 2: LMs, sourced from the https://huggingface.co/models library, grouped by the software licence under which they are provided. Half of the LMs on Huggingface are provided under the Apache-2.0 licence, followed by the MIT licence and the OpenRAIL licence, a new licence specifically devised for machine-learning applications. Last accessed 01/08/2023.
  • Figure 3: Setting the temperature $T$ to zero typically makes LMaaS deterministic, as the probability of sampling the next word concentrates on a single word. On the other hand, for $T>0$, LMaaS become progressively more creative as $T$ grows, because the distribution over which they sample becomes progressively flatter.
  • Figure 4: Extensive benchmarking techniques (top) suffer from poor inspectability, as the number of results grows cubically with the number of models, datasets, and metrics involved. On the other hand, techniques that cluster models based on dimensionality reduction and distilled latent factors (bottom) aggregate multiple datasets and/or metrics but suffer from poor interpretability.
  • Figure 5: LMaaS can be used to generate explanations. On the left is a real interaction with GPT-4 (dated September 21, 2023), asked to solve a sentiment analysis task. In this instance the reasoning is sound, but we have no guarantee of correctness. On the right, the workings of a fictitious decision tree that solves the same sentiment analysis task are illustrated on text represented as $2$-grams. While the tree can misclassify the input as "positive", once the input has been represented as vectors it is possible to trace the reason for the misclassification back to the presence of the $2$-gram "really good", which affected some of the decisions that were taken. In this sense, techniques like decision trees are self-explaining, as they embed explanations from which one can derive the model's decision-making process.
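
The temperature behaviour described in the Figure 3 caption can be illustrated with a minimal sketch of temperature-scaled softmax. This is not taken from the paper or from any provider's API: the function names, the example logits, and the treatment of $T=0$ as greedy (argmax) selection are assumptions made for illustration only. Real services may apply further sampling steps that this sketch ignores.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Return next-word probabilities for raw scores at a given temperature.

    Hypothetical helper for illustration; not part of any LMaaS API.
    """
    if temperature < 0:
        raise ValueError("temperature must be non-negative")
    if temperature == 0:
        # Assumed limit T -> 0: all probability mass goes to the
        # highest-scoring word (ties resolved in favour of the first one).
        best = max(range(len(logits)), key=lambda i: logits[i])
        return [1.0 if i == best else 0.0 for i in range(len(logits))]
    scaled = [x / temperature for x in logits]
    shift = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(s - shift) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    """Shannon entropy in nats; higher values mean a flatter distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


# Made-up scores for four candidate next words (illustrative values only).
logits = [2.0, 1.0, 0.5, -1.0]

for t in (0.0, 0.5, 1.0, 2.0):
    probs = softmax_with_temperature(logits, t)
    print(f"T={t}: probs={[round(p, 3) for p in probs]}, "
          f"entropy={entropy(probs):.3f}")
```

Running the loop shows the distribution collapsing onto a single word at $T=0$ and its entropy increasing as $T$ grows, which is the flattening effect the caption refers to.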