Small Language Models for Application Interactions: A Case Study

Beibin Li; Yi Zhang; Sébastien Bubeck; Jeevan Pathuri; Ishai Menache

TL;DR

This paper evaluates Small Language Models (SLMs) for enabling natural-language interactions with an internal Microsoft application for cloud supply chain fulfilment. It demonstrates that fine-tuned SLMs such as Phi-3, Llama 3, and Mistral v0.2 can achieve higher accuracy and faster responses than larger LLMs while using modest training data and enabling on-device deployment. The authors present a data-generation and fine-tuning pipeline, including templates, prompt evolution, and LoRA-based training, that teaches the models to translate user queries into Python code that invokes internal APIs. The study highlights practical implications for edge-friendly enterprise natural-language interfaces and provides design considerations for handling in-domain and out-of-domain queries, task routing, and cost.

Abstract

We study the efficacy of Small Language Models (SLMs) in facilitating application usage through natural language interactions. Our focus here is on a particular internal application used in Microsoft for cloud supply chain fulfilment. Our experiments show that small models can outperform much larger ones in terms of both accuracy and running time, even when fine-tuned on small datasets. Alongside these results, we also highlight SLM-based system design considerations.

Paper Structure (18 sections, 3 figures, 2 tables)

Introduction
Background and Motivation
Small Language Models
The Phi models:
Cloud supply chain fulfillment
Language-model system design
Natural-language interaction challenges.
SLM fine-tuning.
Output example.
Evaluation
Accuracy
Metrics.
Results.
Performance and cost
Token count and latency.
...and 3 more sections

Figures (3)

Figure 1: Overall accuracy as a function of the number of per-task examples. For LLMs, the examples are given in the prompt; for SLMs, they are used for offline fine-tuning.
Figure 2: LM-Sys logical flow. User queries are first processed to determine if they are in-domain. For in-domain queries, LM-Sys generates the relevant code snippet, which is executed to provide the answer. Otherwise, the SLM returns a default response (e.g., guiding the user about supported tasks).
Figure 3: An example of an interaction log from production.
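
The logical flow described in the Figure 2 caption can be sketched roughly as follows. This is only an illustrative outline, not the authors' implementation: `get_inventory`, `is_in_domain`, `generate_code`, and `execute` are hypothetical stand-ins (the real internal APIs and the fine-tuned SLM are not available here), and the keyword check merely takes the place of whatever in-domain decision the system actually makes.

```python
from typing import Callable, Dict

DEFAULT_RESPONSE = "I can only help with supported tasks, e.g. inventory lookups."

# Hypothetical stand-in for an internal API; the real API is not public.
def get_inventory(region: str) -> int:
    return {"west": 120, "east": 75}.get(region, 0)

# Names that a generated snippet is allowed to call (assumed for this sketch).
API: Dict[str, Callable] = {"get_inventory": get_inventory}

def is_in_domain(query: str) -> bool:
    # Placeholder keyword check standing in for the in-domain decision.
    return "inventory" in query.lower()

def generate_code(query: str) -> str:
    # Placeholder for the fine-tuned SLM, which would emit a Python snippet.
    region = "west" if "west" in query.lower() else "east"
    return f"get_inventory({region!r})"

def execute(snippet: str):
    # Evaluate the snippet with only the allowed API names visible.
    # A production system would need a properly sandboxed executor.
    return eval(snippet, {"__builtins__": {}}, dict(API))

def answer_query(query: str):
    # Out-of-domain queries get the default response; in-domain queries
    # are translated to code, and the code is executed to produce the answer.
    if not is_in_domain(query):
        return DEFAULT_RESPONSE
    return execute(generate_code(query))
```

Calling `answer_query("How much inventory is in the west region?")` runs the generated snippet against the stub API, while an unrelated query such as `answer_query("Tell me a joke")` falls through to `DEFAULT_RESPONSE`.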
