Synthesizing and Adapting Error Correction Data for Mobile Large Language Model Applications

Yanxiang Zhang, Zheng Xu, Shanshan Wu, Yuanbo Zhang, Daniel Ramage

TL;DR

The paper tackles domain shift in mobile error correction (EC) in two steps. First, it generates a large-scale, high-quality synthetic EC dataset by prompting LLMs with mobile-domain knowledge. Second, it aligns offline evaluation with production metrics through a privacy-preserving reweighting model that uses scores from small LMs trained with differentially private federated learning (DP FL) and a handful of live A/B test metrics. Continued training of a billion-parameter LLM with LoRA on a mixture of the original data and the reweighted synthetic data yields consistent improvements in both offline evaluation and production, with relative gains of up to 7.18% in KPIs. The work also offers best practices for data mixing and domain adaptation, and emphasizes privacy safeguards for using in-domain data responsibly in mobile LLM applications.

Abstract

Error correction is an important capability when applying large language models (LLMs) to facilitate user typing on mobile devices. In this paper, we use LLMs to synthesize a high-quality dataset of error correction pairs to evaluate and improve LLMs for mobile applications. We first prompt LLMs with error correction domain knowledge to build a scalable and reliable addition to the existing data synthesis pipeline. We then adapt the synthetic data distribution to match the mobile application domain by reweighting the samples. The reweighting model is learnt by predicting (a handful of) live A/B test metrics when deploying LLMs in production, given the LLM performance on offline evaluation data and scores from a small privacy-preserving on-device language model. Finally, we present best practices for mixing our synthetic data with other data sources to improve model performance on error correction in both offline evaluation and production live A/B testing.

Paper Structure

This paper contains 10 sections, 2 equations, 9 figures, 3 tables.

Figures (9)

  • Figure 1: Examples of mobile LLM applications for error correction. User typing data has a domain shift compared to public web data. LLMs rewrite and correct highly corrupted text based on the context of the input itself.
  • Figure 2: Statistics of the 20k clusters for 100 million documents. The mean and standard deviation of the cluster sizes are $5225 \pm 2972$.
  • Figure 3: Good ratio for error correction on the (a) original validation data and (b) synthetic validation data. The models are trained with the original training data, synthetic training data, and vanilla sampling of synthetic data without clustering. LLMs are used to judge whether the EC output is acceptable in order to compute the good ratio. Our large-scale, LLM-assisted synthetic data works well on both domains, even though there is a potential distribution shift from the original dataset collected by error detection on public web data.
  • Figure 4: Comparing the (a) heuristic $\{0, 1\}$ reweighting in Wu et al. (2024) and (b) our reweighting model $w(\theta, \cdot) = 0.01 + 1.99\,\sigma(40.64 S_f - 30.44 S_p - 1.59)$. Both methods use scores from a public pre-trained small LM, $S_p$, and from the same model further fine-tuned with DP FL, $S_f$. The learnt scores in (b) overlap substantially with the manual selection in (a) from Wu et al. (2024).
  • Figure 5: Good ratio for the best of the top 3 candidates for error correction on the (a) original validation data and (b) synthetic validation data. Solid lines reweight the samples by the $w(\theta, \cdot)$ model learnt to fit live A/B test metrics in \ref{sec:adapt}. The models are trained with synthetic training data in the same setting as in \ref{fig:top3}; synthetic data with a $4\times$ larger batch size (synth_lb); a mixture of original and synthetic data; and a mixture of original data and synthetic data filtered by $w(\theta, \cdot)$ (mix_fil). LLMs are used to judge whether the error correction output is acceptable in order to compute the good ratio.
  • ...and 4 more figures
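
The reweighting model quoted in the Figure 4 caption can be made concrete with a short sketch. The coefficients below are copied from that caption. Everything else is an assumption made for illustration: the function names, the example score values, and the use of the weights in a weighted average of LLM-judge decisions (one plausible reading of "reweight the samples" in the Figure 5 caption, not a confirmed description of the authors' evaluation code). The scale of $S_f$ and $S_p$ is not stated in this excerpt.

```python
import math

# Coefficients copied from the Figure 4 caption:
#   w = 0.01 + 1.99 * sigmoid(40.64 * S_f - 30.44 * S_p - 1.59)
W_MIN, W_RANGE = 0.01, 1.99
COEF_F, COEF_P, BIAS = 40.64, -30.44, -1.59


def sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def sample_weight(s_f: float, s_p: float) -> float:
    """Weight of one sample given its DP FL fine-tuned score s_f and
    its public pre-trained score s_p. The result lies in [0.01, 2.0]."""
    return W_MIN + W_RANGE * sigmoid(COEF_F * s_f + COEF_P * s_p + BIAS)


def weighted_good_ratio(judgments, weights):
    """Weighted fraction of outputs judged acceptable (1) vs. not (0).
    ASSUMPTION: a plain weighted mean; the paper's exact metric may differ."""
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    return sum(w * g for g, w in zip(judgments, weights)) / total


if __name__ == "__main__":
    # Hypothetical (s_f, s_p) scores and judge decisions for three samples.
    scores = [(0.30, 0.10), (0.10, 0.30), (0.20, 0.20)]
    judged_good = [1, 0, 1]
    weights = [sample_weight(s_f, s_p) for s_f, s_p in scores]
    print(weights)
    print(weighted_good_ratio(judged_good, weights))
```

Because the coefficient on $S_f$ is positive and the coefficient on $S_p$ is negative, a sample receives more weight when the DP FL fine-tuned model scores it higher relative to the public model, and the affine wrapper keeps every weight between 0.01 and 2.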