LocalGPT: Benchmarking and Advancing Large Language Models for Local Life Services in Meituan

Xiaochong Lan, Jie Feng, Jiahuan Lei, Xinlei Shi, Yong Li

TL;DR

The paper tackles the challenge of applying large language models to local life services, where domain shift and task diversity impede zero-shot generalization. It introduces LocalEval, a 41-task benchmark across four categories, and a multi-agent instruction synthesis pipeline (LocalInstruction) that produces high-quality fine-tuning data, complemented by agentic workflows for complex composite tasks. The experiments show that a 7B-parameter model fine-tuned on LocalInstruction can match a 72B model on LocalEval tasks, enabling cost-effective deployment; cross-city analyses reveal the importance of diverse domain data for transferability. Real-world deployments on Meituan demonstrate tangible business benefits in recommendation, search, and review ranking, validating the approach's practical value and scalability for local life platforms.

Abstract

Large language models (LLMs) have exhibited remarkable capabilities and achieved significant breakthroughs across various domains, leading to their widespread adoption in recent years. Building on this progress, we investigate their potential in the realm of local life services. In this study, we establish a comprehensive benchmark and systematically evaluate the performance of diverse LLMs across a wide range of tasks relevant to local life services. To further enhance their effectiveness, we explore two key approaches: model fine-tuning and agent-based workflows. Our findings reveal that even a relatively compact 7B model can attain performance levels comparable to a much larger 72B model, effectively balancing inference cost and model capability. This optimization greatly enhances the feasibility and efficiency of deploying LLMs in real-world online services, making them more practical and accessible for local life applications.

Paper Structure

This paper contains 37 sections, 6 figures, and 8 tables.

Figures (6)

  • Figure 1: An overview of our approach. We first develop the LocalEval benchmark to systematically evaluate LLMs' capabilities in understanding local life services. Using a multi-agent collaboration approach, we construct LocalInstruction to enhance LLMs' service understanding capabilities through fine-tuning. For composite tasks, we implement expert agents to further improve LLMs' problem-solving abilities.
  • Figure 2: Task and category-wise correlations on LocalEval.
  • Figure 3: Results of instruction tuning on the Qwen2.5 series. Through fine-tuning on LocalInstruction, Qwen2.5-7B can match the performance of the much larger Qwen2.5-72B.
  • Figure 4: Results of instruction tuning on the Llama3 series. Through fine-tuning on LocalInstruction, Llama3.1-8B can match the performance of the much larger Llama3.3-70B.
  • Figure 5: Results of ablation studies on Qwen2.5-7B.
  • ...and 1 more figure