TravelBench: Exploring LLM Performance in Low-Resource Domains
Srinivas Billa, Xiaonan Jing
TL;DR
This work addresses the shortage of travel-domain, low-resource benchmarks by introducing 14 tasks across 7 NLP categories and evaluating 67 LLMs using a uniform prompt-based setup. By analysing scaling behaviour and internal reasoning, the study shows that increases in training FLOPs yield diminishing returns beyond roughly $0.5 \times 10^{16}$, and that reasoning provides a larger boost for smaller models while offering limited gains for very large models. The results highlight domain-adaptation challenges even for high-capacity models and show that HELM-based, task-specific evaluation exposes nuanced strengths and weaknesses across generation tasks, including translation and summarisation. The findings underscore the need for domain-aware benchmarking and cost-efficient deployment strategies in travel-domain NLP, where practical performance depends on model size, reasoning capabilities, and task type.
Abstract
Results on existing LLM benchmarks capture little information about model capabilities in low-resource tasks, making it difficult to develop effective solutions in these domains. To address this challenge, we curated 14 travel-domain datasets spanning 7 common NLP tasks using anonymised data from real-world scenarios, and analysed performance across LLMs. We report on the accuracy, scaling behaviour, and reasoning capabilities of LLMs in a variety of tasks. Our results confirm that general benchmarking results are insufficient for understanding model performance in low-resource tasks. Regardless of the amount of training FLOPs, out-of-the-box LLMs hit performance bottlenecks in complex, domain-specific scenarios. Furthermore, reasoning provides a more significant boost for smaller LLMs by making the model a better judge on certain tasks.
