TravelBench : Exploring LLM Performance in Low-Resource Domains

Srinivas Billa, Xiaonan Jing

TL;DR

This work addresses the lack of travel-domain, low-resource benchmarks by introducing 14 tasks across 7 NLP categories and evaluating 67 LLMs using a uniform prompt-based setup. By analysing scaling behaviour and internal reasoning, the study shows that increases in training FLOPs yield diminishing returns beyond roughly $0.5 \times 10^{16}$, and that reasoning provides a larger boost for smaller models while offering limited gains for very large models. The results highlight domain-adaptation challenges even for high-capacity models and show that HELM-based, task-specific evaluation exposes nuanced strengths and weaknesses across generation tasks, including translation and summarisation. The findings underscore the need for domain-aware benchmarking and cost-efficient deployment strategies in travel-domain NLP, where practical performance depends on model size, reasoning capabilities, and task type.

Abstract

Results on existing LLM benchmarks capture little information about model capabilities in low-resource tasks, making it difficult to develop effective solutions in these domains. To address this gap, we curated 14 travel-domain datasets spanning 7 common NLP tasks using anonymised data from real-world scenarios, and analysed performance across LLMs. We report on the accuracy, scaling behaviour, and reasoning capabilities of LLMs in a variety of tasks. Our results confirm that general benchmarking results are insufficient for understanding model performance in low-resource tasks. Regardless of the amount of training FLOPs, out-of-the-box LLMs hit performance bottlenecks in complex, domain-specific scenarios. Furthermore, reasoning provides a more significant boost for smaller LLMs by making the model a better judge on certain tasks.

Paper Structure

This paper contains 25 sections, 3 equations, 3 figures, 1 table.

Figures (3)

  • Figure 1: Performance $P_m$ against training compute (FLOPs). While performance generally improves with scale, returns diminish significantly past $0.5 \times 10^{16}$.
  • Figure 2: The effect of enabling reasoning across the Qwen3 model family. While smaller models show some improvement, the same does not apply to large models: performance degrades for the 235B model.
  • Figure 3: Variance of model performance across tasks for the Qwen3 family. While overall performance increases with model size, the spread of performance does not follow the same pattern. This indicates that performance is highly task dependent and that no single model is best at every task.
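
The distinction Figure 3 draws between overall performance and its spread across tasks can be made concrete with a small sketch. The model names, task names, and scores below are hypothetical placeholders, not results from the paper. Using the population standard deviation as the measure of "spread" is likewise an assumption made for illustration; the paper's own variance computation may differ.

```python
from statistics import mean, pstdev

# Hypothetical per-task scores in [0, 1]; NOT taken from the paper.
scores = {
    "small-model": {"classification": 0.60, "translation": 0.55, "summarisation": 0.50},
    "large-model": {"classification": 0.90, "translation": 0.50, "summarisation": 0.70},
}


def summarise(task_scores):
    """Return (mean, population standard deviation) over one model's task scores."""
    values = list(task_scores.values())
    return mean(values), pstdev(values)


def best_model_per_task(all_scores):
    """Map each task to the model with the highest score on that task."""
    tasks = next(iter(all_scores.values())).keys()
    return {t: max(all_scores, key=lambda m: all_scores[m][t]) for t in tasks}


if __name__ == "__main__":
    for model, task_scores in scores.items():
        avg, spread = summarise(task_scores)
        print(f"{model}: mean={avg:.3f} spread={spread:.3f}")
    print(best_model_per_task(scores))
```

In this toy data the larger model has the higher mean but also the larger spread, and the smaller model still wins one task. This is the kind of pattern the caption describes: a higher average does not imply uniform gains on every task.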