OnlySportsLM: Optimizing Sports-Domain Language Models with SOTA Performance under Billion Parameters

Zexin Chen; Chengxi Li; Xiangyu Xie; Parijat Dube

TL;DR

The OnlySports collection, comprising OnlySportsLM, OnlySports Dataset, and OnlySports Benchmark, is introduced, providing a replicable blueprint for efficient AI development across various specialized fields.

Abstract

This paper explores the potential of a small, domain-specific language model trained exclusively on sports-related data. We investigate whether extensive training data combined with specially designed small model structures can overcome model size constraints. The study introduces the OnlySports collection, comprising OnlySportsLM, OnlySports Dataset, and OnlySports Benchmark. Our approach involves: 1) creating a massive 600-billion-token OnlySports Dataset from FineWeb, 2) optimizing the RWKV architecture for sports-related tasks, resulting in a 196M-parameter model with a 20-layer, 640-dimension structure, 3) training OnlySportsLM on part of the OnlySports Dataset, and 4) testing the resulting model on the OnlySports Benchmark. OnlySportsLM achieves a 37.62%/34.08% accuracy improvement over previous 135M/360M state-of-the-art models and matches the performance of larger models such as SmolLM 1.7B and Qwen 1.5B in the sports domain. Additionally, the OnlySports collection presents a comprehensive workflow for building high-quality, domain-specific language models, providing a replicable blueprint for efficient AI development across various specialized fields.
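Step 1 of the approach builds the dataset by filtering FineWeb, and the paper's outline lists a URL-filtering stage. The following is a minimal sketch of what such a stage might look like, assuming simple keyword matching on each record's URL. The keyword list, the function names, and the `url` field name are illustrative assumptions and are not taken from the paper.

```python
from urllib.parse import urlparse

# Hypothetical keyword list; the paper's actual filter terms are not given here.
SPORTS_KEYWORDS = ("sport", "nba", "nfl", "soccer", "football", "tennis")


def looks_like_sports_url(url: str) -> bool:
    """Return True if the host or path contains any sports keyword."""
    parsed = urlparse(url.lower())
    haystack = parsed.netloc + parsed.path
    return any(keyword in haystack for keyword in SPORTS_KEYWORDS)


def filter_records(records):
    """Yield only records whose assumed 'url' field passes the keyword test."""
    for record in records:
        if looks_like_sports_url(record.get("url", "")):
            yield record


if __name__ == "__main__":
    sample = [
        {"url": "https://www.espn.com/nba/story/123", "text": "..."},
        {"url": "https://example.com/cooking/pasta", "text": "..."},
    ]
    for kept in filter_records(sample):
        print(kept["url"])
```

A cheap URL pass like this would only narrow the candidate pool; the outline indicates that a sports text classifier is applied afterwards to judge the document content itself.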

Paper Structure (27 sections, 4 figures, 4 tables)

This paper contains 27 sections, 4 figures, 4 tables.

Introduction
Contributions
Collection of Domain Data
URL Filtering
Sports Text Classifier
Data Filtering and Conversion
Optimizing Model Structure for Sports Domain
Training Setup
OnlySports Benchmark
Tag and Partial Sentence Generation
Model Inference and Evaluation Using SOTA LLMs
Depth and Width Experiments
Experiments
Experimental Settings
Main Results
...and 12 more sections

Figures (4)

Figure 1: Data pipeline to create OnlySports Dataset
Figure 2: Performance comparison with varying depths and widths on OnlySports Benchmark and general zero-shot evaluations
Figure 3: OnlySportsLM training loss over time with varying learning rates. The graph shows how loss fluctuates as we adjust the learning rate, starting from higher rates and gradually decreasing to stabilize training and reduce loss spikes. This insight is shared by the author of RWKV (Peng et al., 2024).
Figure 4: Evolution of OnlySportsLM performance across training steps. Left graph shows OnlySports Benchmark improving steadily. Right graphs display progress on general tasks, exhibiting upward trends despite fluctuations.
