Adapting Methods for Domain-Specific Japanese Small LMs: Scale, Architecture, and Quantization

Takato Yasuno

Abstract

This paper presents a systematic methodology for building domain-specific Japanese small language models using QLoRA fine-tuning. We address three core questions: optimal training scale, base-model selection, and architecture-aware quantization. Stage 1 (Training scale): Scale-learning experiments (1k to 5k samples) identify n=4,000 as optimal, where test-set NLL reaches its minimum (1.127) before overfitting sets in at 5k samples. Stage 2 (Comparing fine-tuned SLMs): A comparison of four Japanese LLMs shows that Llama-3 models with Japanese continual pre-training (Swallow-8B, ELYZA-JP-8B) outperform multilingual models (Qwen2.5-7B). Stage 3 (Quantization): Llama-3 architectures improve under Q4_K_M quantization, while GQA architectures degrade severely (Qwen2.5: -0.280 points). Production recommendation: Swallow-8B Q4_K_M achieves a score of 2.830/3 at 8.9 s/question with a 4.9 GB model size. The methodology generalizes to low-resource technical domains and provides actionable guidance for compact Japanese specialist LMs on consumer hardware.

Paper Structure (69 sections, 8 equations, 11 figures, 6 tables)

Figures (11)

  • Figure 1: Three-stage progressive optimization methodology. Stage 1 identifies optimal training scale (n=4,000) via test-set NLL analysis across five dataset sizes. Stage 2 compares four Japanese LLMs under identical QLoRA training, revealing Japanese continual pre-training advantage (Swallow-8B: 2.830/3). Stage 3 discovers that Llama-3 models improve under Q4_K_M quantization (+0.01 to +0.03 points). Dashed arrows indicate data flow between stages.
  • Figure 2: Relation-type distribution across five training scales. HAS_CHAPTER proportion stabilizes at 17.1% across all scales after undersampling, confirming uniform distribution scaling without introducing bias.
  • Figure 3: Answer length stability across five training scales. Average answer length remains consistent (370-380 characters) across all subset sizes, indicating that scale variation does not affect response complexity.
  • Figure 4: Quality metrics summary across five training scales. All quality metrics show stable distributions with a coefficient of variation below 3%, confirming that scale variation does not introduce confounding factors in dataset composition.
  • Figure 5: Training loss curves for five dataset scales (n=1,000 to 5,000). All models exhibit a characteristic three-phase pattern: rapid initial descent (epoch 0-0.4), plateau (1.0-2.0), and late-stage collapse (2.2-3.0). Larger datasets maintain consistently lower loss throughout training, with final loss decreasing from 0.869 (n=1,000) to 0.767 (n=5,000).
  • ...and 6 more figures
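
The Figure 4 caption uses a coefficient of variation (CV) below 3% as its stability criterion. As a rough illustration of that check, the sketch below computes CV = standard deviation / mean for one metric across five subsets. The subset sizes and per-subset values are hypothetical, picked only to fall inside the 370-380 character range quoted for Figure 3, and the use of the population standard deviation is an assumption, since the captions do not say which estimator the paper uses.

```python
from statistics import mean, pstdev


def coefficient_of_variation(values):
    """Return the population coefficient of variation as a percentage."""
    mu = mean(values)
    if mu == 0:
        raise ValueError("CV is undefined when the mean is zero")
    return 100.0 * pstdev(values) / mu


# Hypothetical average answer lengths (characters) per training subset.
# These are NOT values from the paper; they only sit inside the
# 370-380 character range reported in the Figure 3 caption.
avg_answer_length = {1000: 372, 2000: 378, 3000: 375, 4000: 371, 5000: 379}

cv_percent = coefficient_of_variation(list(avg_answer_length.values()))
is_stable = cv_percent < 3.0  # threshold quoted in the Figure 4 caption

print(f"CV = {cv_percent:.2f}%, stable = {is_stable}")
```

Any set of values confined to a 370-380 band has a CV well under 3%, which is consistent with the stability claim in the captions; the same function could be applied to the other quality metrics once their per-scale values are available.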