Tokenization vs. Augmentation: A Systematic Study of Writer Variance in IMU-Based Online Handwriting Recognition

Jindong Li, Dario Zanca, Vincent Christlein, Tim Hamann, Jens Barth, Peter Kämpf, Björn Eskofier

Abstract

Inertial measurement unit (IMU)-based online handwriting recognition enables the recognition of input signals collected across different writing surfaces but remains challenged by uneven character distributions and inter-writer variability. In this work, we systematically investigate two strategies to address these issues: sub-word tokenization and concatenation-based data augmentation. Our experiments on the OnHW-Words500 dataset reveal a clear dichotomy between handling inter-writer and intra-writer variance. On the writer-independent split, structural abstraction via Bigram tokenization significantly improves generalization to unseen writing styles, reducing the word error rate (WER) from 15.40% to 12.99%. In contrast, on the writer-dependent split, tokenization degrades performance due to vocabulary distribution shifts between the training and validation sets. Instead, our proposed concatenation-based data augmentation acts as a powerful regularizer, reducing the character error rate by 34.5% and the WER by 25.4%. Further analysis shows that short, low-level tokens benefit model performance and that the performance gain from concatenation-based data augmentation surpasses that achieved by proportionally extended training. These findings reveal a clear variance-dependent effect: sub-word tokenization primarily mitigates inter-writer stylistic variability, whereas concatenation-based data augmentation effectively compensates for intra-writer distributional sparsity.
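
The abstract reports results as character and word error rates. As a reading aid, the following is a minimal sketch of how such metrics are commonly computed, assuming the standard edit-distance definition of CER and one word per sample for WER. The paper's own evaluation code is not shown here, and the function names and the example words are illustrative assumptions.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (unit cost per edit)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,             # delete a reference symbol
                curr[j - 1] + 1,         # insert a hypothesis symbol
                prev[j - 1] + (r != h),  # substitute, or match at cost 0
            ))
        prev = curr
    return prev[-1]


def cer(refs, hyps):
    """Character error rate in percent: total edits / total reference characters."""
    edits = sum(edit_distance(r, h) for r, h in zip(refs, hyps))
    chars = sum(len(r) for r in refs)
    return 100.0 * edits / chars


def wer(refs, hyps):
    """Word error rate in percent, assuming each sample is a single word."""
    wrong = sum(r != h for r, h in zip(refs, hyps))
    return 100.0 * wrong / len(refs)


def relative_reduction(before, after):
    """Relative improvement in percent when an error rate drops from before to after."""
    return 100.0 * (before - after) / before


# Illustrative (made-up) references and predictions.
refs = ["haus", "baum", "tisch", "licht"]
hyps = ["haus", "bauml", "fisch", "licht"]
print(wer(refs, hyps))  # 50.0

# The WER drop quoted in the abstract, expressed as a relative reduction.
print(round(relative_reduction(15.40, 12.99), 2))  # 15.65
```

Under these definitions, the writer-independent WER improvement from 15.40% to 12.99% corresponds to a relative reduction of roughly 15.6%.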

Paper Structure (21 sections, 3 figures, 3 tables)

Figures (3)

  • Figure 1: Tokenization Evaluation. Performance comparison across varying vocabulary sizes for different tokenizers. The black dashed line represents the character-level baseline. Solid lines with blue squares, red triangles, and green circles denote the Bigram, BPE, and Unigram tokenizers, respectively. Complete results are provided in Appx. \ref{app:complete_results}.
  • Figure 2: Concatenation-Based Data Augmentation Evaluation. Performance comparison without augmentation (C0) and with two extra concatenations (C2) across varying vocabulary sizes. Dashed lines represent character-level baselines, where gray indicates no augmentation and black indicates augmented data. Solid lines with square, triangle, and circle markers denote the Bigram, BPE, and Unigram tokenizers, respectively. Lighter, semi-transparent colors represent models without augmentation (C0), whereas darker, opaque colors indicate models with augmentation (C2). Complete results are provided in Appx. \ref{app:complete_results}.
  • Figure 3: Character distribution of the right-handed OnHW-Words500 dataset. The upper and lower plots show the character distributions for the first fold of the writer-dependent (WD) and writer-independent (WI) splits, respectively. Blue bars represent character frequencies in the training sets, while orange bars represent frequencies in the validation sets.
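
The C0/C2 notation in the Figure 2 caption counts the extra concatenations applied per sample. The sketch below illustrates one way such a concatenation-based augmentation could look. It is not the authors' implementation: the random choice of partner samples, the retention of the original samples, the direct joining of labels without a separator, and the name `concat_augment` are all assumptions made for illustration.

```python
import random


def concat_augment(samples, extra=2, seed=0):
    """Hypothetical concatenation-based augmentation.

    samples: list of (signal, label) pairs, where signal is a list of
             per-timestep feature vectors and label is a string.
    extra:   number of additional samples appended to each original
             (extra=2 is meant to mirror "C2"; extra=0 adds plain copies).
    Returns the original samples followed by one concatenated sample
    per original (keeping the originals is an assumption).
    """
    rng = random.Random(seed)
    augmented = list(samples)
    for signal, label in samples:
        new_signal, new_label = list(signal), label
        for _ in range(extra):
            # Assumption: partners are drawn uniformly at random.
            other_signal, other_label = rng.choice(samples)
            new_signal = new_signal + list(other_signal)  # append along time
            new_label = new_label + other_label           # join target strings
        augmented.append((new_signal, new_label))
    return augmented


# Toy data: one 2-channel timestep per character, purely for illustration.
data = [
    ([[0.1, 0.2]] * 3, "ich"),
    ([[0.3, 0.4]] * 4, "haus"),
    ([[0.5, 0.6]] * 2, "ja"),
]
augmented = concat_augment(data, extra=2)
print(len(augmented))  # 6
```

Each augmented sample here spans three words' worth of signal, so both the input sequence and its target grow together.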