WordVIS: A Color Worth A Thousand Words
Umar Khan, Saifullah, Stefan Agne, Andreas Dengel, Sheraz Ahmed
TL;DR
WordVIS tackles the data- and compute-hungry nature of multimodal document classification by embedding textual semantics directly into the visual representation of document images. It does so as a preprocessing step that color-codes words using per-character scores and a word-length multiplier, producing RGB masks that standard image classifiers can exploit without self-supervised pretraining. On Tobacco-3482, WordVIS yields notable gains on lightweight models (e.g., 3–5% with ResNet50) and sets a new best score of 91.14% with DocXClassifier-B while reducing parameters and latency, demonstrating practical viability for data-scarce settings. Qualitative heatmap analysis confirms that WordVIS directs model attention toward textual regions, supporting its utility as a lightweight, language-agnostic enhancement for document classification.
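The color-coding idea summarized above can be sketched in a few lines. This is only an illustrative sketch, not the authors' implementation: the summary does not specify the scoring function, so the per-character score (the Unicode code point), the way the word length multiplies the summed score, the folding into 24 bits, and the names `char_score`, `word_to_rgb`, and `render_mask` are all assumptions made for this example. Word boxes are assumed to come from an OCR engine as (left, top, right, bottom) pixel coordinates.

```python
from typing import List, Tuple

Box = Tuple[int, int, int, int]  # (left, top, right, bottom); right/bottom exclusive
RGB = Tuple[int, int, int]


def char_score(ch: str) -> int:
    """Hypothetical per-character score: the character's Unicode code point."""
    return ord(ch)


def word_to_rgb(word: str) -> RGB:
    """Map a word to a deterministic RGB color.

    Assumed scheme (not taken from the paper): sum the per-character
    scores, multiply by the word length, and fold the result into 24 bits.
    """
    if not word:
        return (0, 0, 0)
    lowered = word.lower()  # case-insensitive, so "Memo" and "memo" match
    value = sum(char_score(c) for c in lowered) * len(lowered)
    value %= 1 << 24  # keep the value within three 8-bit channels
    return ((value >> 16) & 0xFF, (value >> 8) & 0xFF, value & 0xFF)


def render_mask(
    words: List[Tuple[str, Box]],
    width: int,
    height: int,
    background: RGB = (255, 255, 255),
) -> List[List[RGB]]:
    """Fill each word's bounding box with its color on a blank RGB canvas."""
    mask = [[background] * width for _ in range(height)]
    for text, (left, top, right, bottom) in words:
        color = word_to_rgb(text)
        # Clip the box to the canvas so out-of-range OCR boxes are harmless.
        for y in range(max(top, 0), min(bottom, height)):
            for x in range(max(left, 0), min(right, width)):
                mask[y][x] = color
    return mask


if __name__ == "__main__":
    # Two hypothetical OCR words on a tiny 8x4 canvas.
    ocr_words = [("Memo", (0, 0, 4, 2)), ("fax", (4, 2, 8, 4))]
    mask = render_mask(ocr_words, width=8, height=4)
    print(word_to_rgb("Memo"), mask[0][0], mask[3][7])
```

Because the mapping is deterministic, the same word receives the same color wherever it appears, which is what would let a purely image-based classifier pick up on recurring vocabulary. In this simplified scheme short words land mostly in the green and blue channels; a practical scoring function would presumably spread colors more evenly.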
Abstract
Document classification is a critical element of automated document processing systems. In recent years, multi-modal approaches have become increasingly popular for this task. Despite their improved accuracy, these approaches remain underutilized in industry because they require large volumes of training data and extensive computational power. In this paper, we address these issues by embedding textual features directly into the visual space, allowing lightweight image-based classifiers to achieve state-of-the-art document classification results on small-scale datasets. To evaluate the visual features generated by our approach under limited data, we test on the standard Tobacco-3482 dataset. Our experiments show a substantial improvement for image-based classifiers, including a gain of 4.64% with ResNet50 and no document pre-training. The approach also sets a new best accuracy on Tobacco-3482, 91.14%, using the image-based DocXClassifier with no document pre-training. The simplicity of the approach, its modest resource requirements, and these results make it a promising candidate for industrial use cases.
