WordVIS: A Color Worth A Thousand Words

Umar Khan, Saifullah, Stefan Agne, Andreas Dengel, Sheraz Ahmed

TL;DR

WordVIS tackles the data- and compute-hungry nature of multimodal document classification by embedding textual semantics directly into the visual representation of document images. It does so as a preprocessing step that color-codes words using per-character scores and a word-length multiplier, producing RGB masks that standard image classifiers can exploit without self-supervised pretraining. On Tobacco-3482, WordVIS yields notable gains on lightweight models (e.g., 3–5% with ResNet50) and sets a new best score of 91.14% with DocXClassifier-B while reducing parameters and latency, demonstrating practical viability for data-scarce settings. Qualitative heatmap analysis confirms that WordVIS directs model attention toward textual regions, supporting its utility as a lightweight, language-agnostic enhancement for document classification.
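The color-coding idea can be illustrated with a minimal sketch. The summary above only states that words are colored using per-character scores and a word-length multiplier; the concrete scoring rule below (alphabet position per letter) and the way the product is split into R, G, and B bytes are assumptions made for illustration, not the authors' actual formula. The names `char_score` and `word_to_rgb` are hypothetical.

```python
from typing import Tuple


def char_score(ch: str) -> int:
    """Hypothetical per-character score: 1..26 for ASCII letters, 0 otherwise."""
    if ch.isascii() and ch.isalpha():
        return ord(ch.lower()) - ord("a") + 1
    return 0


def word_to_rgb(word: str) -> Tuple[int, int, int]:
    """Map a word to an RGB triple.

    Assumed rule (not taken from the paper): sum the per-character scores,
    multiply by the word length, and split the result into three bytes.
    """
    total = sum(char_score(c) for c in word) * len(word)
    value = total % (256 ** 3)  # keep the value within 24 bits
    red = (value >> 16) & 0xFF
    green = (value >> 8) & 0xFF
    blue = value & 0xFF
    return (red, green, blue)


if __name__ == "__main__":
    for w in ["the", "invoice", "classification"]:
        print(w, word_to_rgb(w))
```

Under this assumed rule, short common words land in a narrow band of small values while longer words spread across a wider range of colors. It does not attempt to reproduce the exact palette shown in the paper's figures.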

Abstract

Document classification is considered a critical element in automated document processing systems. In recent years, multi-modal approaches have become increasingly popular for document classification. Despite their improvements, these approaches are underutilized in industry because they require a tremendous volume of training data and extensive computational power. In this paper, we attempt to address these issues by embedding textual features directly into the visual space, allowing lightweight image-based classifiers to achieve state-of-the-art results on small-scale datasets in document classification. To evaluate the efficacy of the visual features generated by our approach on limited data, we tested on the standard Tobacco-3482 dataset. Our experiments show a substantial improvement in image-based classifiers, with a gain of 4.64% using ResNet50 with no document pre-training. The approach also sets a new record for the best accuracy on the Tobacco-3482 dataset, with a score of 91.14% using the image-based DocXClassifier with no document pre-training. The simplicity of the approach, its low resource requirements, and the resulting accuracy make it a good prospect for industrial use cases.

Paper Structure

This paper contains 20 sections, 2 equations, 6 figures, 2 tables.

Figures (6)

  • Figure 1: WordVIS as a pre-processing step for existing document classification models. Input document images are first passed through an OCR system to extract textual information. The textual features are then encoded within the visual space of the images using our approach. Finally, the pre-processed images are fed into a document classification model. Shown here is the DocXClassifier-B model.
  • Figure 2: Samples of different classes produced by WordVIS. Most of the textual data is masked with colors, while the non-textual elements of the document images are left unchanged.
  • Figure 3: In a WordVIS-colorized document, stop words and words similar to stop words take on a green color, whereas lengthier and more distinct words take on more distinct colors.
  • Figure 4: Sample heatmaps generated using DocXClassifier-B with WordVIS (left) and without WordVIS (right).
  • Figure 5: Heatmaps for the Form class. The heatmaps also give clues into how the network trained with WordVIS (left) focuses on boxes and the content inside them, as opposed to the base network (right).
  • ...and 1 more figure
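
The pre-processing pipeline described in the Figure 1 caption (OCR, then color encoding, then classification) can be sketched roughly as follows. This is an illustration under stated assumptions, not the authors' implementation: the OCR output format (a word plus a pixel bounding box), the choice to recolor only dark "ink" pixels inside each box, the threshold value, and the names `paint_word_boxes` and `color_fn` are all hypothetical. The image is a plain nested list of RGB tuples so that the sketch has no external dependencies.

```python
from typing import Callable, List, Tuple

Pixel = Tuple[int, int, int]
Box = Tuple[int, int, int, int]  # (left, top, right, bottom), right/bottom exclusive
Image = List[List[Pixel]]


def paint_word_boxes(
    image: Image,
    ocr_words: List[Tuple[str, Box]],
    color_fn: Callable[[str], Pixel],
    ink_threshold: int = 128,
) -> Image:
    """Return a copy of `image` with the dark pixels of each OCR word recolored.

    Assumption: pixels whose mean intensity is below `ink_threshold` are text.
    Pixels outside every word box are left untouched.
    """
    out = [row[:] for row in image]  # copy rows so the input is not modified
    for word, (left, top, right, bottom) in ocr_words:
        color = color_fn(word)
        for y in range(max(top, 0), min(bottom, len(out))):
            for x in range(max(left, 0), min(right, len(out[y]))):
                if sum(out[y][x]) / 3 < ink_threshold:
                    out[y][x] = color
    return out


# Tiny usage example: a 2x3 white image with one black "ink" pixel.
WHITE, BLACK = (255, 255, 255), (0, 0, 0)
page = [[WHITE, BLACK, WHITE], [WHITE, BLACK, WHITE]]
words = [("memo", (0, 0, 3, 1))]  # hypothetical OCR result covering the first row
colored = paint_word_boxes(page, words, lambda w: (10, 20, 30))
```

The recolored image would then be passed to an ordinary image classifier in place of the original scan. The second row in the usage example lies outside the word box, so its dark pixel keeps its original color.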