Unsupervised Document and Template Clustering using Multimodal Embeddings

Phillipe R. Sampaio, Helene Maxcici

TL;DR

This paper tackles unsupervised organization of documents into categories and templates by evaluating a model-agnostic pipeline that projects last-layer multimodal encoder states into fixed-size document vectors for clustering. It systematically compares eight encoders (text-only, layout-aware, vision-only, and vision-language) across five heterogeneous corpora using four clustering algorithms ($k$-Means, DBSCAN, HDBSCAN + $k$-NN, and BIRCH), including an oracle-free tuning protocol. Key findings show that vision-centric encoders excel at clean template discovery but are brittle under covariate shift, whereas text signals provide orthogonal robustness; fused multimodal embeddings (e.g., LayoutLMv3, Donut, Gemma3, InternVL3) offer the best robustness–accuracy trade-off. The work provides a reproducible tuning protocol and detailed evaluation settings, offering practical guidance for scalable intelligent document processing and future research in unsupervised document clustering.

Abstract

We study unsupervised clustering of documents at both the category and template levels using frozen multimodal encoders and classical clustering algorithms. We systematize a model-agnostic pipeline that (i) projects heterogeneous last-layer states from text-layout-vision encoders into token-type-aware document vectors and (ii) performs clustering with centroid- or density-based methods, including an HDBSCAN + $k$-NN assignment to eliminate unlabeled points. We evaluate eight encoders (text-only, layout-aware, vision-only, and vision-language) with $k$-Means, DBSCAN, HDBSCAN + $k$-NN, and BIRCH on five corpora spanning clean synthetic invoices, their heavily degraded print-and-scan counterparts, scanned receipts, and real identity and certificate documents. The study reveals modality-specific failure modes and a robustness-accuracy trade-off, with vision features nearly solving template discovery on clean pages while text dominates under covariate shift, and fused encoders offering the best balance. We detail a reproducible, oracle-free tuning protocol and the curated evaluation settings to guide future work on unsupervised document organization.
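
The $k$-NN assignment step mentioned above can be illustrated with a minimal sketch. It assumes a density-based clusterer has already labeled each document vector, with `-1` marking unlabeled (noise) points, and gives each such point the majority label among its $k$ nearest labeled neighbors. The function name `assign_noise_by_knn`, the Euclidean distance, the majority vote, and the toy 2-D vectors are illustrative assumptions, not the authors' implementation.

```python
import math
from collections import Counter


def assign_noise_by_knn(points, labels, k=3, noise_label=-1):
    """Return labels where every noise point takes the majority label
    of its k nearest labeled neighbors (Euclidean distance).

    Hypothetical helper for illustration; the paper's exact assignment
    rule is not specified in this section.
    """
    # Only points that already belong to a cluster are allowed to vote.
    labeled = [(p, l) for p, l in zip(points, labels) if l != noise_label]
    result = list(labels)
    if not labeled:
        return result  # no clustered points, so nothing can be reassigned
    for i, (p, l) in enumerate(zip(points, labels)):
        if l != noise_label:
            continue
        # Sort clustered points by distance and keep the k closest.
        nearest = sorted(labeled, key=lambda item: math.dist(p, item[0]))[:k]
        votes = Counter(lbl for _, lbl in nearest)
        result[i] = votes.most_common(1)[0][0]
    return result


# Toy document vectors: two tight clusters plus two points left as noise.
points = [
    (0.0, 0.0), (0.1, 0.0), (0.0, 0.1),   # cluster 0
    (5.0, 5.0), (5.1, 5.0), (5.0, 5.1),   # cluster 1
    (0.3, 0.3), (4.6, 4.7),               # noise from the density step
]
labels = [0, 0, 0, 1, 1, 1, -1, -1]

final_labels = assign_noise_by_knn(points, labels, k=3)
print(final_labels)
```

After this step every document carries a cluster label, so cluster-level metrics can be computed over the full corpus rather than only over the points the density-based method was willing to label.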

Paper Structure

This paper contains 41 sections, 7 equations, 4 figures, 9 tables.

Figures (4)

  • Figure 1: Illustration of two levels of document clustering. In (a), documents are grouped based on their type, such as invoices, ID cards, and receipts. In (b), documents of the same type (e.g., invoices) are further clustered by their specific templates.
  • Figure 2: Clustering approach through multimodal embeddings.
  • Figure 3: Document processing pipeline of each pre-trained model during inference.
  • Figure 4: Two-dimensional t-SNE projections of the mixed-corpus dataset using the eight embedding models. Colors denote document categories; each plot therefore illustrates how well the corresponding representation separates distinct classes in the document-level clustering setting.