Learning Page Order in Shuffled WOO Releases

Efe Kahraman, Giulio Tosato

TL;DR

This work addresses recovering the chronological page order of shuffled Dutch WOO releases, a task complicated by heterogeneous content and scarce per-page metadata. It evaluates five permutation-family approaches, including pointer networks, seq2seq transformers, and pairwise ranking transformers, under universal, curriculum, and length-specialized training regimes. A length-specialized, non-autoregressive pairwise ranking transformer performs best on documents of up to 15 pages, with Kendall's $\tau$ reaching $0.953$ on 2–5 pages and $0.722$ on 11–15 pages. In contrast, seq2seq baselines fail to generalize to long documents ($\tau$ drops to $0.014$ on 21–25 pages), and curriculum learning underperforms direct training. These findings highlight the importance of length-aware architectures and training strategies for ordering pages in heterogeneous, multi-type document collections, and point to future work on multimodal embeddings and length-extrapolating encodings.

Abstract

We investigate document page ordering on 5,461 shuffled WOO documents (Dutch freedom of information releases) using page embeddings. These documents are heterogeneous collections such as emails, legal texts, and spreadsheets compiled into single PDFs, where semantic ordering signals are unreliable. We compare five methods, including pointer networks, seq2seq transformers, and specialized pairwise ranking models. The best-performing approach successfully reorders documents of up to 15 pages, with Kendall's tau ranging from 0.95 for short documents (2–5 pages) to 0.72 for 15-page documents. We observe two unexpected failures: seq2seq transformers fail to generalize to long documents (Kendall's tau drops from 0.918 on 2–5 pages to 0.014 on 21–25 pages), and curriculum learning underperforms direct training by 39% on long documents. Ablation studies suggest that learned positional encodings are one contributing factor to the seq2seq failure, though the degradation persists across all encoding variants, indicating multiple interacting causes. Attention pattern analysis reveals that short and long documents require fundamentally different ordering strategies, which explains why curriculum learning fails. Model specialization achieves substantial improvements on longer documents (+0.21 tau).
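
Because every result above is reported as Kendall's tau, a small worked example of the metric may help. The sketch below uses the standard tie-free definition (concordant minus discordant page pairs, divided by the total number of pairs). It is an illustration only: the function name `kendall_tau` and the representation of pages as integer ids are assumptions, and the paper's own evaluation code is not shown in this summary.

```python
from itertools import combinations


def kendall_tau(predicted_order, true_order):
    """Kendall's tau between two orderings of the same page ids (no ties).

    Hypothetical helper for illustration; not the authors' evaluation code.
    Returns 1.0 for identical orderings and -1.0 for a fully reversed one.
    """
    if len(predicted_order) != len(true_order) or set(predicted_order) != set(true_order):
        raise ValueError("both orderings must contain the same page ids")
    n = len(true_order)
    if n < 2:
        raise ValueError("need at least two pages to compare orderings")

    # Rank of each page in the ground-truth order.
    true_rank = {page: rank for rank, page in enumerate(true_order)}
    # Ground-truth ranks, read off in the predicted sequence.
    ranks = [true_rank[page] for page in predicted_order]

    concordant = 0
    discordant = 0
    for i, j in combinations(range(n), 2):
        # i < j in the prediction; the pair is concordant if the
        # ground truth also places page i before page j.
        if ranks[i] < ranks[j]:
            concordant += 1
        else:
            discordant += 1

    total_pairs = n * (n - 1) / 2
    return (concordant - discordant) / total_pairs
```

For a four-page document with one adjacent swap, for example predicting `[0, 1, 3, 2]` against the true order `[0, 1, 2, 3]`, five of the six page pairs are concordant and one is discordant, so tau is (5 − 1)/6 ≈ 0.667. A value near zero, such as the 0.014 reported for seq2seq on 21–25 pages, means the predicted order agrees with the true order on about half of the page pairs, which is what a random shuffle would be expected to give.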

Paper Structure (14 sections, 2 equations, 4 figures, 2 tables)

Figures (4)

  • Figure 1: Kendall's $\tau$ by method and document length range.
  • Figure 2: Short-document (2–5 pages) vs. long-document (21–25 pages) performance. Points below the diagonal indicate failure to scale.
  • Figure 3: Seq2seq transformer positional encoding ablations. Left: Absolute Kendall's $\tau$ by document length. Right: Relative improvement over the learned baseline. Notably, removing positional encodings helps on medium-length documents but fails on long documents.
  • Figure 4: Training dynamics for seq2seq positional encoding variants. Validation Kendall's $\tau$ across epochs shows high oscillation for all variants, with sinusoidal encodings providing the most stable training.