Oh That Looks Familiar: A Novel Similarity Measure for Spreadsheet Template Discovery
Anand Krishnakumar, Vengadesh Ravikumaran
TL;DR
This work tackles template discovery in spreadsheets by introducing a hybrid cell-level distance that jointly encodes spatial layout, data-type patterns, and semantic content. A weighted distance $d_c$ combines $d_{\text{spatial}}$, $d_{\text{type}}$, and $d_{\text{semantic}}$, and spreadsheet-level similarity is obtained by aggregating cell distances with either the Chamfer or the Hausdorff distance. On the FUSTE dataset, Chamfer-based similarity with the proposed embedding achieves perfect template reconstruction ($\text{ARI}=1.00$), surpassing the graph-based Mondrian baseline ($\text{ARI}=0.90$); the Hausdorff aggregation underperforms the Chamfer one. The approach enables scalable template discovery over large tabular collections for downstream tasks such as retrieval-augmented generation and automated data wrangling.
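One way to make these quantities concrete is sketched below. It assumes a linear weighting with non-negative weights $w_s$, $w_t$, $w_m$ and treats spreadsheets $A$ and $B$ as non-empty sets of cells; the weight symbols, the linear form, and the averaging in the Chamfer term are illustrative assumptions rather than the paper's stated definitions.

$$d_c(a, b) = w_s\, d_{\text{spatial}}(a, b) + w_t\, d_{\text{type}}(a, b) + w_m\, d_{\text{semantic}}(a, b)$$

$$D_{\text{Chamfer}}(A, B) = \frac{1}{|A|} \sum_{a \in A} \min_{b \in B} d_c(a, b) + \frac{1}{|B|} \sum_{b \in B} \min_{a \in A} d_c(a, b)$$

$$D_{\text{Hausdorff}}(A, B) = \max\left\{ \max_{a \in A} \min_{b \in B} d_c(a, b),\; \max_{b \in B} \min_{a \in A} d_c(a, b) \right\}$$

Under these definitions the Chamfer distance averages each cell's nearest-neighbour distance, whereas the Hausdorff distance is determined by the single worst-matched cell.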
Abstract
Traditional methods for identifying structurally similar spreadsheets fail to capture the spatial layouts and type patterns that define templates. To quantify spreadsheet similarity, we introduce a hybrid distance metric that combines semantic embeddings, data type information, and spatial positioning. Our method converts each spreadsheet into cell-level embeddings and then aggregates the distances between cells using the Chamfer or Hausdorff distance. Experiments across template families demonstrate superior unsupervised clustering performance compared to the graph-based Mondrian baseline, achieving perfect template reconstruction on the FUSTE dataset (Adjusted Rand Index of 1.00 versus 0.90). Our approach facilitates large-scale automated template discovery, which in turn enables downstream applications such as retrieval-augmented generation over tabular collections, model training, and bulk data cleaning.
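The pipeline described above (cell-level representation, a hybrid cell distance, then set-level aggregation) can be illustrated with a minimal Python sketch. Everything concrete here is an assumption made for illustration and is not taken from the paper: the `Cell` class, the Euclidean spatial term, the 0/1 type mismatch term, the cosine semantic term, the unit default weights, and the tiny hand-written embedding vectors standing in for a real embedding model.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Cell:
    """Hypothetical cell record: grid position, coarse data type, semantic vector."""
    row: int
    col: int
    dtype: str          # e.g. "text", "number", "date"
    embedding: tuple    # stand-in for a semantic embedding of the cell content


def cosine_distance(u, v):
    """1 - cosine similarity; returns 1.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    if norm_u == 0 or norm_v == 0:
        return 1.0
    return 1.0 - dot / (norm_u * norm_v)


def cell_distance(a, b, w_spatial=1.0, w_type=1.0, w_semantic=1.0):
    """Assumed linear combination of spatial, type, and semantic distances."""
    d_spatial = math.hypot(a.row - b.row, a.col - b.col)
    d_type = 0.0 if a.dtype == b.dtype else 1.0
    d_semantic = cosine_distance(a.embedding, b.embedding)
    return w_spatial * d_spatial + w_type * d_type + w_semantic * d_semantic


def nearest_distances(source, target, **weights):
    """For each cell in `source`, the distance to its closest cell in `target`."""
    return [min(cell_distance(a, b, **weights) for b in target) for a in source]


def chamfer(sheet_a, sheet_b, **weights):
    """Symmetric Chamfer distance (mean nearest-neighbour distance, both directions)."""
    ab = nearest_distances(sheet_a, sheet_b, **weights)
    ba = nearest_distances(sheet_b, sheet_a, **weights)
    return sum(ab) / len(ab) + sum(ba) / len(ba)


def hausdorff(sheet_a, sheet_b, **weights):
    """Symmetric Hausdorff distance (worst nearest-neighbour distance)."""
    ab = nearest_distances(sheet_a, sheet_b, **weights)
    ba = nearest_distances(sheet_b, sheet_a, **weights)
    return max(max(ab), max(ba))


# Two toy sheets (non-empty lists of cells): sheet_b is sheet_a plus one stray cell.
sheet_a = [
    Cell(0, 0, "text", (1.0, 0.0)),
    Cell(1, 0, "number", (0.0, 1.0)),
]
sheet_b = sheet_a + [Cell(10, 0, "text", (1.0, 0.0))]

print("Chamfer:  ", chamfer(sheet_a, sheet_b))
print("Hausdorff:", hausdorff(sheet_a, sheet_b))
```

In this toy example the single stray cell fully determines the Hausdorff value, while the Chamfer value dilutes it across all cells of the sheet. The resulting pairwise distances could then be passed to any clustering routine that accepts a precomputed distance matrix; the choice of clustering algorithm is left open here.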
