The Effectiveness of Graph Contrastive Learning on Mathematical Information Retrieval
Pei-Syuan Wang, Hung-Hsuan Chen
TL;DR
This work reframes mathematical information retrieval (MIR) as learning structure-aware representations of formulas via graph contrastive learning (GCL). Each formula is represented in two graph layouts, the Symbol Layout Tree (SLT) and the Operator Tree (OPT), and three GCL methods (InfoGraph, GraphCL, BGRL) are evaluated in a self-supervised setting. With cosine-based retrieval, the study demonstrates consistent improvements over the strong TangentCFT baseline on the NTCIR-12 MathIR Wikipedia Formula Browsing Task. Key findings are that the performance of each GCL variant depends on the layout, and that using TangentCFT node embeddings as input to the GCL framework is beneficial. The work provides a public codebase to promote further research and development in structure-aware MIR, with potential extensions to data augmentation and positive-pair generation via templates.
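To make the cosine-based retrieval step concrete, the following is a minimal sketch rather than the authors' implementation. It assumes each formula has already been encoded as a fixed-length vector (for example, by a trained GCL encoder). The function names, the toy index, and the three-dimensional embeddings are hypothetical and chosen only for illustration.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    # Treat a zero vector as having no similarity to anything.
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank_formulas(query_embedding, formula_embeddings, top_k=3):
    """Return the top_k (formula_id, score) pairs, highest cosine score first."""
    scored = [
        (formula_id, cosine_similarity(query_embedding, embedding))
        for formula_id, embedding in formula_embeddings.items()
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Hypothetical toy embeddings; real ones would come from a trained encoder.
toy_index = {
    "a^2 + b^2 = c^2": [0.9, 0.1, 0.0],
    "x^2 + y^2 = z^2": [0.8, 0.2, 0.1],
    "\\int_0^1 x dx": [0.0, 0.1, 0.9],
}
query = [1.0, 0.0, 0.0]
results = rank_formulas(query, toy_index, top_k=2)
```

In this sketch the ranking depends only on the angle between embedding vectors, not on their magnitudes, so two formulas whose embeddings differ only by a scale factor receive the same score against any query.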
Abstract
This paper presents an empirical study of Graph Contrastive Learning (GCL) for generating representations of mathematical equations, a critical component of Mathematical Information Retrieval (MIR). Our findings show that this simple approach consistently outperforms TangentCFT, the current leading formula retrieval model. To support ongoing research and development in this field, we have made our source code publicly available at https://github.com/WangPeiSyuan/GCL-Formula-Retrieval/.
