Subtopic-aware View Sampling and Temporal Aggregation for Long-form Document Matching

Youchao Zhou, Heyan Huang, Zhijing Wu, Yuhang Liu, Xinglin Wang

TL;DR

This work tackles long-form document matching by addressing the heterogeneity of subtopics within documents. It introduces the SST framework, combining subtopic-aware view sampling with temporal aggregation to learn from multiple, representative views progressively during training. Key innovations include two subtopic discovery variants (direct and adaptive clustering) with dedicated losses to emphasize alignment and complementarity, and three view-sampling strategies that generate diverse yet informative views. Empirical results on news duplication and legal case retrieval demonstrate that SST yields consistent improvements over strong baselines, validating its ability to capture detailed, topic-specific matching signals and integrate them effectively over time.

Abstract

Long-form document matching aims to judge the relevance between two documents and has been applied to various scenarios. Most existing works use hierarchical or long-context models to process documents; these achieve a coarse understanding but may ignore details. Some researchers construct a document view from similar sentences about aligned document subtopics to focus on detailed matching signals. However, a long document generally contains multiple subtopics, so the matching signals are heterogeneous across topics. Considering only the homologous aligned subtopics may not be representative enough and may cause biased modeling. In this paper, we introduce a new framework to model representative matching signals. First, we propose to capture various matching signals through the subtopics of document pairs. Next, we construct multiple document views based on subtopics to cover heterogeneous and valuable details. However, existing spatial aggregation methods such as attention, which integrate all these views simultaneously, struggle to integrate heterogeneous information. Instead, we propose temporal aggregation, which integrates different views gradually as training progresses. Experimental results show that our learning framework is effective on several document-matching tasks, including news duplication and legal case retrieval.
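
To make the contrast between spatial and temporal aggregation concrete, the following is a minimal sketch and not the authors' implementation. It assumes subtopic labels for each sentence are already available (the paper's direct and adaptive clustering variants, its three sampling strategies, and its losses are not reproduced here), and all function names (`group_by_subtopic`, `sample_view`, `views_over_epochs`) are hypothetical. The idea illustrated is only that a different subtopic-based view is presented at each training epoch, rather than all views being fused in a single forward pass.

```python
import random
from collections import defaultdict


def group_by_subtopic(sentences, labels):
    """Group sentences by a precomputed subtopic label (labels are assumed given)."""
    groups = defaultdict(list)
    for sent, lab in zip(sentences, labels):
        groups[lab].append(sent)
    return dict(groups)


def sample_view(groups, subtopics, per_topic, rng):
    """Build one document view: up to `per_topic` sentences from each chosen subtopic."""
    view = []
    for t in subtopics:
        pool = groups[t]
        view.extend(rng.sample(pool, min(per_topic, len(pool))))
    return view


def views_over_epochs(groups, aligned, n_epochs, per_topic=1, seed=0):
    """Yield one view per epoch (hypothetical schedule, not the paper's).

    Each view pairs the aligned subtopic with one complementary subtopic,
    cycling through the complementary subtopics across epochs, so the model
    would see different views over time instead of all at once.
    """
    rng = random.Random(seed)
    complementary = [t for t in groups if t != aligned]
    for epoch in range(n_epochs):
        topics = [aligned]
        if complementary:
            topics.append(complementary[epoch % len(complementary)])
        yield epoch, sample_view(groups, topics, per_topic, rng)


# Toy usage: four sentences with assumed subtopic labels 0, 0, 1, 2.
groups = group_by_subtopic(["a1", "a2", "b1", "c1"], [0, 0, 1, 2])
for epoch, view in views_over_epochs(groups, aligned=0, n_epochs=3):
    print(epoch, view)
```

In an actual training loop, each yielded view would be encoded and scored by the matching model for that epoch; the cyclic schedule here is a simplification chosen for readability.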

Paper Structure

This paper contains 25 sections, 12 equations, 9 figures, 4 tables.

Figures (9)

  • Figure 1: An example (translated from Chinese) of the news duplication task. The two documents describe different news events (the label is “negative”) but have aligned main subtopics. Existing methods pre-select sentences to form a single extracted summary (document view) and assume that a local view containing the most aligned part is salient and sufficient for the task. Learning such a homologous view from similar sentences may mislead the model. In contrast, representative views that combine aligned subtopics with proper complementary subtopics indicate a mismatched relationship.
  • Figure 2: The comparison of different model architectures.
  • Figure 3: The process of our SST framework. It contains two strategies: subtopic-aware view sampling and temporal aggregation. The first strategy contains two steps: the document pair subtopic discovery and the view sampling.
  • Figure 4: Performance of typical models on document pairs of different lengths (# words).
  • Figure 5: The impact of receptive fields of different models.
  • ...and 4 more figures