Learning Soft Linear Constraints with Application to Citation Field Extraction

Sam Anzaroot, Alexandre Passos, David Belanger, Andrew McCallum

TL;DR

The paper tackles citation field extraction by introducing soft global constraints into a base CRF via dual decomposition, coupled with a learning procedure that assigns penalties to constraint violations. This Soft-DD approach enables automatic constraint selection from large candidate sets and provides optimization certificates, achieving notable gains over a chain-structured CRF on a challenging dataset. By integrating constraint templates ranging from singleton and pairwise to hierarchical and local BIO constraints, the method yields about an 18% reduction in error while maintaining practical runtime, and the penalty-learning process identifies the most impactful constraints. The approach is generalizable to other structured prediction tasks with global output regularities, offering a practical framework for leveraging rich, data-driven constraints without sacrificing tractability.

Abstract

Accurately segmenting a citation string into fields for authors, titles, etc. is a challenging task because the output typically obeys various global constraints. Previous work has shown that modeling soft constraints, where the model is encouraged, but not required, to obey the constraints, can substantially improve segmentation performance. On the other hand, for imposing hard constraints, dual decomposition is a popular technique for efficient prediction given existing algorithms for unconstrained inference. We extend the technique to perform prediction subject to soft constraints. Moreover, with a technique for performing inference given soft constraints, it is easy to automatically generate large families of constraints and learn their costs with a simple convex optimization problem during training. This allows us to obtain substantial gains in accuracy on a new, challenging citation extraction dataset.
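
The step from hard to soft constraints in dual decomposition can be illustrated with a minimal sketch. It assumes that each unit of constraint violation costs a fixed penalty (a hinge penalty), in which case the dual variable is simply capped at that penalty instead of growing without bound. Everything named here is hypothetical and chosen for illustration: the function soft_dd, its parameters, the fixed step size, the single "at most max_count occurrences of one label" constraint, and the use of independent per-token scores in place of a chain-structured CRF with Viterbi decoding.

```python
def soft_dd(scores, label, max_count, penalty, step=0.75, max_iters=50):
    """Label tokens under one soft constraint: count(label) <= max_count.

    scores[t][k] is the (assumed independent) score of label k at token t.
    penalty is the cost paid per unit of constraint violation.
    """
    lam = 0.0  # dual variable for the single constraint
    y = []
    for _ in range(max_iters):
        # Unconstrained inference on scores shifted by the dual variable.
        y = []
        for row in scores:
            adjusted = [s - lam if k == label else s for k, s in enumerate(row)]
            y.append(max(range(len(adjusted)), key=adjusted.__getitem__))
        # Positive when the constraint is violated, negative when slack.
        violation = y.count(label) - max_count
        # Projected subgradient step: a hard constraint would project onto
        # [0, inf); the soft version projects onto [0, penalty].
        new_lam = min(penalty, max(0.0, lam + step * violation))
        if new_lam == lam:
            break  # dual variable is stable, so stop
        lam = new_lam
    return y, lam


# Three tokens, two labels (0 = other, 1 = the constrained label).
scores = [[0.0, 2.0], [0.0, 1.0], [0.0, -1.0]]
strict_labels, strict_lam = soft_dd(scores, label=1, max_count=1, penalty=5.0)
lenient_labels, lenient_lam = soft_dd(scores, label=1, max_count=1, penalty=0.5)
```

With a large penalty the sketch behaves like a hard constraint and drops the weaker of the two occurrences of label 1. With a small penalty the dual variable saturates at the cap and the violating labeling is kept, because its score gain outweighs the cost of the violation.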

Paper Structure

This paper contains 18 sections, 14 equations, 2 figures, 3 tables, 2 algorithms.

Figures (2)

  • Figure 1: Example labeled citation
  • Figure 2: Two examples where imposing soft global constraints corrects field extraction errors. Soft-DD converged in 1 iteration on the first example and in 7 iterations on the second. When a reference cites a book rather than a section of the book, the correct label for the name of the book is title. In the first example, the baseline CRF incorrectly outputs booktitle, but this is fixed by Soft-DD, which penalizes outputs based on the constraint that booktitle should co-occur with an address label. In the second example, the unconstrained CRF output violates the constraint that title and status labels should not co-occur. The ground truth labeling also violates a constraint, namely that title and language labels should not co-occur. At convergence of the Soft-DD algorithm, the correct language label is predicted, which is possible only because the constraints are soft.