Sparse Logistic Regression with High-order Features for Automatic Grammar Rule Extraction from Treebanks

Santiago Herrera, Caio Corro, Sylvain Kahane

TL;DR

The paper tackles automatic extraction of descriptive grammar rules from treebanks by formalizing rules as probabilistic constraints and using sparse logistic regression with a high-order feature space. It relies on a regularization path to rank rule saliency and evaluates on multiple languages (Spanish, French, Wolof, with English examples) to uncover both known and novel agreement and word-order patterns. The results show richer, more interpretable rule sets than prior work, with quantitative ranking supported by statistical tests and correlation analyses, contributing to bridging computational and theoretical linguistics and aiding language documentation.
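
As a rough illustration of how a regularization path can rank features by saliency, the Python sketch below fits an L1-penalized logistic regression at decreasing penalty strengths and records the penalty at which each feature's weight first becomes nonzero. Features that enter the model earlier (at a stronger penalty) are ranked higher. This is not the authors' implementation: the toy data, the feature names (`feat_strong`, `feat_weak`, `feat_noise`), the penalty grid, and the proximal-gradient (ISTA) solver are all assumptions made for this example.

```python
import math

# Toy data, invented for illustration. Each pattern is
# (feat_strong, feat_weak, feat_noise, label, count).
PATTERNS = [
    (1, 1, 1, 1, 7), (1, 1, 0, 1, 6), (1, 0, 1, 1, 2), (1, 0, 0, 1, 3),
    (0, 1, 1, 1, 1), (0, 1, 0, 1, 1),
    (1, 1, 1, 0, 1), (1, 0, 0, 0, 1),
    (0, 1, 1, 0, 2), (0, 1, 0, 0, 2), (0, 0, 1, 0, 7), (0, 0, 0, 0, 7),
]
FEATURES = ["feat_strong", "feat_weak", "feat_noise"]  # hypothetical names


def soft_threshold(value, threshold):
    """Proximal operator of the L1 penalty: shrink toward zero, clip at zero."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def fit_l1_logistic(X, y, lam, w, b, step=0.5, n_iter=300):
    """Minimize mean log-loss + lam * ||w||_1 by proximal gradient (ISTA).

    The intercept b is not penalized. (w, b) is the starting point, which
    allows warm starts along the path.
    """
    n = len(X)
    for _ in range(n_iter):
        grad_w = [0.0] * len(w)
        grad_b = 0.0
        for x_i, y_i in zip(X, y):
            z = b + sum(w_j * x_j for w_j, x_j in zip(w, x_i))
            err = 1.0 / (1.0 + math.exp(-z)) - y_i  # predicted prob. minus label
            grad_b += err
            for j, x_j in enumerate(x_i):
                grad_w[j] += err * x_j
        b -= step * grad_b / n
        w = [soft_threshold(w_j - step * g_j / n, step * lam)
             for w_j, g_j in zip(w, grad_w)]
    return w, b


def rank_by_entry(X, y, lambdas):
    """Return (feature index, entry penalty) pairs, earliest entry first."""
    w, b = [0.0] * len(X[0]), 0.0
    entry = {}
    for lam in sorted(lambdas, reverse=True):  # strong penalty first
        w, b = fit_l1_logistic(X, y, lam, w, b)  # warm start from previous fit
        for j, w_j in enumerate(w):
            if w_j != 0.0 and j not in entry:
                entry[j] = lam
    return sorted(entry.items(), key=lambda item: -item[1])


# Expand the weighted patterns into one row per observation.
X = [list(p[:3]) for p in PATTERNS for _ in range(p[4])]
y = [p[3] for p in PATTERNS for _ in range(p[4])]

ranking = [(FEATURES[j], lam)
           for j, lam in rank_by_entry(X, y, [0.3, 0.15, 0.05, 0.02, 0.005])]
for name, lam in ranking:
    print(f"{name} enters the model at lambda = {lam}")
```

In a treebank setting, each column would presumably encode a property of the nodes around a governor-dependent pair, and the label would encode the phenomenon under study (for example, whether the dependent precedes its governor). On this toy data the feature most strongly associated with the label enters the path first.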

Abstract

Descriptive grammars are highly valuable, but writing them is time-consuming and difficult. Furthermore, while linguists typically use corpora to create them, grammar descriptions often lack quantitative data. As for formal grammars, they can be challenging to interpret. In this paper, we propose a new method to extract and explore significant fine-grained grammar patterns and potential syntactic grammar rules from treebanks, in order to create an easy-to-understand corpus-based grammar. More specifically, we extract descriptions and rules across different languages for two linguistic phenomena, agreement and word order, using a large search space and paying special attention to the ranking order of the extracted rules. For that, we use a linear classifier to extract the most salient features that predict the linguistic phenomena under study. We associate statistical information to each rule, and we compare the ranking of the model's results to those of other quantitative and statistical measures. Our method captures both well-known and less well-known significant grammar rules in Spanish, French, and Wolof.

Paper Structure (18 sections, 8 equations, 2 figures, 3 tables)

Figures (2)

  • Figure 1: Example from the SUD version of the English GUM treebank. Two clauses with subjects in different positions. In the main clause, the subject follows the verb and the first position is filled by an expletive. In the subordinate clause, the subject occupies the dominant pre-verbal position.
  • Figure 2: The search space is defined around a governor node (gov) and a dependent node (dep), including the governor's governor (grandparent) and the governor's other dependents (siblings), as well as the dependent's children (grandchildren).

Theorems & Definitions (1)

  • Definition 1: Syntactic grammar rule