To be Continuous, or to be Discrete, Those are Bits of Questions

Yiran Wang, Masao Utiyama

TL;DR

This work introduces binary, bit-level outputs for structured prediction by extending CKY to handle binary labels and by formulating a span-marginal similarity that combines label and structural information. It unifies parsing and hashing under a single structured contrastive learning objective, deploying a max-based instance selection loss to overcome the geometric center issue. Empirical results on constituency parsing and nested NER show competitive performance using only a small number of bits (around 12 for parsing and 8 for NER), underscoring memory and efficiency gains and revealing implicit label clustering within codes. The approach offers a versatile pathway for bridging continuous deep learning representations with the discrete nature of natural language, with potential impact on scalable, interpretable NLP models.

Abstract

Recently, binary representation has been proposed as a novel representation that lies between continuous and discrete representations. It preserves a considerable amount of information when used to replace continuous input vectors. In this paper, we investigate the feasibility of introducing it to the output side as well, so that models output binary labels instead. To preserve structural information on the output side along with label information, we extend the previous contrastive hashing method into structured contrastive hashing. More specifically, we upgrade CKY from label-level to bit-level, define a new similarity function based on span marginal probabilities, and introduce a novel contrastive loss function with a carefully designed instance selection strategy. Our model achieves competitive performance on various structured prediction tasks and demonstrates that binary representation can further bridge the gap between the continuous nature of deep learning and the intrinsically discrete nature of natural languages.

Paper Structure

This paper contains 18 sections, 18 equations, 3 figures, 5 tables.

Figures (3)

  • Figure 1: The model architecture. The attention hash layer produces span scores (pink circles). Only the upper triangular part of these scores is used and fed into the bit-level CKY to obtain the marginal probabilities of all valid spans (purple circles). During training, only the spans on the target trees are selected for structured contrastive hashing, and the other spans are left unused (transparent purple circles). During inference, as shown at the bottom, the model parses sentences by returning trees with label codes (hexadecimal numbers), which are then translated back to the original labels.
  • Figure 2: An example of the geometric center issue. Orange circles are the positives of the black-circle instance, and the dotted orange circle is their geometric center. The difference between $\mathcal{L}_{\text{sup}}$ and our $\mathcal{L}_{\text{max}}$ is that $\mathcal{L}_{\text{max}}$ targets the closest positive instead of the geometric center of all positives.
  • Figure 3: Examples of the hashing and constituency parsing results. The hexadecimal numbers in the brackets indicate the generated binary codes, and the span labels are translated from them.
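
The geometric center issue described in Figure 2 can be illustrated with a toy calculation. The sketch below is not the paper's implementation: the function names `loss_sup` and `loss_max`, the softmax over raw similarity scores, and the example numbers are all assumptions made for illustration. It only contrasts averaging the log-probability over every positive with taking the log-probability of the single closest positive.

```python
import numpy as np


def log_softmax(scores):
    """Numerically stable log-softmax over a 1-D array of similarity scores."""
    shifted = scores - scores.max()
    return shifted - np.log(np.exp(shifted).sum())


def loss_sup(scores, positive_mask):
    """Assumed supervised-contrastive form: average the negative
    log-probability over ALL positives, which rewards being close to
    every positive at once (i.e. close to their geometric center)."""
    return -log_softmax(scores)[positive_mask].mean()


def loss_max(scores, positive_mask):
    """Assumed max-based form: use only the positive with the highest
    log-probability, i.e. the closest positive."""
    return -log_softmax(scores)[positive_mask].max()


# Hypothetical similarities between one anchor and five candidates
# (higher means closer). The first three candidates share the anchor's label.
scores = np.array([4.0, 0.5, -1.0, 0.0, -2.0])
positive_mask = np.array([True, True, True, False, False])

print("sup-style loss:", loss_sup(scores, positive_mask))
print("max-style loss:", loss_max(scores, positive_mask))
```

In this toy setting the anchor is already very close to one positive, so the max-style loss is small, while the averaged loss stays large because the other two positives are far away. With a single positive the two quantities coincide.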