Do language models capture implied discourse meanings? An investigation with exhaustivity implicatures of Korean morphology

Hagyeong Shin, Sean Trott

TL;DR

The study investigates whether distributional semantics in large language models can encode discourse-level meanings associated with Korean Differential Object Marking (DOM), focusing on lul, nun, and null-marking. It conducts processing and production experiments across several models (KoGPT variants, Polyglot-Ko, GPT-3, ChatGPT), using surprisal, ratings, log-probabilities, and forced-choice tasks, analyzed with mixed-effects models. Findings show that some large models (notably GPT-3 and Polyglot-Ko-12B) exhibit partial sensitivity to exhaustivity implicatures, especially for the nun marker, but encoding dual meanings across markers remains inconsistent and challenging, with lul less likely to encode discourse meaning. The results suggest that distributional semantics alone provide only a baseline for discourse pragmatics in Korean DOM, and that improvements from scaling and human feedback show potential but do not fully replicate human-like discourse interpretation.

Abstract

Markedness in natural language is often associated with non-literal meanings in discourse. Differential Object Marking (DOM) in Korean is one instance of this phenomenon, where post-positional markers are selected based on both the semantic features of the noun phrases and the discourse features that are orthogonal to the semantic features. Previous work has shown that distributional models of language recover certain semantic features of words -- do these models capture implied discourse-level meanings as well? We evaluate whether a set of large language models are capable of associating discourse meanings with different object markings in Korean. Results suggest that discourse meanings of a grammatical marker can be more challenging to encode than that of a discourse marker.

Paper Structure (16 sections, 7 figures, 4 tables)

Figures (7)

  • Figure 1: Evaluations of discourse continuations where the exhaustivity status implied by lul and nun is canceled (with repeated continuations) or not canceled (with negated continuations), and where the previous sentence is logically contradicted (dashed line). Mean surprisals and 95% CIs gathered from the 6 models are presented on the left. Higher surprisals indicate that the model had lower expectations for encountering the continuation. On the right side, mean of z-transformed appropriateness ratings and 95% CIs from ChatGPT and 34 native Korean speakers are presented. Higher ratings indicate that the discourse continuation was evaluated as more felicitous. Stars indicate adjusted significance levels obtained from paired $t$-tests with Bonferroni corrections (****: $p < 0.001$, **: $p < 0.005$, *: $p < 0.05$).
  • Figure 2: Mean log probabilities assigned to lul-, nun-, and null-marked responses when non-exhaustive or exhaustive messages are intended. Error bars indicate 95% CIs. Dashed horizontal lines indicate the mean log probabilities assigned to contradictory responses, such as "Received only the medal" when a non-exhaustive message (both the medal and the trophy) is intended, or "Received both the trophy and the medal" when an exhaustive message (only the medal) is intended. Solid horizontal lines indicate the mean log probabilities assigned to verbatim responses, such as "Received both the trophy and the medal" when a non-exhaustive message (both the medal and the trophy) is intended, or "Received only the medal" when an exhaustive message (only the medal) is intended.
  • Figure 3: Proportions of responses elicited from ChatGPT and from 35 human participants. The left panels (marked = lul) summarize choices when the response sets included lul-marked and null-marked objects. Here, the 'marked' proportion, colored in red, indicates the proportion of lul-marked responses, while the 'unmarked' proportion, in blue, indicates the proportion of null-marked responses. The right panels (marked = nun) summarize choices when the response sets included nun-marked and null-marked objects. Here, the 'marked' proportion, colored in red, indicates the proportion of nun-marked responses, while the 'unmarked' proportion, in blue, indicates the proportion of null-marked responses.
  • Figure 4: The question and answer portions in the human experiment items were presented in an interface resembling that of mobile text messages. Each message appeared in a 3-second interval within a short video clip.
  • Figure 5: Raw ratings obtained from ChatGPT in Experiment 1 (1 = not appropriate at all, 7 = highly appropriate).
  • ...and 2 more figures
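
The captions for Figures 1 and 2 rely on two quantities: surprisal derived from a model's token log probabilities, and z-transformed ratings. The following is a minimal sketch of how such quantities could be computed, not the authors' code. The function names, the use of bits (log base 2), the summing of surprisal over the tokens of a continuation, and the per-rater standardization with the sample standard deviation are all assumptions made for illustration; the numbers are hypothetical.

```python
import math
from statistics import mean, stdev


def surprisal_bits(token_logprobs):
    """Total surprisal in bits, given natural-log probabilities of each token.

    Assumption: surprisal is summed over the tokens of a continuation.
    """
    return -sum(token_logprobs) / math.log(2)


def z_transform(ratings):
    """Standardize one rater's ratings to mean 0 and sample SD 1.

    Assumption: z-scores are computed within each rater.
    """
    m = mean(ratings)
    s = stdev(ratings)
    return [(r - m) / s for r in ratings]


# Hypothetical two-token continuation with probabilities 0.5 and 0.25.
continuation_logprobs = [math.log(0.5), math.log(0.25)]
print(surprisal_bits(continuation_logprobs))  # approximately 3 bits

# Hypothetical ratings from one rater on the 1-7 scale.
print(z_transform([1, 4, 7]))
```

Higher surprisal corresponds to a lower model expectation for the continuation, which is the reading used in Figure 1. Standardizing ratings puts raters who use different parts of the 1-7 scale on a common footing.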