CoGen: Learning from Feedback with Coupled Comprehension and Generation

Mustafa Omer Gul; Yoav Artzi

TL;DR

This work investigates coupling language comprehension and generation in a continual learning setting where a single model interacts with humans in two-player reference games. By combining joint inference and data sharing, and training with feedback signals via a contextual-bandit REINFORCE objective, the system achieves substantial, time-evolving gains in both comprehension and generation, while producing language that aligns more closely with human discourse. The approach leverages a pragmatic, RSA-inspired inference stance and introduces data-sharing across roles to inject human language into generator training, resulting in improved data efficiency and more diverse, human-like utterances. The findings demonstrate a viable pathway for interactive AI that learns from user interactions, with practical implications for scalable, human-aligned language systems.
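
To make the contextual-bandit REINFORCE objective mentioned above concrete, the following is a minimal sketch, not the authors' implementation. It assumes a listener policy that is a softmax over per-image logits, and a scalar reward of +1 for positive feedback and -1 for negative feedback; the reward values, the learning rate, and the function names (`reinforce_loss_and_grad`, `sgd_step`) are all hypothetical choices for illustration.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def reinforce_loss_and_grad(logits, action, reward):
    """Single-step (contextual bandit) REINFORCE.

    loss = -reward * log p(action | context)
    For a softmax policy, d loss / d logit_i = -reward * (1[i == action] - p_i).
    The reward scale (+1 / -1) is an assumption of this sketch.
    """
    probs = softmax(logits)
    loss = -reward * math.log(probs[action])
    grad = [-reward * ((1.0 if i == action else 0.0) - p)
            for i, p in enumerate(probs)]
    return loss, grad

def sgd_step(logits, grad, lr=0.5):
    """One plain gradient-descent step on the logits (hypothetical optimizer)."""
    return [z - lr * g for z, g in zip(logits, grad)]

# Three candidate images, uniform initial policy; the model picked image 1.
logits = [0.0, 0.0, 0.0]
loss_pos, grad_pos = reinforce_loss_and_grad(logits, action=1, reward=1.0)
p_after_pos = softmax(sgd_step(logits, grad_pos))[1]

# Same choice, but the interaction produced negative feedback.
loss_neg, grad_neg = reinforce_loss_and_grad(logits, action=1, reward=-1.0)
p_after_neg = softmax(sgd_step(logits, grad_neg))[1]
```

Under these assumptions, positive feedback raises the probability of the chosen image and negative feedback lowers it, which is the basic behavior a feedback-driven bandit update is meant to have.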

Abstract

Systems with both language comprehension and generation capabilities can benefit from the tight connection between the two. This work studies coupling comprehension and generation with a focus on continually learning from interaction with users. We propose techniques to tightly integrate the two capabilities for both learning and inference. We situate our studies in two-player reference games, and deploy various models for thousands of interactions with human users, while learning from interaction feedback signals. We show dramatic improvements in performance over time, with comprehension-generation coupling leading to performance improvements of up to 26% in absolute terms and up to 17% higher accuracies compared to a non-coupled system. Our analysis also shows that coupling has a substantial qualitative impact on the system's language, making it significantly more human-like.
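
One way to picture coupling the two capabilities at inference time is to let the comprehension side rerank candidate utterances proposed by the generation side. The sketch below is an illustration under stated assumptions, not the paper's procedure: the linear interpolation of log-probabilities, the `weight` parameter, and the function name `rerank_utterances` are hypothetical, and the scores are made-up numbers rather than model outputs.

```python
def rerank_utterances(candidates, speaker_logp, listener_logp_target, weight=0.5):
    """Rerank candidate descriptions with a joint speaker/listener score.

    speaker_logp[u]:          log-probability the generator assigns to utterance u
    listener_logp_target[u]:  log-probability the comprehension side assigns to
                              the target image when reading utterance u
    weight:                   hypothetical interpolation weight in [0, 1];
                              0 ignores the listener, 1 ignores the speaker
    """
    scored = []
    for utt in candidates:
        score = ((1.0 - weight) * speaker_logp[utt]
                 + weight * listener_logp_target[utt])
        scored.append((score, utt))
    # Highest joint score first.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [utt for _, utt in scored]

# Illustrative values only: a fluent but ambiguous description versus a
# less likely but more discriminative one.
candidates = ["the dog", "the brown dog on the left"]
speaker_logp = {"the dog": -1.0, "the brown dog on the left": -3.0}
listener_logp_target = {"the dog": -2.5, "the brown dog on the left": -0.1}

joint_best = rerank_utterances(candidates, speaker_logp, listener_logp_target)[0]
speaker_only_best = rerank_utterances(
    candidates, speaker_logp, listener_logp_target, weight=0.0)[0]
```

In this toy case, the speaker alone prefers the shorter description, while the joint score prefers the one the listener side is more likely to resolve to the target.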

Paper Structure (46 sections, 4 equations, 9 figures)

Introduction
Interaction Scenario and Overview
Interaction Scenario
Deployment
Inference and Learning
Evaluation
Continual Learning
Feedback Collection
Learning
Coupling Comprehension and Generation
Learning with Data Sharing
Joint Inference
Experimental Setup
Game Construction
Model and Initialization
...and 31 more sections

Figures (9)

Figure 1: Illustration of our reference game interaction scenario involving a speaker and listener. Each game includes a single turn. Speakers are assigned a target image and write a description such that their partner can guess the image from the description. The game succeeds if the listener guesses correctly. We deploy our models (gray bot) as speaker to interact with human listeners (top) or vice versa (bottom).
Figure 2: Illustration of our continual learning scenario with coupled comprehension and generation. The process alternates between interactions with human partners in a reference game, and training using learning signals from the interactions. The model performs both the generation (left) and comprehension (right) tasks, while jointly reasoning over the other role (thought bubbles). Training leverages feedback for the role the model performs as well as the opposing role. Following each round of training, we re-deploy the updated model and repeat the process.
Figure 3: Comprehension and generation performance for system variants across four rounds of deployment, with 95% confidence intervals. The top $x$-axis indicates the total number of interactions collected for a role up to the deployment round. Coupling comprehension and generation leads to Full outperforming all ablations throughout.
Figure 4: Model comprehension and generation accuracy when the speaker utterance includes words for spatial reasoning and when it does not.
Figure 5: Language analysis plots, with 95% confidence intervals. Trends in utterance length mirror those of humans when using data sharing (Full and No-JI). Full possesses the highest effective vocabulary size and produces the largest number of new words each round. The Full system additionally shows an increase in MAUVE scores ($\uparrow$) over time and exhibits the highest SND ($\uparrow$) throughout.
...and 4 more figures
