Latent Factor Models Meets Instructions: Goal-conditioned Latent Factor Discovery without Task Supervision

Zhouhang Xie, Tushar Khot, Bhavana Dalvi Mishra, Harshit Surana, Julian McAuley, Peter Clark, Bodhisattwa Prasad Majumder

TL;DR

Instruct-LF addresses the challenge of discovering goal-aligned, interpretable latent factors from unstructured data by coupling LLM-driven property proposals with gradient-based latent factor modeling. The framework first builds a data-property matrix, using a per-data-point property proposal step and a dual-embedding link predictor, then clusters properties into latent factors using Linear CorEx, yielding interpretable, task-relevant concepts. Across movie dialogues, ALFWorld navigation logs, and American bill documents, Instruct-LF improves downstream task performance and outperforms state-of-the-art baselines, with human evaluators favoring its factors and groupings. The approach reduces reliance on strong LLM reasoning, scales to large noisy datasets, and offers a practical pipeline for goal-conditioned pattern discovery.

Abstract

Instruction-following LLMs have recently allowed systems to discover hidden concepts from a collection of unstructured documents based on a natural language description of the purpose of the discovery (i.e., the goal). Still, the quality of the discovered concepts remains mixed, as it depends heavily on the LLM's reasoning ability and drops when the data is noisy or beyond the LLM's knowledge. We present Instruct-LF, a goal-oriented latent factor discovery system that integrates the LLM's instruction-following ability with statistical models to handle large, noisy datasets where LLM reasoning alone falls short. Instruct-LF uses LLMs to propose fine-grained, goal-related properties from documents, estimates their presence across the dataset, and applies gradient-based optimization to uncover hidden factors, where each factor is represented by a cluster of co-occurring properties. We evaluate latent factors produced by Instruct-LF on movie recommendation, text-world navigation, and legal document categorization tasks. These interpretable representations improve downstream task performance by 5-52% over the best baselines and were preferred 1.8 times as often as the best alternative, on average, in human evaluation.
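
The idea that a factor is "a cluster of co-occurring properties" can be illustrated with a minimal sketch. It assumes a tiny hand-made document-property matrix and hypothetical property names, and it swaps in a much simpler technique than the paper's: properties are linked when their presence columns are correlated above a threshold, and each connected component is treated as one factor. This is not the gradient-based Linear CorEx optimization that Instruct-LF uses; it only shows the input and output shapes of the grouping step.

```python
import numpy as np


def group_cooccurring_properties(presence, threshold=0.5):
    """Group properties (columns) whose presence across documents is correlated.

    presence: (n_docs, n_props) array of 0/1 indicators or real-valued scores.
    Returns a list of sorted column-index lists, one per group.
    Simplified stand-in for a latent factor model (assumption, not the paper's method).
    """
    corr = np.corrcoef(presence, rowvar=False)  # property-by-property correlation
    linked = corr > threshold                   # adjacency between properties
    seen, groups = set(), []
    for start in range(corr.shape[0]):
        if start in seen:
            continue
        # Depth-first search collects one connected component of linked properties.
        stack, group = [start], []
        seen.add(start)
        while stack:
            i = stack.pop()
            group.append(i)
            for j in np.flatnonzero(linked[i]):
                j = int(j)
                if j not in seen:
                    seen.add(j)
                    stack.append(j)
        groups.append(sorted(group))
    return groups


# Hypothetical property names and toy data: rows are documents, columns are properties.
properties = ["likes horror", "wants jump scares", "likes romance", "wants happy ending"]
presence = np.array([
    [1, 1, 0, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 0],
])

groups = group_cooccurring_properties(presence)
for factor_id, members in enumerate(groups):
    print(factor_id, [properties[m] for m in members])
```

On this toy matrix the two horror-related properties end up in one group and the two romance-related properties in another. In the actual system, the matrix entries would come from the estimated compatibility between each document and each LLM-proposed property rather than from hand-written indicators.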

Paper Structure

This paper contains 43 sections, 5 equations, 8 figures, 16 tables.

Figures (8)

  • Figure 1: An example of the output of our system, Instruct-LF, on discovering different types of user interest about movies from a conversation corpus.
  • Figure 2: The proposed framework. Instruct-LF generates a set of natural language property descriptions from data, i.e., documents (1a); then estimates the compatibility between each data point and each property (1b); and performs correlation-based grouping of properties to discover latent factors (2). The compatibility between each property is efficiently computed using a distilled dense text representation model. We provide additional details and examples in \ref{fig:our_framework_detailed} and \ref{sec:our_framework_detailed}.
  • Figure 3: LLMs are better property proposers than generators for GPT-3.5. See \ref{fig:llm_are_better_generator_than_assigner_4o} for results on GPT-4o, where the trend is consistent.
  • Figure 4: The proposed framework with concrete examples. See \ref{sec:our_framework_detailed} for discussion.
  • Figure 5: The instruction for our main human evaluation, with results shown in \ref{tab:human_relevance_informativeness}.
  • ...and 3 more figures