
In your own words: computationally identifying interpretable themes in free-text survey data

Jenny S Wang, Aliya Saperstein, Emma Pierson

Abstract

Free-text survey responses can provide nuance often missed by structured questions, but remain difficult to statistically analyze. To address this, we introduce In Your Own Words, a computational framework for exploratory analyses of free-text survey data that identifies structured, interpretable themes in free-text responses more precisely than previous computational approaches, facilitating systematic analysis. To illustrate the benefits of this approach, we apply it to a new dataset of free-text descriptions of race, gender, and sexual orientation from 1,004 U.S. participants. The themes our approach learns have three practical applications in survey research. First, the themes can suggest structured questions to add to future surveys by surfacing salient constructs -- such as belonging and identity fluidity -- that existing surveys do not capture. Second, the themes reveal heterogeneity within standardized categories, explaining additional variation in health, well-being, and identity importance. Third, the themes illuminate systematic discordance between self-identified and perceived identities, highlighting mechanisms of misrecognition that existing measures do not reflect. More broadly, our framework can be deployed in a wide range of survey settings to identify interpretable themes from free text, complementing existing qualitative methods.

Paper Structure

This paper contains 24 sections, 3 equations, 14 figures, 28 tables.

Figures (14)

  • Figure 1: Overview of data collection (a) and computational framework (b -- e). (a): To create the dataset analyzed via the In Your Own Words framework, we ask participants to describe their race, gender, and sexual orientation in free text. (b): We convert the free-text responses into embedding vectors that capture their semantic meaning but are not readily interpretable. (c): We then use a sparse autoencoder (SAE) to extract more interpretable dimensions from the embeddings; each dimension captures a recurring pattern in how identity is expressed, such as references to cultural heritage, language, or childhood experiences. (d): To produce a text interpretation of each dimension, we prompt a large language model (LLM) to identify the common theme among the free-text responses that score highly along that dimension. (e): We then use an LLM to annotate each free-text response for whether it contains each theme. Themes described in the illustration are abbreviated for space.
  • Figure 2: Free-text themes produced by our computational framework. Each row shows one free-text theme, and colored bars indicate the standardized category of respondents whose free-text responses contain the theme. While some themes are predominantly associated with a single category (e.g., "mentions never questioning sexual orientation" is mostly expressed by straight respondents), many themes cut across categories within each identity axis. Very rare standardized categories (with fewer than 10 respondents) are excluded. For space, free-text themes are abbreviated, and only a subset of themes are shown; the full text of all themes is provided in Figures \ref{fig:race-themes-barchart}–\ref{fig:sexual-orientation-themes}.
  • Figure S1: Self-described identity: Distribution of word counts in participants’ free-text responses about their race, gender, and sexual orientation. Dashed lines indicate the median word count for each identity axis. Annotated quotes highlight the richness of information the median respondent is able to provide.
  • Figure S2: Perceived identity: Distribution of word counts in participants’ free-text responses about their perceived race, gender, and sexual orientation. Dashed lines indicate the median word count for each identity axis. Annotated quotes highlight the richness of information the median respondent is able to provide.
  • Figure S3: Respondents with minority identities report greater added value from free text and write longer responses than individuals with non-minority identities. Across both panels, reference (non-minority) categories are White for race, both Cisgender Man and Cisgender Woman for gender, and Straight or heterosexual for sexual orientation; all comparisons are computed within identity axis. Left: Odds ratios from identity-specific logistic regressions predicting answering "Yes" to: "Do you feel that your free-text response provides important details about your [identity] that were not captured by the multiple-choice questions?" Right: Percent change in free-text word count relative to the corresponding reference category (within each identity axis). Categories with $\leq$15 responses are omitted from category-specific regressions due to unstable estimates but are included in the "Minority" vs. "Not Minority" regression.
  • ...and 9 more figures
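The sparse-autoencoder step described in Figure 1 (b -- d) can be sketched in a few lines. The code below is a minimal illustration, not the authors' implementation: it uses random vectors in place of real text embeddings, illustrative sizes and hyperparameters, and hand-rolled gradients for the standard SAE objective (reconstruction error plus an L1 sparsity penalty). The final step mimics how responses scoring highest on a latent dimension would be selected for LLM summarization.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-ins for response embeddings; the real pipeline embeds the
# free-text answers first. Sizes and hyperparameters here are illustrative.
n, d, k = 256, 32, 64            # responses, embedding dim, SAE dictionary size
X = rng.normal(size=(n, d))

W_enc = rng.normal(scale=0.1, size=(d, k))
b_enc = np.zeros(k)
W_dec = rng.normal(scale=0.1, size=(k, d))
l1, lr = 0.01, 0.01              # sparsity weight, learning rate

def forward(X):
    h = np.maximum(X @ W_enc + b_enc, 0.0)   # ReLU -> sparse, nonnegative codes
    return h, h @ W_dec                      # codes, reconstruction

def loss_fn(X, h, X_hat):
    err = X_hat - X
    return (err ** 2).sum(1).mean() + l1 * np.abs(h).sum(1).mean()

h, X_hat = forward(X)
init_loss = loss_fn(X, h, X_hat)

for _ in range(500):
    h, X_hat = forward(X)
    # Hand-rolled gradients for MSE reconstruction + L1 sparsity penalty.
    g_Xhat = 2 * (X_hat - X) / n
    g_h = (g_Xhat @ W_dec.T + l1 * np.sign(h) / n) * (h > 0)
    W_dec -= lr * (h.T @ g_Xhat)
    W_enc -= lr * (X.T @ g_h)
    b_enc -= lr * g_h.sum(0)

h, X_hat = forward(X)
loss = loss_fn(X, h, X_hat)

# Each latent dimension is a candidate theme direction: the responses that
# activate it most strongly are the ones an LLM would be asked to summarize.
top_for_dim0 = np.argsort(h[:, 0])[::-1][:5]
```

On real data, each of the k latent dimensions would group responses sharing a recurring pattern, and the top-activating responses per dimension (as in `top_for_dim0`, a hypothetical variable name) would be passed to the LLM for theme labeling (Figure 1d).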