Enriching Social Science Research via Survey Item Linking

Tornike Tsereteli, Daniel Ruffinelli, Simone Paolo Ponzetto

TL;DR

This paper tackles Survey Item Linking (SIL), a two-stage task that maps implicit mentions of survey items in social science publications to survey items in a knowledge base. It introduces a high-quality bilingual dataset (SILD) of 20,454 English and German sentences, with a rich annotation scheme that distinguishes latent concepts from mentions of specific survey items and so enables fine-grained evaluation of mention detection and entity disambiguation. The authors benchmark a range of classical and neural methods, develop domain-adapted language models for social science text, and demonstrate that linking is feasible, but that errors propagate from the first stage to the second and that context matters, especially for implicit mentions. They further create a large knowledge base (GSIM) of survey item metadata and show that synthetic data and domain-specific sentence embeddings can significantly boost performance, with substantial recall gains when the top-k predictions are considered. The work lays the groundwork for end-to-end SIL and highlights practical implications for findability and interoperability in social science research, with open data and code to support future development.

Abstract

Questions within surveys, called survey items, are used in the social sciences to study latent concepts, such as the factors influencing life satisfaction. Instead of using explicit citations, researchers paraphrase the content of the survey items they use in-text. However, this makes it challenging to find survey items of interest when comparing related work. Automatically parsing and linking these implicit mentions to survey items in a knowledge base can provide more fine-grained references. We model this task, called Survey Item Linking (SIL), in two stages: mention detection and entity disambiguation. Due to an imprecise definition of the task, existing datasets used for evaluating SIL performance are too small and of low quality. We argue that latent concepts and survey item mentions should be differentiated. To this end, we create a high-quality and richly annotated dataset consisting of 20,454 English and German sentences. By benchmarking deep learning systems for each of the two stages independently and sequentially, we demonstrate that the task is feasible, but observe that errors propagate from the first stage, leading to lower overall task performance. Moreover, mentions that require the context of multiple sentences are more challenging for models to identify in the first stage. Modeling the entire context of a document and combining the two stages into an end-to-end system could mitigate these problems in future work, and errors could additionally be reduced by collecting more diverse data and by improving the quality of the knowledge base. The data and code are available at https://github.com/e-tornike/SIL .
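
The two-stage formulation and the error propagation described in the abstract can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the function names, the toy knowledge base, the cue-word detector, and the Jaccard-overlap ranker are hypothetical stand-ins, not the models or data used in the paper.

```python
def tokens(text: str) -> set[str]:
    """Lowercase word set with surrounding punctuation stripped."""
    stripped = (w.strip(".,;:()").lower() for w in text.split())
    return {w for w in stripped if w}


def detect_mentions(sentences: list[str], cue_words: set[str]) -> list[str]:
    """Stage 1 (mention detection): keep sentences containing a cue word.
    A toy stand-in for a trained sentence classifier."""
    return [s for s in sentences if tokens(s) & cue_words]


def rank_items(sentence: str, knowledge_base: dict[str, str], k: int) -> list[str]:
    """Stage 2 (entity disambiguation): return the k item IDs whose labels
    have the highest Jaccard overlap with the sentence (toy ranker)."""
    s = tokens(sentence)

    def score(item_id: str) -> float:
        t = tokens(knowledge_base[item_id])
        union = s | t
        return len(s & t) / len(union) if union else 0.0

    return sorted(knowledge_base, key=score, reverse=True)[:k]


def link(sentences, knowledge_base, cue_words, k=1):
    """Sequential pipeline: only sentences that pass stage 1 reach stage 2."""
    detected = detect_mentions(sentences, cue_words)
    return {s: rank_items(s, knowledge_base, k) for s in detected}


# Hypothetical knowledge base: item ID -> item label.
kb = {
    "item_a": "frequency of physical exercise per week",
    "item_b": "satisfaction with life in general",
    "item_c": "trust in national parliament",
}
sentences = [
    "Respondents reported how often they exercise per week.",
    "Life satisfaction was also measured.",
    "The remainder of the paper is organized as follows.",
]
cue_words = {"respondents", "reported", "asked"}

links = link(sentences, kb, cue_words, k=1)
```

In this toy run the second sentence does mention a survey item, and the stage-2 ranker would link it correctly if asked directly, but the stage-1 detector drops it, so it never reaches disambiguation. This is the kind of first-stage error that the sequential evaluation in the abstract reports as lowering overall performance.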

Paper Structure

This paper contains 74 sections, 3 equations, 13 figures, 27 tables.

Figures (13)

  • Figure 1: An example research publication (Kino2017, left) with sentences mentioning survey items and a list of candidate survey items from different surveys (including the survey ID, the item ID, and the item label, which summarizes its content). The arrows indicate that a sentence mentions a survey item and should be linked. Some survey items from different surveys in the list are very similar to each other (e.g., v356 and qa7a). These are difficult to disambiguate without document-level context or knowledge about relevant surveys
  • Figure 2: Two-stage pipeline for survey item linking. First, the MD stage identifies relevant sentences in a publication (left). The ED stage then links relevant survey items from surveys to each identified sentence (right)
  • Figure 3: An example of sentences from a research publication Kino2017 that studies the concept of health-related behaviour using a number of survey items. The arrows indicate that a sentence mentions a survey item (highlighted in yellow) and should be linked. Each survey item contains metadata, such as the survey ID, item ID, label (summarizing its content), question (posed to participants), relevant item category from a set of categories (supplementary detail that is used in combination with the question), and answer categories (one of which is used as a response)
  • Figure 4: An example of sentences from a research publication Kino2017 that studies the concept of health-related behaviour using a number of survey items. The highlighted sentences cover different diagnostic elements: concept (purple), explicit sentence type (blue), and implicit sentence type (red). The concept is linked with purple arrows to the sentences mentioning survey items that are used to operationalize it. Yellow arrows show the contextual dependence of implicit types, which require additional context to disambiguate
  • Figure 5: Comparison between SILD and GSIM with respect to the number of surveys published each year
  • ...and 8 more figures