A Decade of News Forum Interactions: Threaded Conversations, Signed Votes, and Topical Tags

Emma Fraxanet, Vicenç Gómez, Andreas Kaltenbrunner, Max Pellert

TL;DR

A large-scale, longitudinal dataset capturing user activity on the online platform of DerStandard, a major Austrian newspaper, providing structured conversation threads, explicit up- and downvotes of user comments and editorial topic labels, enabling rich analyses of online discourse while preserving user privacy.

Abstract

We present a large-scale, longitudinal dataset capturing user activity on the online platform of DerStandard, a major Austrian newspaper. The dataset spans ten years (2013-2022) and includes over 75 million user comments, more than 400 million votes, and detailed metadata on articles and user interactions. It provides structured conversation threads, explicit up- and downvotes of user comments, and editorial topic labels, enabling rich analyses of online discourse while preserving user privacy. To protect that privacy, all persistent identifiers are anonymized using salted hash functions, and the raw comment texts are not publicly shared. Instead, we release pre-computed vector representations derived from a state-of-the-art embedding model. The dataset supports research on discussion dynamics, network structures, and semantic analyses in German, a mid-resourced language, offering a reusable resource across computational social science and related fields.
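
As an illustration of the anonymization step mentioned in the abstract, the following is a minimal sketch of salted hashing. It is not the authors' pipeline: the choice of SHA-256, the 16-byte salt, the function names `make_salt` and `pseudonymize`, and the example identifiers are all assumptions made for this example.

```python
import hashlib
import secrets


def make_salt(n_bytes: int = 16) -> bytes:
    """Draw a random salt; it must stay secret and is never released."""
    return secrets.token_bytes(n_bytes)


def pseudonymize(identifier: str, salt: bytes) -> str:
    """Map a persistent identifier to a hex digest of salt + identifier.

    SHA-256 is an assumption here; the source only says "salted hash functions".
    """
    return hashlib.sha256(salt + identifier.encode("utf-8")).hexdigest()


# Hypothetical raw identifiers; the real ones are not part of the release.
salt = make_salt()
raw_user_ids = ["user_1001", "user_1002", "user_1001"]
hashed = [pseudonymize(u, salt) for u in raw_user_ids]

# The same identifier always maps to the same pseudonym under one salt,
# so replies and votes can still be linked to a consistent (anonymous) user.
same_user_matches = hashed[0] == hashed[2]
different_users_differ = hashed[0] != hashed[1]
```

Because the salt is kept secret, someone who knows a user name cannot recompute the pseudonym, while records belonging to the same user remain linkable within the dataset.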

Paper Structure

This paper contains 7 sections, 6 figures, 1 table.

Figures (6)

  • Figure 1: Example of the DerStandard comment platform interface. The screenshot shows a user comment with its associated voting summary in the upper right corner, indicating the number of upvotes (green) and downvotes (red). By hovering over the bar, users can see a breakdown of who voted and how (up- or downvote). Comments are displayed in threaded format with timestamps and reply links. We blurred user names and blacked out comment bodies and titles in this example. Users can also share and flag a comment as inappropriate (by clicking on the share and flag icons, respectively).
  • Figure 2: Most discussed topics across DerStandard's internal tag hierarchy. The diagram shows the internal tag hierarchy used by DerStandard editors to classify articles, highlighting the tags associated with the highest total number of user comments between 2013 and 2022. Flows between columns indicate how general categories are structured into more specific topics, proportional to the number of comments going to each subtopic. Node size indicates the volume of comments in each category. We color nodes according to the top-level category that they belong to. The "other" category represents any top- or mid-level tags that are not in the diagram due to having significantly lower comment counts than the ones displayed.
  • Figure 3: Example forum conversations from three different perspectives. We illustrate two thread discussions on the platform using three visualizations: (I) the thread structure, showing the hierarchical relation between comments (with comments by the original comment author in green); (II) the reply network among users (highlighting the original post author in green); and (III) the vote network between users, including both commenting users (in yellow) and users who only voted (in gray), as well as the original author (in green). Example A shows a larger, mostly positive discussion, while Example B features a smaller, more decentralized discussion with a polarized voting pattern.
  • Figure 4: Overview of the data records included in the DerStandard dataset. The dataset is structured into several files based on record type and temporal granularity. A single Users file contains anonymized metadata per user, including activity statistics and voting behavior. Comments, Votes, and Embeddings are provided as monthly files, capturing all user-generated content, up- and downvotes, and pre-computed text embeddings, respectively. Articles files are aggregated yearly and include metadata such as timestamps and up to three editorial topic labels per article. Summary Files contain auxiliary information, including aggregate statistics, data quality annotations, and mappings to previous work. All variable names coincide with the column headers in the files. Additional explanations of some variables are provided in parentheses.
  • Figure 5: Validation of comment embeddings based on discussion structure. Cosine similarity distributions for different types of comment pairs. Similarity is highest for direct reply pairs, followed by comments within the same thread, and then comments under the same article (excluding direct replies). Cross-thread and random pairs show the lowest similarity, indicating that semantic proximity in the embeddings aligns with structural proximity in discussions.
  • ...and 1 more figure
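
The comparison described in the Figure 5 caption can be sketched as follows: compute cosine similarity for comment pairs grouped by their structural relation, then compare the groups. This is a toy illustration only. The three-dimensional vectors, the comment ids, and the pair groups are invented for the example, and the released embeddings are assumed to be plain numeric vectors keyed by comment id.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


# Hypothetical toy embeddings keyed by comment id (real vectors are much longer).
embeddings = {
    "c1": [1.0, 0.0, 0.0],
    "c2": [1.0, 1.0, 0.0],
    "c3": [0.0, 0.0, 1.0],
}

# Hypothetical pair groups mirroring two of the pair types named in the caption.
pairs = {
    "direct_reply": [("c1", "c2")],
    "random": [("c1", "c3")],
}

# Mean similarity per pair type; the caption reports full distributions instead.
mean_similarity = {
    kind: sum(cosine_similarity(embeddings[a], embeddings[b]) for a, b in group)
    / len(group)
    for kind, group in pairs.items()
}
```

In this toy setup the direct-reply pair scores higher than the random pair, which is the ordering the caption reports for the real data.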