Interaction Theater: A case of LLM Agents Interacting at Scale

Sarath Shekkizhar, Adam Earle

TL;DR

This study analyzes how autonomous LLM agents interact at scale on Moltbook to determine whether surface-level conversation translates into substantive information exchange. By combining lexical metrics, embedding-based semantic analysis, and LLM-as-judge validation, it reveals an interaction theater: agents generate diverse, well-formed text that superficially resembles engagement but largely fails to contribute new information or align with posts. Information saturates rapidly, most comments are generic or off-topic, and threaded conversations are rare, indicating limited turn-taking and collaborative reasoning. The findings underscore the need for explicit coordination mechanisms, structured protocols, and grounding in multi-agent systems to transform parallel output into productive collaboration with practical impact.

Abstract

As multi-agent architectures and agent-to-agent protocols proliferate, a fundamental question arises: what actually happens when autonomous LLM agents interact at scale? We study this question empirically using data from Moltbook, an AI-agent-only social platform, with 800K posts, 3.5M comments, and 78K agent profiles. We combine lexical metrics (Jaccard specificity), embedding-based semantic similarity, and LLM-as-judge validation to characterize agent interaction quality. Our findings reveal that agents produce diverse, well-formed text that creates the surface appearance of active discussion, but the substance is largely absent. Specifically, while most agents ($67.5\%$) vary their output across contexts, $65\%$ of comments share no distinguishing content vocabulary with the post they appear under, and information gain from additional comments decays rapidly. LLM-judge-based metrics classify the dominant comment types as spam ($28\%$) and off-topic content ($22\%$). Embedding-based semantic analysis confirms that lexically generic comments are also semantically generic. Agents rarely engage in threaded conversation ($5\%$ of comments), defaulting instead to independent top-level responses. We discuss implications for multi-agent interaction design, arguing that coordination mechanisms must be explicitly designed; without them, even large populations of capable agents produce parallel output rather than productive exchange.
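To make the lexical metric concrete, here is a minimal sketch of how a Jaccard-based specificity score could be computed. The abstract does not define the metric, so the definition is an assumption: similarity to the comment's own post minus mean similarity to a set of other posts. The stopword list and the names `content_words`, `jaccard`, and `lexical_specificity` are hypothetical and are not taken from the paper.

```python
import re

# Hypothetical, deliberately tiny stopword list. The paper's actual
# content-word filter is not specified in this section.
STOPWORDS = {"the", "a", "an", "and", "or", "of", "to", "in", "is", "it",
             "this", "that", "for", "on", "with", "i", "you"}


def content_words(text):
    """Lowercase the text, tokenize on alphanumerics, and drop stopwords."""
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    return {tok for tok in tokens if tok not in STOPWORDS}


def jaccard(a, b):
    """Jaccard similarity of two sets; defined as 0.0 when both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def lexical_specificity(comment, post, other_posts):
    """Assumed definition: Jaccard similarity to the comment's own post
    minus the mean Jaccard similarity to a sample of other posts."""
    words = content_words(comment)
    own = jaccard(words, content_words(post))
    if not other_posts:
        return own
    baseline = sum(jaccard(words, content_words(p)) for p in other_posts)
    return own - baseline / len(other_posts)
```

Under this assumed definition, a comment such as "Great post, thanks for sharing!" scores zero against almost any post, which is the kind of generic comment the abstract describes as sharing no distinguishing content vocabulary with its post.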

Paper Structure (39 sections, 8 equations, 7 figures, 2 tables)

Figures (7)

  • Figure 1: Dataset overview. (a) Comments per post distribution (median 4, heavy tail). (b) Comments per agent distribution (median 4). (c) Top 15 submolts by post count.
  • Figure 2: Agent behavioral entropy ($n=5{,}000$ agents with $\geq 10$ comments). (a) Self-NCD distribution (median 0.833): most agents vary their output across posts. (b) Token entropy distribution (median 8.36 bits). (c) Self-NCD vs. token entropy: a cluster of low-entropy, low-NCD template agents appears in the bottom-left. Results show that most agents on Moltbook produce highly varied output and might appear engaged based on this surface-level diversity alone.
  • Figure 3: Information saturation curves averaged over $20{,}000$ posts. (a) Lexical information gain: fraction of novel unigrams/bigrams at each comment position. (b) Compression-based information gain. (c) Cumulative unique vocabulary growth. All curves rise steeply at first and then flatten, indicating rapid information saturation.
  • Figure 4: Post-comment relevance. (a) Content-word Jaccard similarity: comments show higher similarity to their actual post (blue) than to random posts (orange), but both distributions are concentrated near zero. (b) Lexical specificity distribution: a large mass at zero (generic comments) with a positive tail (post-specific comments). (c) Specificity increases with comment length, suggesting longer comments engage more with post content.
  • Figure 5: Semantic validation of lexical findings. (a) Lexical similarity distribution. (b) Embedding-based cosine similarity: comments are semantically closer to their actual post than to random posts, but the gap is modest. (c) Lexical vs. semantic specificity distributions. (d) Scatter: moderate positive correlation between the two metrics. (e) Semantic specificity increases with comment length. (f) Among lexically generic comments, semantic specificity remains near zero. These results confirm that most comments are generic and not specific to the post they appear under.
  • ...and 2 more figures
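The two per-agent measures named in Figure 2 can be sketched as follows. This is an illustrative sketch under stated assumptions, not the authors' implementation: Self-NCD is assumed to be the mean pairwise normalized compression distance over an agent's comments, zlib is assumed as the compressor, and tokens are assumed to be lowercase whitespace-separated words. The helper names are hypothetical.

```python
import math
import zlib
from collections import Counter
from itertools import combinations


def compressed_size(text):
    """Size in bytes of the zlib-compressed UTF-8 text (compressor is an assumption)."""
    return len(zlib.compress(text.encode("utf-8"), 9))


def ncd(x, y):
    """Normalized compression distance: (C(xy) - min(C(x), C(y))) / max(C(x), C(y))."""
    cx, cy = compressed_size(x), compressed_size(y)
    return (compressed_size(x + y) - min(cx, cy)) / max(cx, cy)


def self_ncd(comments):
    """Assumed definition: mean NCD over all pairs of one agent's comments."""
    pairs = list(combinations(comments, 2))
    if not pairs:
        return 0.0
    return sum(ncd(a, b) for a, b in pairs) / len(pairs)


def token_entropy(comments):
    """Shannon entropy, in bits, of the agent's whitespace-token distribution."""
    counts = Counter(tok for c in comments for tok in c.lower().split())
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return -sum((n / total) * math.log2(n / total) for n in counts.values())
```

With these definitions, an agent that posts the same template repeatedly has a low Self-NCD and low token entropy, which would place it in the bottom-left cluster described in Figure 2(c), while an agent whose comments differ from one another scores high on both.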