Big Tech influence over AI research revisited: memetic analysis of attribution of ideas to affiliation

Stanisław Giziński; Paulina Kaczyńska; Hubert Ruczyński; Emilia Wiśnios; Bartosz Pieliński; Przemysław Biecek; Julian Sienkiewicz

TL;DR

This paper investigates whether Big Tech dominates AI research by analyzing how ideas (memes) propagate through citation networks, rather than mere publication counts. It introduces meme score and conditioned sticking factor to quantify meme contagiousness and its dependence on author affiliations, using OpenAlex and S2ORC to scale across a broad corpus. The results challenge simplistic narratives of Big Tech control: while Big Tech–affiliated work can be highly cited and memes vary in contagiousness, mixed affiliations often account for the most influential papers, and Big Tech memes show heightened spread in certain domains. The work advances a nuanced view of academia–industry dynamics and offers methodological tools for tracking how organizational affiliation shapes the framing and diffusion of AI ideas, with implications for research policy and collaboration strategies.

Abstract

There is a growing discourse around Big Tech's domination of the artificial intelligence (AI) research landscape, yet our comprehension of this phenomenon remains cursory. This paper aims to broaden and deepen our understanding of Big Tech's reach and power within AI research. It examines dominance not merely in terms of sheer publication volume but in the propagation of new ideas, or memes. Current studies often reduce the concept of influence to the share of affiliations in academic papers, typically sourced from limited databases such as arXiv or specific academic conferences. The main goal of this paper is to unravel the specific nuances of such influence, determining which AI ideas are predominantly driven by Big Tech entities. By applying network and memetic analysis to AI-oriented paper abstracts and their citation network, we gain deeper insight into this phenomenon. By using two databases, OpenAlex and S2ORC, we are able to perform this analysis on a much larger scale than previous attempts. Our findings suggest that while Big Tech-affiliated papers are disproportionately more cited in some areas, the most cited papers are those affiliated with both Big Tech and Academia. For the most contagious memes, attribution to specific affiliation groups (Big Tech, Academia, mixed affiliation) appears to be equally distributed among the three groups. This suggests that the notion of Big Tech domination over AI research is oversimplified in the discourse.

Paper Structure (18 sections, 2 equations, 9 figures, 2 tables)

Introduction
Related Work
Big tech influence over AI research
Operationalization of the concept of Big Tech papers
Memetic analysis
Materials and methods
Meme score
Conditioned sticking factor
Papers dataset and processing
Results
Non-binary distinction between Big Tech and academy-affiliated papers
Most contagious ideas in AI research
Differences in contagiousness between Companies, Academia, and Big Tech
Discussion and conclusions
Data gathering and processing
...and 3 more sections

Figures (9)

Figure 1: Toy example of how to calculate the meme score. Papers 1, 2 and 3 contain the meme (maroon background). Papers 1 and 3 have a given affiliation (dark blue frame) for which we calculate the conditioned sticking factor. The frequency of the meme is $\frac{3}{6}$. The unconditioned sticking factor is $\frac{1}{3}$, because of three papers that cite papers with the meme (2, 4, and 5), only one has the meme. The sparking factor is $\frac{2}{3}$ since the meme appears in two of three papers that do not cite papers with this meme – it appears in 1 and 3, but not in 6. The conditioned sticking factor is $\frac{1}{2}$ since the meme appears in 1 out of 2 citations of the paper with the meme and given affiliation (2 replicates the meme from 1, 5 does not replicate the meme from 3).
Figure 2: Histogram (log scale) of the number of papers for different fractions of Big Tech affiliations. The width of each bin is 0.025. Most papers are affiliated only with Academia, yet there is also a visible peak for purely Big Tech papers, which supports the validity of this group. The plot also suggests that we should consider a third group of mixed affiliations, as the number of articles with a partial Big Tech fraction is even larger than the number of purely Big Tech papers.
Figure 3: Differences between Academia, mixed Big Tech, and Big Tech as seen from the citation network perspective. The rows are, respectively, nodes' in-degrees (panels a, b) and the PageRank values (panels c, d). The left column shows the distribution plot, presenting the percentage of nodes with a given in-degree or PageRank value for ternary classification ("Academia", "Mixed", and "Big Tech") as well as cumulative quartiles (see text for details). Vertical solid lines on the far right of panels a and c connect statistically indistinguishable categories, while vertical dotted lines extend solid lines for the case of the network with removed isolated nodes. The right column presents probability density distributions associated with the ternary classification on panels a and b: filled squares represent Academia, empty circles represent mixed Big Tech, and filled circles represent Big Tech (node in-degree is increased by one to overcome log scale issues).
Figure 4: Differences between Academia, mixed Company, and Company as seen from the citation network perspective. The rows are, respectively, nodes' in-degrees (panels a, b) and the PageRank values (panels c, d). The left column shows the distribution plot, presenting the percentage of nodes with a given in-degree or PageRank value for ternary classification ("Academia", "Mixed", and "Company") as well as cumulative quartiles (see text for details). Vertical solid lines on the far right of panels a and c connect statistically indistinguishable categories, while vertical dotted lines extend solid lines for the case of the network with removed isolated nodes. The right column presents probability density distributions associated with the ternary classification on panels a and b: filled squares represent Academia, empty circles represent mixed Company, and filled circles represent Company (node in-degree is increased by one to overcome log scale issues).
Figure 5: Top memes selection by thresholds. The left plot shows the number of remaining phrases, depending on the minimum number of meme occurrences. Since the dispersion of observations occurs around the number 20 on the x-axis, and 1000 on the y-axis (marked as dashed lines), we decided to use this point as a cutoff, which resulted in limiting ourselves to over 20 000 observations that have more than 20 appearances. The right plot shows the meme score value of the observation as a function of its position in the list sorted by the meme score. Since the major flattening occurs around the meme score value of 0.25, we decided to use this point as a cutoff, which resulted in limiting ourselves to 251 observations (marked as dashed lines).
...and 4 more figures
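
The toy example in the Figure 1 caption can be reproduced in a few lines of Python. This is an illustrative sketch only. The caption does not list the citation edges, so the `cites` mapping below is one assumed configuration consistent with the stated counts (2 cites 1, 4 cites 2, 5 cites 3). The names `meme_statistics` and `cites` are hypothetical. The final meme score line assumes the commonly used form, frequency times sticking factor divided by sparking factor, with no smoothing term; the paper's own equation may differ.

```python
from fractions import Fraction

# Toy data matching the Figure 1 caption.
papers = {1, 2, 3, 4, 5, 6}
has_meme = {1, 2, 3}          # papers containing the meme
has_affiliation = {1, 3}      # papers with the affiliation of interest
# ASSUMPTION: edge list chosen to agree with the caption's counts;
# cites[p] is the set of papers that p cites.
cites = {1: set(), 2: {1}, 3: set(), 4: {2}, 5: {3}, 6: set()}


def meme_statistics(papers, has_meme, has_affiliation, cites):
    """Return the four quantities named in the Figure 1 caption.

    No guard against empty denominators; this is a toy sketch.
    """
    # Papers citing at least one meme-carrying paper.
    citing_meme = {p for p in papers if cites[p] & has_meme}
    not_citing_meme = papers - citing_meme
    # Papers citing at least one paper with both the meme and the affiliation.
    sources = has_meme & has_affiliation
    citing_sources = {p for p in papers if cites[p] & sources}
    return {
        "frequency": Fraction(len(has_meme), len(papers)),
        "sticking": Fraction(len(citing_meme & has_meme), len(citing_meme)),
        "sparking": Fraction(len(not_citing_meme & has_meme),
                             len(not_citing_meme)),
        "conditioned_sticking": Fraction(len(citing_sources & has_meme),
                                         len(citing_sources)),
    }


stats = meme_statistics(papers, has_meme, has_affiliation, cites)
# ASSUMPTION: meme score = frequency * sticking / sparking (no smoothing).
meme_score = stats["frequency"] * stats["sticking"] / stats["sparking"]
print(stats, meme_score)
```

With these inputs the function returns 1/2, 1/3, 2/3, and 1/2 for the frequency, sticking factor, sparking factor, and conditioned sticking factor, matching the values in the caption.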
