The Problems with Proxies: Making Data Work Visible through Requester Practices
Annabel Rothschild, Ding Wang, Niveditha Jayakumar Vilvanathan, Lauren Wilcox, Carl DiSalvo, Betsy DiSalvo
TL;DR
This paper tackles the invisibility of data workers in AI data pipelines and how requester practices shape dataset quality. It reports on 52 interviews with data work requesters on platforms like Amazon Mechanical Turk (MTurk), revealing that workers are treated as anonymous contributors and that many proxies for identity and aptitude are neither reliable nor fair. The authors argue that data is designed through these requester practices, creating power imbalances and biased datasets, and they propose policy-oriented measures such as standardizing proxies, increasing transparency, involving workers in proxy design, auditing, and compensating pre-task work. The work emphasizes recognizing data workers as domain experts and reforming dataset sourcing policies to improve ethics and data integrity in AI systems.
Abstract
Fairness in AI and ML systems is increasingly linked to the proper treatment and recognition of data workers involved in training dataset development. Yet, those who collect and annotate the data, and thus have the most intimate knowledge of its development, are often excluded from critical discussions. This exclusion prevents data annotators, who are domain experts, from contributing effectively to dataset contextualization. Our investigation into the hiring and engagement practices of 52 data work requesters on platforms like Amazon Mechanical Turk reveals a gap: requesters frequently hold naive or unchallenged notions of worker identities and capabilities and rely on ad-hoc qualification tasks that fail to respect the workers' expertise. These practices undermine not only the quality of data but also the ethical standards of AI development. To rectify these issues, we advocate for policy changes to enhance how data annotation tasks are designed and managed and to ensure data workers are treated with the respect they deserve.
