
CleverBirds: A Multiple-Choice Benchmark for Fine-grained Human Knowledge Tracing

Leonie Bossemeyer, Samuel Heinrich, Grant Van Horn, Oisin Mac Aodha

TL;DR

CleverBirds tackles fine-grained visual knowledge tracing (KT) by introducing a large-scale, real-world bird species quiz dataset drawn from eBird, comprising over 17.9 million interactions across more than 10,000 species and more than 40,000 participants. The authors formalize a flexible problem setting in which learner responses are predicted from image embeddings, question context, and an interaction-history window of size $W$, and they evaluate multiple contexts (User, Species, Image) with a broad set of baselines, including transformer-based KT models and traditional classifiers. Engineered context and image features substantially boost predictive performance, whereas standard KT models offer limited gains on this dataset, highlighting the need for stronger long-range and cross-concept modeling for visual KT at scale. The dataset and evaluation framework support studying how visual expertise develops over time and across individuals, opening new methodological directions for teaching and understanding fine-grained visual recognition.
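The windowed problem setting can be illustrated with a minimal sketch of how one participant's chronologically ordered quiz log might be turned into prediction examples, where each example pairs the current question with at most $W$ earlier interactions. The record fields (`species`, `options`, `guess`, `image_embedding`) and the helper `make_examples` are hypothetical and do not describe the released data schema; the sketch only shows the "history of size W plus current question" structure, not the authors' feature engineering.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple


@dataclass(frozen=True)
class Interaction:
    """One quiz answer. Field names are hypothetical, not the released schema."""
    species: str                        # correct species for the question image
    options: Tuple[str, ...]            # option labels shown to the participant
    guess: str                          # option the participant selected
    image_embedding: Tuple[float, ...]  # e.g. features from a frozen image encoder


@dataclass(frozen=True)
class Example:
    history: Tuple[Interaction, ...]  # up to W earlier interactions, oldest first
    query: Interaction                # current question; query.guess is the label


def make_examples(user_log: Sequence[Interaction], window: int) -> List[Example]:
    """Build one example per interaction from a chronologically ordered log.

    Only interactions strictly before step t enter the history, so the model
    never sees the answer it is asked to predict.
    """
    if window < 0:
        raise ValueError("window must be non-negative")
    examples = []
    for t, query in enumerate(user_log):
        start = max(0, t - window)  # keep at most `window` previous steps
        examples.append(Example(history=tuple(user_log[start:t]), query=query))
    return examples
```

With `window=2`, the example for a participant's fourth answer carries their second and third answers as history, and the first answer's example has an empty history.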

Abstract

Mastering fine-grained visual recognition, essential in many expert domains, can require that specialists undergo years of dedicated training. Modeling the progression of such expertise in humans remains challenging, and accurately inferring a human learner's knowledge state is a key step toward understanding visual learning. We introduce CleverBirds, a large-scale knowledge tracing benchmark for fine-grained bird species recognition. Collected by the citizen-science platform eBird, it offers insight into how individuals acquire expertise in complex fine-grained classification. More than 40,000 participants have taken part in the quiz, answering over 17 million multiple-choice questions spanning over 10,000 bird species, with long-range learning patterns across an average of 400 questions per participant. We release this dataset to support the development and evaluation of new methods for visual knowledge tracing. We show that tracking learners' knowledge is challenging, especially across participant subgroups and question types, with different forms of contextual information offering varying degrees of predictive benefit. CleverBirds is among the largest benchmarks of its kind, offering a substantially larger number of learnable concepts. With it, we hope to enable new avenues for studying the development of visual expertise over time and across individuals.


Paper Structure

This paper contains 17 sections, 5 equations, 9 figures, 9 tables.

Figures (9)

  • Figure 1: (Left) Human Learning. Participants learn from the quiz questions contained in CleverBirds through repeated interactions. For each question, participants are presented with an image of a bird species and a list of possible species names (here {'A', 'B', 'C', 'D'}), which may include the correct answer. After making a guess, they receive feedback in the form of the correct answer (here 'A'). This process is repeated for multiple questions. (Right) Knowledge Tracing. We illustrate the prediction task, in which a model is given a participant’s interaction history together with the current question’s image, options, and correct answer, and is tasked with predicting the participant’s guess.
  • Figure 2: Three examples of the types of quiz questions found in our CleverBirds dataset. In each case, there are four options representing different species and an additional "None of the above" option. The correct answer is indicated in green. Any of the five options can be the correct answer, and the set of candidate species in the option set differs from question to question.
  • Figure 3: Left to right: Cumulative distribution of quizzes attempted per user on a log scale, distribution of users' average accuracies, distribution of species-wise average user accuracies, and average user accuracy by number of prior exposures to a species.
  • Figure 4: (Left) Here we compare lower-quality quiz images (top row) to high-quality ones obtained from eBird species pages (bottom row). Quiz questions may contain images that show birds from a distance, partially obscured, or at uncommon angles. Species from left to right: Bufflehead, California Towhee, Dark-eyed Junco, and Blue-gray Gnatcatcher. (Right) Here we show the average accuracy of users for each possible quality rating. On average, higher-quality images are easier for users.
  • Figure 5: Top-5 most frequently confused species pairs for species with > 1,000 interactions. From top-to-bottom and left-to-right: American Crow vs Fish Crow, Pin-tailed Snipe vs Common Snipe, Redpoll (Hoary) vs Redpoll (Common), Ross's Goose vs Snow Goose, Sharp-shinned Hawk vs Cooper's Hawk, and Short-tailed Shearwater vs Sooty Shearwater. Images taken from eBird [eBirdWeb].
  • ...and 4 more figures
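The question format in Figures 1 and 2 (four candidate species plus "None of the above", with the correct answer given to the model) can be made concrete with a minimal sketch. It assumes that "None of the above" is the correct option exactly when the true species is not among the four candidates, and it pairs this with a deliberately simple, hypothetical baseline (`predict_guess`) that puts the participant's smoothed past accuracy on the correct option and spreads the remaining probability uniformly. This is an illustration of the prediction target, not one of the baselines evaluated in the paper.

```python
from typing import Dict, Sequence, Tuple

NONE_OF_THE_ABOVE = "None of the above"


def option_set(candidates: Sequence[str]) -> Tuple[str, ...]:
    """Four distinct candidate species followed by the fixed fifth option."""
    if len(candidates) != 4 or len(set(candidates)) != 4:
        raise ValueError("expected four distinct candidate species")
    return tuple(candidates) + (NONE_OF_THE_ABOVE,)


def correct_option(species: str, options: Tuple[str, ...]) -> str:
    """Assumption: the true species if listed, otherwise 'None of the above'."""
    return species if species in options[:-1] else NONE_OF_THE_ABOVE


def predict_guess(
    history_correct: Sequence[bool],
    species: str,
    options: Tuple[str, ...],
    prior: float = 0.5,
    prior_weight: float = 2.0,
) -> Dict[str, float]:
    """Hypothetical baseline returning a probability for each of the options.

    P(correct) is the participant's past accuracy, smoothed toward `prior`
    with `prior_weight` pseudo-observations; the rest is split evenly.
    """
    if prior_weight <= 0:
        raise ValueError("prior_weight must be positive")
    n = len(history_correct)
    p_correct = (sum(history_correct) + prior * prior_weight) / (n + prior_weight)
    answer = correct_option(species, options)
    p_other = (1.0 - p_correct) / (len(options) - 1)
    return {opt: (p_correct if opt == answer else p_other) for opt in options}
```

For a participant who answered three of four previous questions correctly, the smoothed accuracy is (3 + 0.5 * 2) / (4 + 2) = 2/3, so the correct option receives 2/3 and each of the other four options receives 1/12.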
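The last panel of Figure 3 relates accuracy to the number of prior exposures to a species. A minimal sketch of one way to compute such a curve is shown below. It assumes that an "exposure" means an earlier question whose correct answer was the same species for the same user, that records arrive in chronological order, and that answers are pooled across users within each exposure bucket (averaging per user first would be an alternative). The function name and record layout are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def accuracy_by_prior_exposure(
    records: Iterable[Tuple[str, str, bool]],
    max_bucket: int = 10,
) -> Dict[int, float]:
    """Pooled accuracy grouped by how often the user has seen the species before.

    records: (user_id, species, is_correct), in chronological order per user.
    Exposure counts at or above `max_bucket` are merged into one bucket.
    """
    seen: Dict[Tuple[str, str], int] = defaultdict(int)  # prior exposures so far
    hits: Dict[int, int] = defaultdict(int)
    totals: Dict[int, int] = defaultdict(int)
    for user, species, is_correct in records:
        bucket = min(seen[(user, species)], max_bucket)
        totals[bucket] += 1
        hits[bucket] += int(is_correct)
        seen[(user, species)] += 1  # count this question for later ones
    return {k: hits[k] / totals[k] for k in sorted(totals)}
```

Counting the current question only after it is scored keeps bucket 0 equal to "first time this user sees this species", which matches the notion of prior exposure.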
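Figure 5 ranks frequently confused species pairs. A minimal sketch of such a ranking is shown below, under stated assumptions: a confusion is a wrong guess naming another species (guesses of "None of the above" are skipped), pairs are unordered so that A-for-B and B-for-A errors are summed, and both species in a pair must exceed the interaction threshold. The helper `top_confused_pairs` is hypothetical and the paper's exact counting rule may differ.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def top_confused_pairs(
    records: Iterable[Tuple[str, str]],
    min_interactions: int = 1000,
    k: int = 5,
    none_label: str = "None of the above",
) -> List[Tuple[Tuple[str, ...], int]]:
    """Most common unordered (true species, guessed species) error pairs.

    records: (true_species, guess) for every answered question.
    """
    records = list(records)
    # Number of questions per true species, used for the frequency filter.
    counts = Counter(species for species, _ in records)
    frequent = {s for s, n in counts.items() if n > min_interactions}
    pairs: Counter = Counter()
    for species, guess in records:
        if guess == species or guess == none_label:
            continue  # correct answers and "None of the above" are not pair confusions
        if species in frequent and guess in frequent:
            pairs[tuple(sorted((species, guess)))] += 1
    return pairs.most_common(k)
```

Sorting each pair before counting is the design choice that makes the ranking symmetric; dropping the `sorted` call would instead rank directed confusions.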