Shy Guys: A Light-Weight Approach to Detecting Robots on Websites

Rémi Van Boxem, Tom Barbette, Cristel Pelsser, Ramin Sadre

Abstract

Automated bots now account for roughly half of all web requests, and an increasing number deliberately spoof their identity to evade detection or to ignore robots.txt. Existing countermeasures are either resource-intensive (JavaScript challenges, CAPTCHAs), cost-prohibitive (commercial solutions), or degrade the user experience. This paper proposes a lightweight, passive approach to bot detection that combines user-agent string analysis with favicon-based heuristics, operating entirely on standard web server logs with no client-side interaction. We evaluate the method on over 4.6 million requests containing 54,945 unique user-agent strings, collected from websites hosted around the world. Our approach detects 67.7% of bot traffic while maintaining a false-positive rate of 3%, outperforming the state of the art, which detects less than 20%. This method can serve as a first line of defence, routing only genuinely ambiguous requests to active challenges and preserving the experience of legitimate users.
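The passive pipeline the abstract describes can be illustrated with a minimal sketch: flag a client as a likely bot if its user-agent claims a deprecated browser version or if it never requested the site's favicon. The version cutoff, log field layout, and function names below are illustrative assumptions, not the authors' exact implementation.

```python
import re
from collections import defaultdict

# Assumed cutoff below which a claimed Chrome version is treated as deprecated.
MIN_MODERN_CHROME = 100

def claims_deprecated_chrome(user_agent: str) -> bool:
    """Return True if the user-agent claims a Chrome version below the cutoff."""
    m = re.search(r"Chrome/(\d+)", user_agent)
    return bool(m) and int(m.group(1)) < MIN_MODERN_CHROME

def classify(log_entries):
    """log_entries: iterable of (ip, path, user_agent) tuples from a server log.

    Returns the set of IPs flagged as likely bots: those that either never
    fetched /favicon.ico or advertise a deprecated browser version.
    """
    fetched_favicon = set()
    agents = defaultdict(set)
    for ip, path, ua in log_entries:
        if path == "/favicon.ico":
            fetched_favicon.add(ip)
        agents[ip].add(ua)
    suspects = set()
    for ip, uas in agents.items():
        if ip not in fetched_favicon or any(claims_deprecated_chrome(u) for u in uas):
            suspects.add(ip)
    return suspects
```

Because the sketch only reads fields already present in standard access logs, it needs no client-side code, matching the paper's requirement of a fully passive first line of defence.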

Paper Structure

This paper contains 21 sections, 6 figures, 3 tables, and 1 algorithm.

Figures (6)

  • Figure 1: Overview of the bot detection methodology. Raw logs from multiple web server formats are anonymized, normalized, and then analyzed through two complementary approaches: favicon-based analysis and user-agent header-based analysis.
  • Figure 2: Daily unique IP addresses issuing favicon requests and POST requests to /course/ over the observation period.
  • Figure 3: Distribution of claimed Android versions in user-agent strings.
  • Figure 4: Distribution of claimed Chrome and Firefox versions in user-agent strings.
  • Figure 5: Overlap among bot detection methods across 54,945 unique user-agent strings, visualized with an UpSet plot [lex_upset_2014]. The two largest intersections, deprecated OS $\cap$ deprecated browser (27k) and deprecated browser alone (22k), dominate, indicating that the majority of user-agent strings are caught using version-based heuristics.
  • ...and 1 more figure