An innovative data collection method to eliminate the preprocessing phase in web usage mining
Ozkan Canay, Umit Kocabicak
TL;DR
This work tackles the preprocessing bottleneck in web usage mining by introducing a server-side, application-based logging API, embedded in a three-tier CAWIS framework, that collects and stores usage data as homogeneous relational records. The method eliminates the need for extensive preprocessing, gives the site operator complete ownership of the data, and speeds up data collection, making it suitable for real-time analytics, web analytics, and AI-driven insights. Experimental results from a university web application demonstrate scalable data capture (e.g., 22,104 sessions and 161,672 pageviews in 24 hours) and rich multi-dimensional analyses across users, devices, IPs, and search behavior. Alongside clear practical benefits for privacy-preserving, on-site data collection, the authors acknowledge limitations such as missing client-side screen data and outline future ETL-driven data warehouse integration for long-term analytics.
Abstract
Server logs are commonly regarded as the underlying data source for web usage mining (WUM). However, access log files provide quite limited data about clients. Identifying sessions from this messy data takes considerable effort, and the operations performed for this purpose do not always yield satisfactory results. This data also cannot be used efficiently for web analytics. This study proposes an innovative method for user tracking, session management, and web usage data collection. The method is mainly based on a new approach in which the data collected for web analytics extraction also serves as the data source for web usage mining. An application-based API, which follows a different strategy from conventional client-side methods, has been developed to obtain and process log data. Log data has been gathered successfully by integrating the technique into an enterprise web application. The results reveal that the homogeneous, structured data collected and stored with this method is more convenient to browse, filter, and process than web server logs. Stored in a relational database, this data can serve as a reliable data source for high-performance web usage mining, real-time web analytics, or a functional recommendation system.
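To make the idea of application-side logging into relational records more concrete, the following is a minimal sketch, not the authors' API. It assumes a hypothetical two-table schema (`sessions`, `pageviews`) and hypothetical helpers `open_log`, `log_pageview`, and `pageviews_per_session`, with SQLite standing in for whatever relational database the real system uses. Because the application already knows the session identifier when it writes a record, no session reconstruction from raw access logs is needed afterwards.

```python
import sqlite3
from datetime import datetime, timezone

# Hypothetical schema: one row per session, one row per pageview.
SCHEMA = """
CREATE TABLE IF NOT EXISTS sessions (
    session_id TEXT PRIMARY KEY,
    user_id    TEXT,
    ip         TEXT,
    user_agent TEXT,
    started_at TEXT NOT NULL
);
CREATE TABLE IF NOT EXISTS pageviews (
    id         INTEGER PRIMARY KEY AUTOINCREMENT,
    session_id TEXT NOT NULL REFERENCES sessions(session_id),
    url        TEXT NOT NULL,
    viewed_at  TEXT NOT NULL
);
"""


def open_log(path=":memory:"):
    """Open (or create) the usage database and ensure the tables exist."""
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn


def log_pageview(conn, session_id, url, user_id=None, ip=None,
                 user_agent=None, now=None):
    """Record one pageview; create the session row on first sight."""
    ts = (now or datetime.now(timezone.utc)).isoformat()
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "INSERT OR IGNORE INTO sessions "
            "(session_id, user_id, ip, user_agent, started_at) "
            "VALUES (?, ?, ?, ?, ?)",
            (session_id, user_id, ip, user_agent, ts),
        )
        conn.execute(
            "INSERT INTO pageviews (session_id, url, viewed_at) "
            "VALUES (?, ?, ?)",
            (session_id, url, ts),
        )


def pageviews_per_session(conn):
    """Example analysis: pageview count per session, straight from SQL."""
    return conn.execute(
        "SELECT session_id, COUNT(*) FROM pageviews "
        "GROUP BY session_id ORDER BY session_id"
    ).fetchall()
```

In a real web application, a call such as `log_pageview(conn, session_id, request_url, ...)` would presumably sit in the request-handling path, so that every record is structured at the moment it is written and can be filtered or aggregated with ordinary SQL.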
