An innovative data collection method to eliminate the preprocessing phase in web usage mining

Ozkan Canay, Umit Kocabicak

TL;DR

This work tackles the preprocessing bottleneck in web usage mining by introducing a server-side, application-based logging API embedded in a three-tier CAWIS framework that collects and stores usage data as homogeneous relational records. The method eliminates the need for extensive preprocessing and provides complete data ownership and faster data collection, suitable for real-time analytics, web analytics, and AI-driven insights. Experimental results from a university web application demonstrate scalable data capture (e.g., 22,104 sessions and 161,672 pageviews in 24 hours) and rich multi-dimensional analyses across users, devices, IPs, and search behavior. While offering clear practical benefits for privacy-preserving, on-site data collection, the approach acknowledges limitations such as missing client-side screen data and outlines future ETL-driven data warehouse integration for long-term analytics.

Abstract

Server logs are commonly regarded as the underlying data source for web usage mining (WUM). However, access log files provide quite limited data about clients. Identifying sessions from this messy data takes considerable effort, and the operations performed for this purpose do not always yield good results. Moreover, this data cannot be used efficiently for web analytics. This study proposes an innovative method for user tracking, session management, and web usage data collection. The method is based mainly on a new approach in which the data collected for web analytics extraction also serves as the data source for web usage mining. To obtain and process log data, an application-based API has been developed that follows a different strategy from conventional client-side methods. Log data has been gathered successfully by integrating the technique into an enterprise web application. The results reveal that the homogeneous, structured data collected and stored with this method is more convenient to browse, filter, and process than web server logs. Stored in a relational database, this data can readily serve as a reliable data source for high-performance web usage mining, real-time web analytics, or a functional recommendation system.
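
To make the idea of application-based logging into relational records more concrete, the following is a minimal sketch, not the authors' API. It assumes a hypothetical two-table schema (`sessions`, `pageviews`) and hypothetical helper names (`open_log_db`, `start_session`, `log_pageview`, `pageviews_per_session`), and it uses SQLite only so that the example is self-contained. The point it illustrates is that when the application itself writes one structured row per session and per pageview, session identification is already done at collection time and analysis reduces to plain SQL.

```python
import sqlite3
import uuid
from datetime import datetime, timezone

# Hypothetical schema: one row per session, one row per pageview.
SCHEMA = """
CREATE TABLE IF NOT EXISTS sessions (
    session_id TEXT PRIMARY KEY,
    user_id    TEXT,
    ip_address TEXT NOT NULL,
    user_agent TEXT NOT NULL,
    started_at TEXT NOT NULL
);
CREATE TABLE IF NOT EXISTS pageviews (
    pageview_id INTEGER PRIMARY KEY AUTOINCREMENT,
    session_id  TEXT NOT NULL REFERENCES sessions(session_id),
    url         TEXT NOT NULL,
    viewed_at   TEXT NOT NULL
);
"""


def _now():
    """Return the current UTC time as an ISO 8601 string."""
    return datetime.now(timezone.utc).isoformat()


def open_log_db(path=":memory:"):
    """Open (or create) the log database and ensure the tables exist."""
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn


def start_session(conn, ip_address, user_agent, user_id=None):
    """Record a new session on the server side and return its identifier."""
    session_id = uuid.uuid4().hex
    conn.execute(
        "INSERT INTO sessions VALUES (?, ?, ?, ?, ?)",
        (session_id, user_id, ip_address, user_agent, _now()),
    )
    conn.commit()
    return session_id


def log_pageview(conn, session_id, url):
    """Record one pageview that already carries its session identifier."""
    conn.execute(
        "INSERT INTO pageviews (session_id, url, viewed_at) VALUES (?, ?, ?)",
        (session_id, url, _now()),
    )
    conn.commit()


def pageviews_per_session(conn):
    """Count pageviews per session with a single query, no log parsing."""
    rows = conn.execute(
        """
        SELECT s.session_id, COUNT(p.pageview_id)
        FROM sessions AS s
        LEFT JOIN pageviews AS p ON p.session_id = s.session_id
        GROUP BY s.session_id
        """
    ).fetchall()
    return dict(rows)
```

In a real web application, calls such as `start_session` and `log_pageview` would presumably be made from request-handling code, for example `sid = start_session(conn, "192.0.2.10", "ExampleBrowser/1.0")` followed by `log_pageview(conn, sid, "/home")`. Where those calls are placed and which columns are stored are assumptions of this sketch rather than details taken from the paper.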

Paper Structure (20 sections, 3 equations, 8 figures, 6 tables)

This paper contains 20 sections, 3 equations, 8 figures, 6 tables.

Figures (8)

  • Figure 1: Proposed data collection method and improvement of WUM process
  • Figure 2: Three-tier organization of the proposed method
  • Figure 3: Sequence diagram representation of the way the log API works
  • Figure 4: The entity-relationship diagram of the physical data model
  • Figure 5: Top 20 users' numbers of pageviews and sessions by the most pageviews in all sessions
  • ...and 3 more figures