TakeLab Retriever: AI-Driven Search Engine for Articles from Croatian News Outlets

David Dukić, Marin Petričević, Sven Ćurković, Jan Šnajder

TL;DR

TakeLab Retriever addresses the challenge of biased and incomplete access to Croatian news by providing a Croatian-focused semantic search engine that integrates an end-to-end NLP pipeline with a microservice architecture. It combines a custom scraper (scheduler, downloader, extractor) with an NLP stack (Core, NER, NEL, low-quality detector, IPTC topic classifier) and a web application to enable complex queries and rich visualizations over a growing archive of over ten million articles from 33 outlets as of Nov 2024. Key contributions include exact SimHash-based deduplication, Wikidata-driven entity linking, IPTC-based multi-label topic classification via Omikuji, and a GPU-accelerated NLP pipeline that supports reindexing. The system demonstrates scalable, unbiased, and precise retrieval for social science research, offering capabilities beyond general-purpose search engines and enabling insights into trends, patterns, and correlations in Croatian online news content. Ongoing work aims to enhance user experience, extend NLP capabilities (e.g., sentiment analysis, keyphrase extraction), and broaden outlet coverage.
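For readers unfamiliar with SimHash, the deduplication idea mentioned above can be illustrated with a minimal Python sketch. This is not the TakeLab Retriever implementation. The word tokenizer, the MD5-based token hash, the 64-bit fingerprint width, and the names `simhash`, `hamming_distance`, and `is_duplicate` are all assumptions made for illustration. With the default `max_distance=0`, two texts count as duplicates only when their fingerprints match exactly, which is one possible reading of "exact" deduplication.

```python
import hashlib
import re

FINGERPRINT_BITS = 64  # assumed width; the report's summary does not state one


def simhash(text: str) -> int:
    """Return a 64-bit SimHash fingerprint of the text (illustrative sketch)."""
    counts = [0] * FINGERPRINT_BITS
    # Hypothetical tokenizer: lowercase word tokens, punctuation ignored.
    for token in re.findall(r"\w+", text.lower()):
        digest = hashlib.md5(token.encode("utf-8")).digest()
        token_hash = int.from_bytes(digest[:8], "big")  # first 64 bits
        for bit in range(FINGERPRINT_BITS):
            # Each token votes +1 or -1 on every bit position.
            counts[bit] += 1 if (token_hash >> bit) & 1 else -1
    fingerprint = 0
    for bit, vote in enumerate(counts):
        if vote > 0:
            fingerprint |= 1 << bit
    return fingerprint


def hamming_distance(a: int, b: int) -> int:
    """Count the bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


def is_duplicate(text_a: str, text_b: str, max_distance: int = 0) -> bool:
    """Treat two texts as duplicates if their fingerprints are close enough."""
    return hamming_distance(simhash(text_a), simhash(text_b)) <= max_distance
```

Raising `max_distance` above zero would turn the same check into near-duplicate detection, at the cost of occasional false matches.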

Abstract

TakeLab Retriever is an AI-driven search engine designed to discover, collect, and semantically analyze news articles from Croatian news outlets. It offers a unique perspective on the history and current landscape of Croatian online news media, making it an essential tool for researchers seeking to uncover trends, patterns, and correlations that general-purpose search engines cannot provide. TakeLab Retriever uses cutting-edge natural language processing (NLP) methods that let users of the web application filter articles by named entities, phrases, and topics. This technical report is divided into two parts: the first explains how TakeLab Retriever is used, while the second provides a detailed account of its design. In the second part, we also address the software engineering challenges involved and propose solutions for developing a microservice-based semantic search engine capable of handling over ten million news articles published over the past two decades.

Paper Structure

This paper contains 17 sections, 5 figures, 2 tables.

Figures (5)

  • Figure 1: 30-day rolling averages of published and crawled articles over time: (a) Rolling average number of published articles from news outlets within 12-month periods (starting with January 2000); (b) Rolling average number of crawled articles from news outlets within 3-month periods (over the last two years).
  • Figure 2: An example of a complex query in the TakeLab Retriever web application, combining different search constraints.
  • Figure 3: TakeLab Retriever web application search results for the entity constraints Nikola Tesla and Albert Einstein, combined with the topic SCIENCE AND TECHNOLOGY, across 33 Croatian news outlets, with no articles hidden from the results.
  • Figure 4: The architecture of TakeLab Retriever.
  • Figure 5: The directed acyclic graph of the NLP pipeline in TakeLab Retriever used to semantically index articles with phrases, entities, and topics. The arcs depict which module depends on the previous ones in the pipeline.
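The dependency structure described for Figure 5 can be illustrated with a short Python sketch that derives a valid execution order from a directed acyclic graph. The module names follow the NLP stack listed in the TL;DR, but the specific arcs below (for example, NEL depending on NER) are assumptions made for illustration, not the arcs shown in the actual figure. The function name `execution_order` is likewise hypothetical.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical dependency map: each module lists the modules it depends on.
# These arcs are assumed for illustration and may differ from Figure 5.
PIPELINE = {
    "core": set(),
    "ner": {"core"},
    "nel": {"ner"},
    "low_quality_detector": {"core"},
    "iptc_topic_classifier": {"core"},
}


def execution_order(dependencies: dict) -> list:
    """Return one ordering in which every module runs after its dependencies."""
    return list(TopologicalSorter(dependencies).static_order())


order = execution_order(PIPELINE)
print(order)
```

If the dependency map contained a cycle, `static_order()` would raise `graphlib.CycleError`, which is a convenient check that a pipeline definition really is acyclic.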