Rehabilitating Homeless: Dataset and Key Insights

Anna Bykova, Nikolay Filippov, Ivan P. Yamshchikov

TL;DR

The paper addresses the scarcity of detailed, individual-level data on homelessness rehabilitation and presents a large anonymized dataset collected over a decade by the NGO Nochlezhka. It describes the data pipeline: cleaning, de-duplication, anonymization, feature extraction (43 contract types, 8 statuses, 28 other features), and NLP-derived tags from social-worker notes, resulting in $6349$ records with $51$ features. A suite of classifiers is evaluated for predicting contract completion; CatBoost and XGBoost perform best, reaching up to $0.80$–$0.85$ F1 on balanced validation data, which supports the usefulness of the dataset for prediction and analysis. The analysis highlights practical, non-causal factors associated with rehabilitation outcomes (e.g., absence, age, social connections, disabilities) and discusses ethical considerations, with the aim of helping NGOs and researchers design better interventions and policies.

Abstract

This paper presents a large anonymized dataset of homelessness alongside insights into the data-driven rehabilitation of homeless people. The dataset was gathered by a large nonprofit organization working on rehabilitating the homeless for twenty years. This is the first dataset that we know of that contains rich information on thousands of homeless individuals seeking rehabilitation. We show how data analysis can help to make the rehabilitation of homeless people more effective and successful. Thus, we hope this paper alerts the data science community to the problem of homelessness.

Paper Structure

This paper contains 17 sections, 6 figures, and 2 tables.

Figures (6)

  • Figure 1: Contract types distribution
  • Figure 2: Contract statuses distribution
  • Figure 3: Top 20 most important features: Decision Tree Classifier with Random Oversampling
  • Figure 4: Top 20 most important features: Random Forest Classifier with Random Oversampling
  • Figure 5: Top 20 most important features: CatBoost Classifier with Random Oversampling
  • ...and 1 more figure
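
The figure captions mention random oversampling, and the summary reports F1 scores. As a minimal sketch of what those two terms mean, the Python below balances a toy label set by duplicating minority-class rows and computes F1 from predictions. It is not the authors' pipeline: the function names, the toy rows (a single made-up numeric feature), and the labels are all hypothetical and chosen only for illustration.

```python
import random
from collections import Counter


def random_oversample(rows, labels, seed=0):
    """Duplicate randomly chosen minority-class rows until every class
    has as many rows as the largest class."""
    rng = random.Random(seed)
    by_class = {}
    for row, label in zip(rows, labels):
        by_class.setdefault(label, []).append(row)
    target = max(len(members) for members in by_class.values())
    out_rows, out_labels = [], []
    for label, members in by_class.items():
        # Sample with replacement to fill the gap to the majority size.
        extra = [rng.choice(members) for _ in range(target - len(members))]
        for row in members + extra:
            out_rows.append(row)
            out_labels.append(label)
    return out_rows, out_labels


def f1_score(y_true, y_pred, positive=1):
    """F1 = harmonic mean of precision and recall for the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical toy data: four rows of class 1, two rows of class 0.
rows = [[25], [40], [31], [58], [47], [36]]
labels = [1, 1, 1, 1, 0, 0]

balanced_rows, balanced_labels = random_oversample(rows, labels)
print(Counter(balanced_labels))               # both classes now have 4 rows
print(f1_score([1, 1, 0, 0], [1, 0, 0, 1]))   # precision 0.5, recall 0.5
```

With real data, the same two steps would wrap whichever classifier is being evaluated (decision tree, random forest, CatBoost, and so on); the sketch deliberately leaves the model out so that it depends only on the standard library.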