ConFit: Improving Resume-Job Matching using Data Augmentation and Contrastive Learning

Xiao Yu, Jinzhong Zhang, Zhou Yu

TL;DR

ConFit tackles the sparsity of resume-job interaction data by combining data augmentation and contrastive learning to produce dense resume and job embeddings ($E_\theta$) that support fast inner-product matching ($s_\theta(R,J)=E_\theta(R)^\top E_\theta(J)$). Data augmentation via EDA and ChatGPT paraphrasing increases the number of labeled pairs, while in-batch and hard negatives in contrastive learning scale supervision from $B$ labeled pairs to $\mathcal{O}(B^2)$ pairs per batch, improving embedding quality. Empirical results on two real-world datasets show that ConFit outperforms strong baselines, including BM25 and OpenAI text-ada-002, in most ranking tasks, with notable gains in MAP and $nDCG@10$ for both ranking resumes and ranking jobs. Because scoring reduces to an inner product, the approach scales to large candidate pools, and it can be extended with debiasing and preference-aware components for real-world deployment.
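As a rough illustration of the inner-product matching $s_\theta(R,J)=E_\theta(R)^\top E_\theta(J)$ described above, the sketch below ranks jobs for one resume. The embedding values, the job identifiers, and the helper names `inner_product` and `rank_jobs` are made up for this example; in practice the vectors would come from the trained encoder $E_\theta$.

```python
from typing import Dict, List, Tuple


def inner_product(u: List[float], v: List[float]) -> float:
    """Matching score s(R, J) = E(R)^T E(J) for two embedding vectors."""
    return sum(a * b for a, b in zip(u, v))


def rank_jobs(
    resume_emb: List[float], job_embs: Dict[str, List[float]]
) -> List[Tuple[str, float]]:
    """Score every job against one resume and sort by descending score."""
    scored = [
        (job_id, inner_product(resume_emb, emb)) for job_id, emb in job_embs.items()
    ]
    return sorted(scored, key=lambda item: item[1], reverse=True)


# Toy embeddings (hypothetical values, not produced by any real model).
resume = [0.2, 0.9, 0.1]
jobs = {
    "job_a": [0.1, 0.8, 0.0],
    "job_b": [0.9, 0.1, 0.3],
    "job_c": [0.0, 0.2, 0.9],
}

ranking = rank_jobs(resume, jobs)
for job_id, score in ranking:
    print(f"{job_id}: {score:.2f}")
```

Since the score is a plain inner product, job embeddings can be computed once and reused for every query, which is what makes this form of matching compatible with maximum inner product search.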

Abstract

A reliable resume-job matching system helps a company find suitable candidates from a pool of resumes, and helps a job seeker find relevant jobs from a list of job posts. However, since job seekers apply only to a few jobs, interaction records in resume-job datasets are sparse. Unlike many prior works that use complex modeling techniques, we tackle this sparsity problem using data augmentations and a simple contrastive learning approach. ConFit first creates an augmented resume-job dataset by paraphrasing specific sections in a resume or a job post. Then, ConFit uses contrastive learning to further increase training samples from $B$ pairs per batch to $O(B^2)$ per batch. We evaluate ConFit on two real-world datasets and find it outperforms prior methods (including BM25 and OpenAI text-ada-002) by up to 19% and 31% absolute in nDCG@10 for ranking jobs and ranking resumes, respectively.
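To make the step from $B$ pairs to $O(B^2)$ pairs concrete, here is a minimal sketch of a generic in-batch contrastive (InfoNCE-style) loss. It is an assumption about the general technique, not the paper's exact objective: it covers only the resume-to-job direction, omits hard negatives, and uses hypothetical helper names and toy embeddings.

```python
import math
from typing import List


def dot(u: List[float], v: List[float]) -> float:
    return sum(a * b for a, b in zip(u, v))


def score_matrix(
    resume_embs: List[List[float]], job_embs: List[List[float]]
) -> List[List[float]]:
    """All B x B inner-product scores between resumes and jobs in a batch."""
    return [[dot(r, j) for j in job_embs] for r in resume_embs]


def in_batch_contrastive_loss(
    resume_embs: List[List[float]], job_embs: List[List[float]]
) -> float:
    """Resume i is assumed to match job i; the other B - 1 jobs act as negatives."""
    scores = score_matrix(resume_embs, job_embs)
    losses = []
    for i, row in enumerate(scores):
        # Numerically stable log-sum-exp over the row.
        row_max = max(row)
        log_z = row_max + math.log(sum(math.exp(s - row_max) for s in row))
        # Negative log-probability of the matching job under a softmax.
        losses.append(log_z - row[i])
    return sum(losses) / len(losses)


# Toy batch of B = 3 labeled pairs (hypothetical embeddings).
resumes = [[2.0, 0.0, 0.0], [0.0, 2.0, 0.0], [0.0, 0.0, 2.0]]
jobs_aligned = [[2.0, 0.0, 0.0], [0.0, 2.0, 0.0], [0.0, 0.0, 2.0]]
jobs_shuffled = [jobs_aligned[1], jobs_aligned[2], jobs_aligned[0]]

num_scored_pairs = sum(len(row) for row in score_matrix(resumes, jobs_aligned))
loss_aligned = in_batch_contrastive_loss(resumes, jobs_aligned)
loss_shuffled = in_batch_contrastive_loss(resumes, jobs_shuffled)
print(num_scored_pairs, loss_aligned, loss_shuffled)
```

With $B = 3$ labeled pairs, the loss uses all $3 \times 3 = 9$ resume-job scores, and it is lower when each resume's embedding aligns with its own job than when the pairing is shuffled.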

Paper Structure (47 sections, 4 equations, 5 figures, 9 tables)

Figures (5)

  • Figure 1: Model architecture used to encode a resume or a job post, formatted as a collection of $p$ text fields (see the section "More Details on Dataset and Preprocessing" for a full example of a resume and a job post).
  • Figure 2: Runtime comparison between neural-based methods. MIPS denotes maximum inner product search methods, which are supported by FAISS. Non-linear methods require an additional forward pass to produce a score for each resume-job pair. Results are averaged over three runs.
  • Figure 3: Visualizing resume embeddings from ConFit using t-SNE. Colors are assigned using each resume's desired industry. Top-3 most frequent industries are color-coded for easier viewing.
  • Figure 4: ConFit error analysis. We find that 44% of the errors are due to reasons not identifiable from the resume/job documents alone, and 28% are due to a candidate whose resume satisfies all the job requirements but who is less competent than other competing candidates.
  • Figure A1: Resume embeddings produced by the methods in the main experiment table (tbl:main_exp_bert) with BERT-base-multilingual-cased as the backbone encoder. Colors are assigned using each resume's desired industry. Top-3 most frequent industries are color-coded for easier viewing. BERT-base refers to the raw embeddings produced by BERT-base-multilingual-cased.