JobResQA: A Benchmark for LLM Machine Reading Comprehension on Multilingual Résumés and JDs

Explainable & Ethical AI
Published on arXiv: 2601.23183v1
Authors

Casimiro Pio Carrino, Paula Estrella, Rabih Zbib, Carlos Escolano, José A. R. Fonollosa

Abstract

We introduce JobResQA, a multilingual Question Answering benchmark for evaluating the Machine Reading Comprehension (MRC) capabilities of LLMs on HR-specific tasks involving résumés and job descriptions. The dataset comprises 581 QA pairs across 105 synthetic résumé-job description pairs in five languages (English, Spanish, Italian, German, and Chinese), with questions spanning three complexity levels from basic factual extraction to complex cross-document reasoning. We propose a data generation pipeline derived from real-world sources through de-identification and data synthesis to ensure both realism and privacy, while controlled demographic and professional attributes (implemented via placeholders) enable systematic bias and fairness studies. We also present a cost-effective, human-in-the-loop translation pipeline based on the TEaR methodology, incorporating MQM error annotations and selective post-editing to ensure a high-quality multi-way parallel benchmark. We provide baseline evaluations across multiple open-weight LLM families using an LLM-as-judge approach, revealing higher performance on English and Spanish but substantial degradation for the other languages, highlighting critical gaps in multilingual MRC capabilities for HR applications. JobResQA provides a reproducible benchmark for advancing fair and reliable LLM-based HR systems. The benchmark is publicly available at: https://github.com/Avature/jobresqa-benchmark

Paper Summary

Problem
The paper addresses the lack of benchmarks for evaluating the performance, fairness, and bias of Large Language Models (LLMs) on Human Resources (HR) tasks, specifically the analysis of résumés for matching against job descriptions. This task involves answering questions about a candidate's skills, experience, and background in relation to a job description, and is a critical use case for LLMs in HR.
Key Innovation
The key innovation is JobResQA, a multilingual Question Answering benchmark for evaluating the Machine Reading Comprehension (MRC) capabilities of LLMs on HR-specific tasks involving résumés and job descriptions. It is a curated, synthetic, multilingual QA dataset of 105 résumé-JD pairs (581 QA items) that supports both short and long answers across three complexity levels: basic (extractive), intermediate (multi-passage), and complex (cross-document reasoning); a sketch of what one item might look like follows below.
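As a rough illustration of how such a dataset might be organized and consumed, the following Python sketch assumes a hypothetical JSONL layout with one QA item per line; the field names (pair_id, language, complexity, question, answer, resume, job_description) are illustrative assumptions, not the benchmark's actual schema, which is documented in the repository linked above.

```python
import json

# Hypothetical layout for one JobResQA item. The field names here are
# illustrative assumptions, not the benchmark's actual schema.
example_item = {
    "pair_id": "resume_jd_001",
    "language": "en",                 # one of: en, es, it, de, zh
    "complexity": "complex",          # basic | intermediate | complex
    "question": "Does the candidate meet the required years of experience?",
    "answer": "Yes; the résumé shows 6 years against the 5 required.",
    "resume": "...",                  # full synthetic résumé text
    "job_description": "...",         # full job description text
}

def load_items(path):
    """Load QA items from a JSON Lines file (one JSON record per line)."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f]

def filter_items(items, language=None, complexity=None):
    """Select the subset matching the requested language and/or complexity."""
    return [
        item for item in items
        if (language is None or item["language"] == language)
        and (complexity is None or item["complexity"] == complexity)
    ]
```

Under these assumptions, filter_items(items, language="de", complexity="basic") would isolate the German extractive subset, the kind of slice needed for a per-language breakdown like the one reported in the paper.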
Practical Impact
The practical impact of this research is significant, as it provides a reproducible benchmark for advancing fair and reliable LLM-based HR systems. The JobResQA dataset can be used to evaluate the performance of LLMs on HR tasks, identify biases and fairness issues, and develop more accurate and transparent HR systems. This can lead to better candidate matching, reduced bias in hiring decisions, and improved overall HR processes.
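To make the evaluation use case concrete, here is a minimal sketch of an LLM-as-judge loop of the kind the abstract describes, reusing the hypothetical item fields from the previous sketch. The chat callable is a stand-in for any chat-completion client, and the judge prompt and binary CORRECT/INCORRECT scale are assumptions for illustration, not the paper's exact protocol.

```python
# Minimal LLM-as-judge sketch. `chat` stands in for any chat-completion
# callable (messages -> response text); it is a placeholder, not a real API.

JUDGE_PROMPT = """You are grading an answer to a question about a résumé
and a job description.
Question: {question}
Reference answer: {reference}
Model answer: {candidate}
Reply with exactly one word: CORRECT or INCORRECT."""

def answer_question(chat, item):
    """Ask the evaluated model the benchmark question over both documents."""
    context = (
        f"Résumé:\n{item['resume']}\n\n"
        f"Job description:\n{item['job_description']}"
    )
    return chat(messages=[
        {"role": "system", "content": "Answer using only the documents provided."},
        {"role": "user", "content": f"{context}\n\nQuestion: {item['question']}"},
    ])

def is_correct(chat, item, candidate):
    """Have a judge model compare the candidate answer to the reference."""
    prompt = JUDGE_PROMPT.format(
        question=item["question"],
        reference=item["answer"],
        candidate=candidate,
    )
    verdict = chat(messages=[{"role": "user", "content": prompt}])
    return verdict.strip().upper().startswith("CORRECT")

def accuracy(eval_chat, judge_chat, items):
    """Fraction of items whose answers the judge marks correct."""
    hits = sum(
        is_correct(judge_chat, item, answer_question(eval_chat, item))
        for item in items
    )
    return hits / len(items)
```

Running such a loop per language subset is what would surface the gap the paper reports between English/Spanish and the other languages.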
Analogy / Intuitive Explanation
Imagine you are a recruiter trying to match a candidate with a job opening. You need to read the candidate's résumé and the job description to determine whether they have the required skills and experience, a process that is time-consuming and prone to bias. LLMs can help automate it, but they first need to be evaluated to ensure they are accurate and fair. JobResQA is like a standardized exam for this task: it lets developers measure how well an LLM performs, identify areas for improvement, and build more accurate and transparent HR systems.
Paper Information
Categories: cs.CL
arXiv ID: 2601.23183v1
