Description of the Corpus

This page describes the corpora for the TalentCLEF 2026 tasks. The download link is available on the datasets page.

Task A: Contextualized Job-Person Matching

Summary:

The Task A corpus consists of job descriptions and résumés (CVs) in two languages: English and Spanish. Documents are synthetically generated from structured data derived from real job descriptions and curricula vitae, preserving realism while protecting privacy. The dataset covers multiple professional domains and sectors, selected via clustering over semantic vector representations to ensure diversity and representativeness.

Synthetic documents were generated in English and then manually reviewed to ensure quality and internal coherence. Job–candidate matches were annotated by human experts, who determined whether each résumé is suitable for a given job offer. The Spanish version was created through parallel translation using LLMs and validated by human reviewers to ensure semantic consistency and content equivalence across languages.

Data:

  1. Training Set:

    No training set is provided for this task. However, participants are encouraged to use external resources or data from previous TalentCLEF editions to model the problem and develop their solutions.

  2. Development Set:

    For each language (English and Spanish), the development set consists of three components: queries, corpus, and qrels.tsv.

    • Queries: A folder containing 10 job description files. Each file is named with its unique identifier (e.g., 1234, 5678), and the file name serves as the q_id for that query.

    • Corpus: A folder containing 472 résumé files. Each file is named with its unique identifier (e.g., 1, 2, 3), and the file name serves as the c_id for that corpus document.

    • qrels.tsv: This file defines the relationship between the queries and the corpus elements. The file itself has no header row; the column names below are shown for illustrative purposes only:

      • q_id: The identifier of the query (corresponding to the query file name).
      • iter: A reserved field (always 0).
      • c_id: The identifier of the corresponding corpus element (corresponding to the corpus file name).
      • relevance: A binary score (0 or 1) indicating the relevance of the corpus element to the query, where 1 signifies relevant and 0 non-relevant.

      Example structure of qrels.tsv:

      q_id    iter    c_id    relevance
      1234    0       1       1
      1234    0       5       0
      1234    0       12      1
      5678    0       2       1
      5678    0       8       0
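
The development set layout described above could be loaded along these lines. This is a minimal sketch, not an official loader. It assumes that qrels.tsv is tab-separated (suggested by the extension) and UTF-8 encoded, and that document identifiers are file names with any extension dropped. The function names `load_qrels` and `load_documents` are hypothetical.

```python
import csv
from collections import defaultdict
from pathlib import Path


def load_qrels(path):
    """Parse a headerless qrels file into {q_id: {c_id: relevance}}.

    Assumes four tab-separated columns: q_id, iter, c_id, relevance.
    """
    qrels = defaultdict(dict)
    with open(path, newline="", encoding="utf-8") as handle:
        for row in csv.reader(handle, delimiter="\t"):
            if not row:
                continue  # skip blank lines
            q_id, _iter, c_id, relevance = row  # iter is reserved (always 0)
            qrels[q_id][c_id] = int(relevance)
    return dict(qrels)


def load_documents(folder):
    """Read every file in a folder into {identifier: text}.

    The identifier is the file name without an extension (assumption:
    files may or may not carry one).
    """
    return {
        path.stem: path.read_text(encoding="utf-8")
        for path in sorted(Path(folder).iterdir())
        if path.is_file()
    }
```

With the example rows shown above, `load_qrels` would map query 1234 to résumés 1, 5, and 12 with relevance 1, 0, and 1. Identifiers are kept as strings so that they compare directly against file names.
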
  3. Test Set:

Task B: Job-Skill Matching with Skill Type Classification

Summary:

This dataset supports job title–based skill prediction in English across multiple job domains and professional sectors. It includes job titles and associated skills, curated and processed to facilitate training and evaluation for this task.

Participants must retrieve the skills from a gazetteer that best match each job title and classify each retrieved skill as core or contextual. Core skills are required to perform a job regardless of employer or work setting (i.e., essential for the role). Contextual skills depend on factors such as the organization, industry, or specific project, and can therefore be considered optional.

For example, a software engineer may be required to develop mobile applications in some contexts depending on the product and technical stack, but will generally be expected to write code. In this case, write code is a core skill, while mobile application development is a contextual skill.
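
The core/contextual distinction can be pictured as a label attached to each retrieved skill. The following is only an illustrative sketch: the names `SkillType`, `SkillPrediction`, and `split_by_type` are hypothetical and do not reflect any official submission format.

```python
from dataclasses import dataclass
from enum import Enum


class SkillType(Enum):
    CORE = "core"              # required regardless of employer or work setting
    CONTEXTUAL = "contextual"  # depends on organization, industry, or project


@dataclass(frozen=True)
class SkillPrediction:
    skill: str             # skill label retrieved from the gazetteer
    skill_type: SkillType  # predicted type for this job title


def split_by_type(predictions):
    """Group predicted skill labels by type, preserving their order."""
    grouped = {SkillType.CORE: [], SkillType.CONTEXTUAL: []}
    for prediction in predictions:
        grouped[prediction.skill_type].append(prediction.skill)
    return grouped


# The software engineer example from the paragraph above.
software_engineer = [
    SkillPrediction("write code", SkillType.CORE),
    SkillPrediction("mobile application development", SkillType.CONTEXTUAL),
]
```

Calling `split_by_type(software_engineer)` places "write code" under the core type and "mobile application development" under the contextual type.
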

Data:

  1. Training Set:

    The training data for Task B was generated from the information available in ESCO. It is provided as three separate files: job2skill.tsv, jobid2terms.json, and skillid2terms.json.

  2. Validation Set:

  3. Test Set:

Last modified February 1, 2026: Update datasets description (81ca478)