Description of the Corpus

This page describes the corpora for the TalentCLEF 2025 tasks. The download link is available on the datasets page.

Task A: Multilingual Job Title Matching Corpus

Summary:

The corpus used for Task A consists of a set of job titles in three languages: English, Spanish and German, from different job domains and professional sectors. These job titles have been collected and processed in order to facilitate the identification and comparison of equivalent titles across languages.

The training corpus has been generated using public terminologies, ensuring that the job titles are representative of a wide range of job domains and aligned with standard market terminology.

The validation and test corpora, in contrast, have been annotated by domain experts following well-defined guidelines to ensure the consistency and quality of the labels. The annotation process, performed with specialized tools, included several stages of quality control to verify that the labels were accurate and that the annotated titles reflected the relationships between the different languages in a work environment.

Data:

  1. Training Set: The training data is provided in a tabular format with four columns:

    • family_id: The ISCO family id representing the group to which the job identifier belongs.
    • id: An ESCO identifier indicating the origin of the pair’s job titles.
    • jobtitle_1: The first job title in the pair.
    • jobtitle_2: A second job title related to jobtitle_1.

    The training data is provided as one file per language involved in the task. The files are named after the language, using the following format:

    • taskA_training_en.tsv: Contains related job titles in English.
    • taskA_training_es.tsv: Contains related job titles in Spanish.
    • taskA_training_de.tsv: Contains related job titles in German.

    An example of the content of these files is shown below:

    family_id                              id                                                                          jobtitle_1                 jobtitle_2
    http://data.europa.eu/esco/isco/C2512  http://data.europa.eu/esco/occupation/f2b15a0e-e65a-438a-affb-29b9d50b77d1  desarrollador de software  desarrolladora de soluciones
    http://data.europa.eu/esco/isco/C2512  http://data.europa.eu/esco/occupation/f2b15a0e-e65a-438a-affb-29b9d50b77d1  desarrollador de software  ingeniera de aplicaciones
    http://data.europa.eu/esco/isco/C2512  http://data.europa.eu/esco/occupation/d0aa0792-4345-474b-9365-686cf4869d2e  diseñador de software      ingeniero de software

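As a minimal sketch of how the training files described above might be read, the following Python snippet parses tab-separated rows with the four listed columns using the standard `csv` module. It assumes the files are UTF-8 encoded and begin with a header row, as in the example; the helper names `read_training_pairs` and `group_titles_by_occupation` are illustrative and not part of the task definition.

```python
import csv
import io
from collections import defaultdict


def read_training_pairs(handle):
    """Yield (family_id, id, jobtitle_1, jobtitle_2) tuples from a TSV stream.

    Assumes a header row naming the four columns (an assumption based on
    the example shown in this page).
    """
    reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        yield row["family_id"], row["id"], row["jobtitle_1"], row["jobtitle_2"]


def group_titles_by_occupation(pairs):
    """Collect every job title that appears under the same ESCO `id`."""
    groups = defaultdict(set)
    for _family_id, esco_id, title_1, title_2 in pairs:
        groups[esco_id].update((title_1, title_2))
    return dict(groups)


# Tiny in-memory stand-in for taskA_training_es.tsv, copied from the example.
ISCO = "http://data.europa.eu/esco/isco/C2512"
OCC_A = "http://data.europa.eu/esco/occupation/f2b15a0e-e65a-438a-affb-29b9d50b77d1"
OCC_B = "http://data.europa.eu/esco/occupation/d0aa0792-4345-474b-9365-686cf4869d2e"
SAMPLE = (
    "family_id\tid\tjobtitle_1\tjobtitle_2\n"
    f"{ISCO}\t{OCC_A}\tdesarrollador de software\tdesarrolladora de soluciones\n"
    f"{ISCO}\t{OCC_A}\tdesarrollador de software\tingeniera de aplicaciones\n"
    f"{ISCO}\t{OCC_B}\tdiseñador de software\tingeniero de software\n"
)

pairs = list(read_training_pairs(io.StringIO(SAMPLE)))
groups = group_titles_by_occupation(pairs)
```

To read a real file, the same function can be called on `open("taskA_training_es.tsv", encoding="utf-8", newline="")`. Grouping by `id` is one possible way to recover all titles that share an ESCO occupation; other uses of the pairs are equally valid.
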
  2. Validation Set: The validation set is structured into three different files (queries, corpus elements and q_rels) and is provided separately for each language.

    • Queries: The queries file contains the following fields:

      • q_id: A unique identifier for the query.
      • jobtitle: The job title used as the query.
    • Corpus Elements: The corpus elements file contains the following fields:

      • c_id: A unique identifier for each corpus element.
      • jobtitle: The job title present in the corpus.
    • q_rels: This file defines the relationships between the queries and the corpus elements. It has no header row; the column names below are shown for illustrative purposes only.

      • q_id: The identifier of the query.
      • iter: A reserved field (always 0).
      • c_id: The identifier of the corresponding corpus element.
      • relevance: A binary score (0 or 1) indicating the relevance of the corpus element to the query, where 1 signifies relevant and 0 non-relevant.

    The validation set will be provided in English, Spanish, German and Chinese.

    Example of the content of these files for English:

    queries

    q_id  jobtitle
    1     3d animator

    corpus_elements

    c_id  jobtitle
    1     animation artist
    2     3d character animator
    3     character technical director
    4     character designer
    5     animation lead
    6     3d generalist
    7     animator
    8     character rigger
    9     character animator

    q_rels

    q_id  iter  c_id  relevance
    1     0     2     1
    1     0     3     1
    1     0     4     1
    1     0     5     1
    1     0     6     1
    1     0     7     1
    1     0     8     1
    1     0     9     1

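The q_rels layout described above (four fields per line, no header) can be loaded into a lookup from each query to its relevant corpus elements. The sketch below is one way to do this in Python; it assumes the fields are separated by tabs or other whitespace, and the function name `read_qrels` is hypothetical.

```python
import io
from collections import defaultdict


def read_qrels(handle):
    """Map each q_id to the set of c_ids marked relevant (relevance == 1).

    Assumes four whitespace-separated fields per line and no header row.
    Identifiers are kept as strings so that the same code also works for
    non-numeric ids.
    """
    relevant = defaultdict(set)
    for line in handle:
        if not line.strip():
            continue  # skip blank lines
        q_id, _iter, c_id, relevance = line.split()
        if int(relevance) == 1:
            relevant[q_id].add(c_id)
    return dict(relevant)


# First three rows of the example q_rels, written with tab separators.
SAMPLE_QRELS = "1\t0\t2\t1\n1\t0\t3\t1\n1\t0\t4\t1\n"
qrels = read_qrels(io.StringIO(SAMPLE_QRELS))
```
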
  3. Test Set: The test set consists of two components, queries and corpus elements, each annotated with its language, and is designed to evaluate system predictions for job title retrieval. Participants should generate a q_rels file based on the queries and corpus elements provided.

    • Queries: Contains the following fields:

      • q_id: A unique identifier for the query.
      • jobtitle: The job title used as the query.
      • lang: The language of the query’s job title.
    • Corpus Elements: Contains:

      • c_id: A unique identifier for each corpus element.
      • jobtitle: The job title from the corpus element.
      • lang: The language of the corpus element’s job title.
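
To illustrate the expected output of a system, the sketch below builds q_rels rows from queries and corpus elements. The scoring function is a deliberately naive placeholder (token overlap between titles) that would not work across languages; it only stands in for a real model. The names `token_overlap`, `predict_qrels` and the `threshold` value are assumptions made for this example, and the exact submission format should be checked against the task guidelines.

```python
def token_overlap(a, b):
    """Jaccard overlap between lowercase token sets (placeholder similarity)."""
    tokens_a, tokens_b = set(a.lower().split()), set(b.lower().split())
    if not tokens_a or not tokens_b:
        return 0.0
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


def predict_qrels(queries, corpus, threshold=0.3):
    """Return (q_id, iter, c_id, relevance) rows for pairs scoring at or above threshold.

    `queries` and `corpus` map identifiers to job titles. The `iter` field
    is always 0, mirroring the q_rels layout of the validation set.
    """
    rows = []
    for q_id, q_title in queries.items():
        for c_id, c_title in corpus.items():
            if token_overlap(q_title, c_title) >= threshold:
                rows.append((q_id, 0, c_id, 1))
    return rows


# Titles taken from the validation example; ids are kept as strings.
queries = {"1": "3d animator"}
corpus = {
    "1": "animation artist",
    "2": "3d character animator",
    "7": "animator",
    "8": "character rigger",
}
rows = predict_qrels(queries, corpus)
lines = ["\t".join(str(field) for field in row) for row in rows]
```

Each string in `lines` is one tab-separated q_rels row that could be written to a file.
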
Task B: Job Title-Based Skill Prediction Corpus

Summary:

The dataset is designed to support job title-based skill prediction tasks in English across various job domains and professional sectors. It includes job titles and associated skills collected and processed to facilitate the training of models to solve this task.

As with Task A, the training data uses public terminologies to represent a broad spectrum of job domains, while the validation and test sets are annotated by domain experts. This expert annotation follows strict guidelines and quality control measures to ensure consistent labeling and accurate representation of job-title-to-skill relationships.

Data:

  1. Training Set: The training data for Task B has been generated from the information available in ESCO. It is provided in three separate files: job2skill.tsv, jobid2terms.json and skillid2terms.json.

  2. Validation Set: The validation set is divided into three different files: queries, corpus elements and q_rels:

    • Queries: Contains the following fields:

      • q_id: A unique identifier for the query.
      • jobtitle: The job title used as the query.
    • Corpus Elements: Contains:

      • c_id: A unique identifier for each corpus element.
      • esco_uri: The ESCO URI associated with the c_id.
      • skill_aliases: The list of aliases of the ESCO skill.
    • q_rels: This file maps the relationship between the query and the corpus elements:

      • q_id: The identifier of the query.
      • iter: A reserved field (always 0).
      • c_id: The identifier of the corresponding corpus element.
      • relevance: A binary score (0 or 1) indicating the relevance of the corpus element to the query, where 1 signifies relevant and 0 non-relevant.

    Example of the content of these files:

    queries

    q_id         jobtitle
    dev_qb_jt_1  corporate governance analyst

    corpus_elements

    c_id         esco_uri                                                               skill_aliases
    dev_cb_sk_1  http://data.europa.eu/esco/skill/1c460d2d-90c6-4fc9-ad49-febb6e15605a  ['pricing plans', 'price strategies', 'pricing tactics', 'pricing strategies', 'pricing strategy']
    dev_cb_sk_2  http://data.europa.eu/esco/skill/301a6581-e983-4bb6-8b31-b3ee2cbc2392  ['putting out fires', …, 'fires putting out']
    dev_cb_sk_3  http://data.europa.eu/esco/skill/a4881e54-6055-4e61-855a-0a56ced7cfa3  ['online assessment', 'analysis of web strategy', 'web presence assessment', 'web strategy assessment']
    dev_cb_sk_4  http://data.europa.eu/esco/skill/efda73b4-5212-40a7-b2f8-d2f754ffdf2b  ['keeping up with trends', 'keep pace with trends', 'follow trends', …, 'keep up with trends']
    dev_cb_sk_5  http://data.europa.eu/esco/skill/22a173f5-868c-4d82-87e6-beed500ce070  ['prepare tax returns form', …, 'make tax returns forms ready', 'preparing tax returns forms']
    dev_cb_sk_6  http://data.europa.eu/esco/skill/97b890ff-acd7-46ad-8d3a-4186f4d42bbf  ['tuning procedures', …, 'tuning skills', 'tuning techniques']
    dev_cb_sk_7  http://data.europa.eu/esco/skill/d5c20065-1d1f-446b-8143-9d1e180c512b  ['iconography methods', 'iconography']

    q_rels

    q_id         iter  c_id            relevance
    dev_qb_jt_1  0     dev_cb_sk_1034  1
    dev_qb_jt_1  0     dev_cb_sk_1087  1
    dev_qb_jt_1  0     dev_cb_sk_1088  1
    dev_qb_jt_1  0     dev_cb_sk_1099  1
    dev_qb_jt_1  0     dev_cb_sk_1104  1
    dev_qb_jt_1  0     dev_cb_sk_1107  1
    dev_qb_jt_1  0     dev_cb_sk_1110  1
    dev_qb_jt_1  0     dev_cb_sk_1112  1

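The corpus elements file for Task B stores several aliases per skill in a single column. The sketch below shows one way to read it in Python, under two assumptions that are not confirmed by this page: the file is tab-separated with a header row, and `skill_aliases` is serialized as a Python-style list literal with straight quotes. The function name `read_skill_corpus` is hypothetical.

```python
import ast
import csv
import io


def read_skill_corpus(handle):
    """Map c_id -> (esco_uri, list of aliases) from a TSV stream with a header."""
    reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    corpus = {}
    for row in reader:
        # Assumption: skill_aliases is a Python-style list literal,
        # e.g. "['iconography methods', 'iconography']".
        aliases = ast.literal_eval(row["skill_aliases"])
        corpus[row["c_id"]] = (row["esco_uri"], aliases)
    return corpus


# One row from the example, written with tab separators.
URI = "http://data.europa.eu/esco/skill/d5c20065-1d1f-446b-8143-9d1e180c512b"
SAMPLE = (
    "c_id\tesco_uri\tskill_aliases\n"
    f"dev_cb_sk_7\t{URI}\t['iconography methods', 'iconography']\n"
)
corpus = read_skill_corpus(io.StringIO(SAMPLE))
```

`ast.literal_eval` only evaluates literals, so it is a safer choice than `eval` for this kind of column. If the aliases turn out to be stored as JSON instead, `json.loads` would be the natural replacement.
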
  3. Test Set: The test set consists of two files, queries and corpus elements. Participants should generate a q_rels file as their prediction, based on the queries and corpus elements provided.

    • Queries: Contains the following fields:

      • q_id: A unique identifier for the query.
      • jobtitle: The job title used as the query.
    • Corpus Elements: Contains:

      • c_id: A unique identifier for each corpus element.
      • skill: The skill associated with the corpus element.