Datasets

The data will be hosted on the Zenodo platform under the NLP in HR community, following the file structure outlined below. Each time new data is added, an updated version of the dataset will be published on the platform.

Access the Zenodo download page

The dataset on Zenodo is organized into two .zip files, TaskA.zip and TaskB.zip, each containing training, validation, and test folders to support the different stages of model development. Until the full training set is officially released, a sample of the data is available in the sampleset_TaskA.zip and sampleset_TaskB.zip files.
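As a rough illustration of working with these archives, the sketch below extracts one of them and reports which split folders it finds. The function name `extract_task_archive` is hypothetical, and the code assumes the archive has already been downloaded locally. It searches recursively, so it does not depend on how the folders are nested inside the archive.

```python
import zipfile
from pathlib import Path


def extract_task_archive(zip_path, dest_dir):
    """Extract a task archive and return the split folders found inside it.

    Hypothetical helper: the name and return format are illustrative only.
    """
    dest = Path(dest_dir)
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(dest)

    # Look for any folder named after one of the three documented splits,
    # wherever it sits in the extracted tree.
    splits = {"training", "validation", "test"}
    return sorted(
        p.relative_to(dest).as_posix()
        for p in dest.rglob("*")
        if p.is_dir() and p.name in splits
    )
```

A call such as `extract_task_archive("sampleset_TaskA.zip", "data")` would then return the relative paths of the split folders, which is a quick way to confirm that a download is complete.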

TaskA includes language-specific subfolders within the training and validation directories, covering English, Spanish, German, and Chinese job title data. Each training subfolder contains a .tsv file for its language. Each validation subfolder contains three files (queries, corpus_elements, and q_rels) used to evaluate how relevant a model's results are to the search queries. The test folder of TaskA provides queries and corpus_elements files for every language considered. Participants can combine queries and corpus elements from different languages for the cross-lingual evaluation of the task. The data can be found in the TaskA.zip file.

  • 🗜️️ TaskA
      • 📁 training
        • 📁 english
          • 📄 taskA_training_en.tsv
        • 📁 spanish
          • 📄 taskA_training_es.tsv
        • 📁 german
          • 📄 taskA_training_de.tsv
      • 📁 validation
        • 📁 english
          • 📄 queries
          • 📄 corpus_elements
          • 📄 qrels.tsv
        • 📁 spanish
        • 📁 german
        • 📁 chinese
      • 📁 test
        • 📁 english
          • 📄 queries
          • 📄 corpus_elements
        • 📁 spanish
        • 📁 german
        • 📁 chinese
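The layout above can be traversed with a few small helpers. This is a minimal sketch, not an official loader. It assumes the files are tab-separated with a header row and that they are named exactly as shown in the tree (`queries`, `corpus_elements`). Neither the column names nor the file extensions are specified in this section, so adjust both to the released data. The function names are hypothetical.

```python
import csv
from itertools import permutations
from pathlib import Path


def read_tsv(path):
    """Read a tab-separated file into a list of dicts.

    Assumption: the first row is a header; the column names are not
    documented here, so they are taken from the file itself.
    """
    with open(path, newline="", encoding="utf-8") as handle:
        return list(csv.DictReader(handle, delimiter="\t"))


def list_languages(split_dir):
    """Return the language subfolder names found under a split folder."""
    return sorted(p.name for p in Path(split_dir).iterdir() if p.is_dir())


def cross_lingual_pairs(split_dir):
    """Pair each language's queries file with another language's corpus file.

    Assumption: file names match the tree exactly, with no extension.
    """
    root = Path(split_dir)
    return [
        (root / q_lang / "queries", root / c_lang / "corpus_elements")
        for q_lang, c_lang in permutations(list_languages(root), 2)
    ]
```

For a monolingual run, read `queries` and `corpus_elements` from the same language folder. For a cross-lingual run, iterate over `cross_lingual_pairs(...)` and pass each path to `read_tsv`.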

TaskB follows a similar structure but without language-specific subfolders, providing general .tsv files for training, validation, and testing. This consistent file organization enables efficient data access and structured updates as new data versions are published. The data can be found in the TaskB.zip file.

  • 🗜️️ TaskB
      • 📁 training
        • 📄 job2skill.tsv
        • 📄 jobid2terms.json
        • 📄 skillid2terms.json
      • 📁 validation
        • 📄 queries
        • 📄 corpus_elements
        • 📄 q_rels
      • 📁 test
        • 📄 queries
        • 📄 corpus_elements
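The three TaskB training files can be loaded with the standard library alone. The sketch below assumes only that job2skill.tsv is tab-separated and that the two .json files contain valid JSON. The columns and the JSON schema are not described in this section, so the rows and objects are returned as-is. The function name `load_taskb_training` is hypothetical.

```python
import csv
import json
from pathlib import Path


def load_taskb_training(training_dir):
    """Load the three TaskB training files from the given folder.

    Returns (job2skill_rows, job_terms, skill_terms). No header or schema
    is assumed: TSV rows come back as lists of strings, and the JSON
    files as whatever objects they contain.
    """
    root = Path(training_dir)
    with open(root / "job2skill.tsv", newline="", encoding="utf-8") as handle:
        job2skill_rows = list(csv.reader(handle, delimiter="\t"))
    with open(root / "jobid2terms.json", encoding="utf-8") as handle:
        job_terms = json.load(handle)
    with open(root / "skillid2terms.json", encoding="utf-8") as handle:
        skill_terms = json.load(handle)
    return job2skill_rows, job_terms, skill_terms
```

Inspecting the first few rows of `job2skill_rows` and a sample of each JSON object is a reasonable first step before writing any code that depends on their structure.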
    Last modified April 18, 2025: Update data section for test set (f9fa5ac)