Competition Overview |
Passage retrieval is the process of finding the passages most relevant to a user query in a large-scale text corpus. As a key component of many natural language processing tasks, passage retrieval is an important frontier topic in natural language processing and artificial intelligence, and it has received wide attention from both academia and industry in recent years.
In recent years, dense representation learning has made significant progress in passage retrieval, thanks to the emergence of large-scale passage retrieval datasets with high-quality human annotations. However, the lack of a comparable large-scale Chinese retrieval dataset has severely limited research on applying dense retrieval models in Chinese settings. To advance Chinese passage retrieval, we built the first large-scale, high-quality Chinese passage retrieval dataset, DuReader_retrieval, from the logs of real user queries submitted to Baidu Search. To improve the annotation quality of the development and test sets and keep the evaluation reliable, we annotated candidates pooled from an ensemble of multiple retrievers and removed semantically similar questions between the training and test sets. All samples in DuReader_retrieval come from real application scenarios, test a diverse range of abilities, and cover many problems that remain hard to solve in real-world applications.
This shared task is the first to release DuReader_retrieval, the world’s first high-quality Chinese passage retrieval dataset drawn from real search engine application scenarios. It aims to give researchers and developers a platform for academic and technical exchange, to further raise the level of research on Chinese passage retrieval, and to advance technology and applications in natural language processing and artificial intelligence. A technical exchange forum and award ceremony for the competition will be held at the seventh “Language and Intelligence Summit Forum”. We sincerely invite researchers and developers from academia and industry to take part! |