beir/nfcorpus

1. Usage

1.1. Iterating over all data

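from hamu_tool.dataset import DataLoader

# Load the dataset by its identifier.
loader = DataLoader.load('beir/nfcorpus')

# Print the first document, then stop.
for doc in loader.get_docs():
    print(doc.id, doc.text, doc.title, doc.url)
    break

# Print the first query, then stop.
for query in loader.get_queries():
    print(query.id, query.text, query.url)
    break

# Replace '[mode]' with one of: test, dev, train.
for qrel in loader.get_qrels('[mode]'):
    print(qrel.qid, qrel.did, qrel.score)
    break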
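
A common next step after iterating over qrels is to group them by query, for example before passing them to an evaluation script. The sketch below relies only on the `qid`, `did`, and `score` attributes documented on this page. The `Qrel` named tuple, the `group_qrels` helper, and the sample values are hypothetical stand-ins so the snippet runs without `hamu_tool` installed. With the real loader, the iterable would presumably be `loader.get_qrels('test')` or another split.

```python
from collections import defaultdict, namedtuple

# Hypothetical stand-in for the records yielded by loader.get_qrels(mode);
# only the documented attributes (qid, did, score) are assumed.
Qrel = namedtuple("Qrel", ["qid", "did", "score"])


def group_qrels(qrels):
    """Build {qid: {did: score}} from an iterable of qrel records."""
    grouped = defaultdict(dict)
    for qrel in qrels:
        grouped[qrel.qid][qrel.did] = qrel.score
    return dict(grouped)


# Made-up sample records, for illustration only.
sample = [
    Qrel("q1", "d1", 2),
    Qrel("q1", "d2", 1),
    Qrel("q2", "d3", 1),
]
grouped = group_qrels(sample)
print(grouped)
```

The nested dictionary keeps every judged document for a query in one place, so a relevance score can be looked up with `grouped[qid][did]`.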

1.2. Accessing individual records

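from hamu_tool.dataset import DataLoader

loader = DataLoader.load('beir/nfcorpus')

# Replace '[did]' with a document ID.
doc = loader.get_doc('[did]')
print(doc)

# Replace '[qid]' with a query ID.
query = loader.get_query('[qid]')
print(query)

# Replace '[mode]' with one of: test, dev, train.
qrel = loader.get_qrel('[mode]', '[qid]')
print(qrel)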

2. Attributes

2.1. doc

Attribute	Type
id	str
text	str
title	str
url	str

2.2. query

Attribute	Type
id	str
text	str
url	str

2.3. qrel

Attribute	Type
qid	str
did	str
score	int

[mode]: test, dev, train
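
The attribute tables can be mirrored as typed records, which is handy when writing code that consumes the dataset without importing the loader. This is a minimal sketch, not the library's actual classes: `Doc`, `Query`, `Qrel`, `VALID_MODES`, and `check_mode` are hypothetical names chosen for illustration, and only the attribute names, types, and split names come from this page.

```python
from dataclasses import dataclass

# Split names as listed for [mode] on this page.
VALID_MODES = ("test", "dev", "train")


@dataclass(frozen=True)
class Doc:
    id: str
    text: str
    title: str
    url: str


@dataclass(frozen=True)
class Query:
    id: str
    text: str
    url: str


@dataclass(frozen=True)
class Qrel:
    qid: str    # query ID
    did: str    # document ID
    score: int  # relevance score


def check_mode(mode: str) -> str:
    """Return mode unchanged if it is a known split, else raise ValueError."""
    if mode not in VALID_MODES:
        raise ValueError(f"mode must be one of {VALID_MODES}, got {mode!r}")
    return mode
```

Calling `check_mode` before requesting qrels catches a leftover `'[mode]'` placeholder early instead of failing inside the loader.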

3. Statistics

Metric	Split	Value
Task		Bio-Medical Information Retrieval
Domain		Bio-Medical
# Query		3,237
# Doc		3,633
# Qrel	test	12,334
# Qrel	dev	11,385
# Qrel	train	110,575
Average Rel D/Q	test	3.81
Average Rel D/Q	dev	3.52
Average Rel D/Q	train	34.16
Average Query Length (words)		3.32
Average Doc Length (words)		220.98
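
The page does not say how the Average Rel D/Q values are computed. The figures are, however, consistent with dividing each split's qrel count by the total query count of 3,237 rather than by a per-split query count. The check below treats that formula as an assumption and uses only numbers copied from the table.

```python
# Values copied from the statistics table.
num_queries = 3237
num_qrels = {"test": 12334, "dev": 11385, "train": 110575}

# Assumption: Average Rel D/Q = qrels in the split / total number of queries.
avg_rel = {split: round(n / num_queries, 2) for split, n in num_qrels.items()}
print(avg_rel)
```

Under this assumption the rounded results match the table's 3.81, 3.52, and 34.16. Per-split query counts are not listed here, so averages over only the queries in each split cannot be derived from this page.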

4. Citation

@inproceedings{Boteva2016Nfcorpus,
  title = "A Full-Text Learning to Rank Dataset for Medical Information Retrieval",
  author = "Vera Boteva and Demian Gholipour and Artem Sokolov and Stefan Riezler",
  booktitle = "Proceedings of the European Conference on Information Retrieval ({ECIR})",
  location = "Padova, Italy",
  publisher = "Springer",
  year = 2016
}
@article{Thakur2021Beir,
  title = "BEIR: A Heterogenous Benchmark for Zero-shot Evaluation of Information Retrieval Models",
  author = "Thakur, Nandan and Reimers, Nils and Rücklé, Andreas and Srivastava, Abhishek and Gurevych, Iryna", 
  journal = "arXiv preprint arXiv:2104.08663",
  month = "4",
  year = "2021",
  url = "https://arxiv.org/abs/2104.08663",
}

5. Source

6. License

CC BY-SA 4.0