BIB-R is a benchmark for the interpretation of bibliographic records. It provides two datasets (T42 and BIB-RCAT) dedicated to the evaluation of the FRBRization process. The goal of T42 is to identify the weak and strong points of a tool by testing the possible issues that libraries may face during FRBRization. The second dataset, BIB-RCAT, is extracted from the catalogs of three different cultural institutions and can be used to compare or experiment with the data quality and data volume typically found in real-world catalogs. The expected FRBR results (gold standard) are included in these datasets to enable evaluation. The MARC catalogs are provided in MARC/XML format, while the FRBR collections are available in RDF/XML (generated by the Jena API).
If you use this work, please cite this paper: J. Decourselle et al.: BIB-R: A Benchmark for the Interpretation of Bibliographic Records. TPDL 2016, Hannover, Germany. [Bibtex], [PDF]
@inproceedings{DecourselleTPDL16, author = {Joffrey Decourselle and Fabien Duchateau and Trond Aalberg and Naimdjon Takhirov and Nicolas Lumineau}, title = {{BIB-R:} {A} Benchmark for the Interpretation of Bibliographic Records}, booktitle = {Research and Advanced Technology for Digital Libraries - 20th International Conference on Theory and Practice of Digital Libraries, {TPDL} 2016, Hannover, Germany, September 5-9, 2016, Proceedings}, pages = {163--174}, year = {2016} }
T42 has been built for benchmarking FRBRization solutions
which deal with MARC records.
The dataset is composed of 42 different tests, each of which relates to a specific FRBR
pattern representation and may include specific issues.
We provide records in both UNIMARC and MARC21. Every record comes from a real-world library catalog and has been adapted to the tests. The original records were transformed in several steps, first automatically and then with manual validation.
More details.
Browse the dataset T42 on GitHub
BIB-RCAT is a dataset of MARC21 records accompanied by a
FRBR gold version to evaluate a FRBRization solution. The collection is larger than T42, and its records come from real-world library catalogs.
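One simple way to score a tool against a gold version is to compare sets of triples. The sketch below is only an illustration under strong assumptions: it treats each side as a set of (subject, predicate, object) tuples that use identical identifiers, which real tool output generally will not do without an alignment step. The function name, the metrics, and the sample identifiers are hypothetical and are not taken from the BIB-R datasets or papers.

```python
def precision_recall_f1(predicted, gold):
    """Score predicted triples against gold triples by exact set overlap."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    denominator = precision + recall
    f1 = 2 * precision * recall / denominator if denominator else 0.0
    return precision, recall, f1


# Invented triples with placeholder labels (not real RDA or FRBRer terms).
gold = {
    ("w1", "type", "Work"),
    ("e1", "type", "Expression"),
    ("e1", "realizes", "w1"),
}
predicted = {
    ("w1", "type", "Work"),
    ("e1", "type", "Expression"),
    ("e1", "realizes", "w2"),  # wrong link: counts as one miss and one error
}

precision, recall, f1 = precision_recall_f1(predicted, gold)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

In practice the triples would be loaded from the RDF/XML files with an RDF library, and generated entity identifiers would have to be aligned with the gold ones before an exact comparison like this becomes meaningful.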
Browse the dataset BIB-RCAT on GitHub
All MARC records are provided in MARC/XML format. For Java applications, they can be parsed with the MARC4J API. The FRBR files are provided in RDF/XML format. They have been generated using the Jena API, so they can be parsed back by Jena to obtain the triples as POJOs. The concepts in the FRBR gold files use the standard vocabularies RDA and FRBRer. We also provide a mapping file in RDF/XML in which each concept used in the datasets is mapped to another vocabulary. This file was also generated using the Jena API and can be parsed in the same way.
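Outside Java, MARC/XML can be read with any XML parser, since it is plain XML in the MARC21 slim namespace. The following is a minimal sketch using only the Python standard library, as an alternative to MARC4J. The sample record is invented for illustration and is not taken from BIB-R, the helper name `read_records` is hypothetical, and the code assumes the root element is a `collection` containing `record` elements.

```python
import xml.etree.ElementTree as ET

# Namespace of the MARC21 "slim" XML schema used by MARC/XML files.
MARC_NS = {"marc": "http://www.loc.gov/MARC21/slim"}

# Tiny inline sample (invented for illustration, not a BIB-R record).
SAMPLE = """<collection xmlns="http://www.loc.gov/MARC21/slim">
  <record>
    <leader>00000nam a2200000 a 4500</leader>
    <controlfield tag="001">demo-1</controlfield>
    <datafield tag="245" ind1="1" ind2="0">
      <subfield code="a">An example title</subfield>
      <subfield code="c">An Example Author</subfield>
    </datafield>
  </record>
</collection>"""


def read_records(root):
    """Yield each record as a dict of control fields and data fields."""
    for rec in root.iterfind("marc:record", MARC_NS):
        control = {
            cf.get("tag"): (cf.text or "")
            for cf in rec.iterfind("marc:controlfield", MARC_NS)
        }
        data = [
            (
                df.get("tag"),
                [(sf.get("code"), sf.text or "")
                 for sf in df.iterfind("marc:subfield", MARC_NS)],
            )
            for df in rec.iterfind("marc:datafield", MARC_NS)
        ]
        yield {"control": control, "data": data}


# For a file on disk, use ET.parse(path).getroot() instead of fromstring.
records = list(read_records(ET.fromstring(SAMPLE)))

# In MARC21, field 245 subfield $a holds the title; UNIMARC uses other tags.
tag, subfields = records[0]["data"][0]
title = dict(subfields)["a"]
print(records[0]["control"]["001"], tag, title)
```

Data fields are kept as an ordered list rather than a dict because MARC tags and subfield codes can repeat within a record.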
Additional resources are listed below:
@inproceedings{DecourselleJCDL16, author = {Joffrey Decourselle and Fabien Duchateau and Trond Aalberg and Naimdjon Takhirov and Nicolas Lumineau}, title = {Open Datasets for Evaluating the Interpretation of Bibliographic Records}, booktitle = {Proceedings of the 16th {ACM/IEEE-CS} on Joint Conference on Digital Libraries, {JCDL} 2016, Newark, NJ, USA, June 19 - 23, 2016}, pages = {253--254}, year = {2016} }
These datasets are released under a CC BY-NC licence.
This work has been partially supported by the French Agency ANRT (www.anrt.asso.fr), the company PROGILONE (www.progilone.com/), a PHC Aurora funding (#34047VH) and a CNRS PICS funding (#PICS06945).