BIB-R Datasets: Overview

BIB-R is a benchmark for the interpretation of bibliographic records. It provides two datasets (T42 and BIB-RCAT) dedicated to the evaluation of the FRBRization process. The goal of T42 is to identify the weak and strong points of a tool by testing all possible issues that libraries may face during FRBRization. The second dataset, BIB-RCAT, is extracted from the catalogs of three different cultural institutions and can be used to compare tools or to experiment with the data quality and data volume typically found in real-world catalogs. The expected FRBR results (gold standard) are included in both datasets to enable evaluation. The MARC catalogs are provided in MARC/XML format, while the FRBR collections are available in RDF/XML (generated by the Jena API).

If you use this work, please cite this paper: J. Decourselle et al.: BIB-R: A Benchmark for the Interpretation of Bibliographic Records. TPDL 2016, Hannover, Germany. [Bibtex], [PDF]

@inproceedings{DecourselleTPDL16,
  author    = {Joffrey Decourselle and
               Fabien Duchateau and
               Trond Aalberg and
               Naimdjon Takhirov and
               Nicolas Lumineau},
  title     = {{BIB-R:} {A} Benchmark for the Interpretation of Bibliographic Records},
  booktitle = {Research and Advanced Technology for Digital Libraries - 20th International
               Conference on Theory and Practice of Digital Libraries, {TPDL} 2016,
               Hannover, Germany, September 5-9, 2016, Proceedings},
  pages     = {163--174},
  year      = {2016}
}

T42

T42 has been built for benchmarking FRBRization solutions that deal with MARC records. The dataset is composed of 42 different tests, each of which relates to a specific FRBR pattern representation and can include specific issues. We provide the records in both UNIMARC and MARC21. Every record comes from a real-world library catalog and has been adapted to the tests. The original records were transformed in several steps, first in an automated way and then with a manual validation.
Browse the dataset T42 on GitHub

BIB-RCAT

BIB-RCAT is a dataset of MARC21 records accompanied by a FRBR gold version to evaluate a FRBRization solution. The collection is larger than T42, and its records come from real-world library catalogs.
Browse the dataset BIB-RCAT on GitHub
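
Because both datasets ship with a gold version, the output of a FRBRization tool can be compared against it. The sketch below is only an illustration of that idea, using generic set-based precision, recall and F1 over triples. It is not the set of metrics defined by BIB-R, and it assumes that the tool's output and the gold file use identical identifiers, so that triples can be compared by plain equality. The function name precision_recall_f1 and the "ex:" identifiers are hypothetical.

```python
def precision_recall_f1(predicted, gold):
    """Compare two collections of (subject, predicate, object) triples.

    Generic set-based scores, assuming triples from both sides can be
    compared by plain equality (an assumption of this sketch).
    """
    predicted, gold = set(predicted), set(gold)
    correct = len(predicted & gold)  # triples found in both sets
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    if precision + recall == 0:
        f1 = 0.0
    else:
        f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical triples written for this sketch (not taken from the datasets).
gold_triples = [
    ("ex:work1", "ex:title", "Pride and prejudice"),
    ("ex:work1", "ex:creator", "ex:person1"),
    ("ex:expr1", "ex:realizes", "ex:work1"),
    ("ex:manif1", "ex:embodies", "ex:expr1"),
]
tool_triples = [
    ("ex:work1", "ex:title", "Pride and prejudice"),
    ("ex:work1", "ex:creator", "ex:person1"),
    ("ex:manif1", "ex:embodies", "ex:work1"),  # wrong link, counted as an error
]

precision, recall, f1 = precision_recall_f1(tool_triples, gold_triples)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

In practice the identifiers produced by a tool rarely match those of the gold file, so an alignment step would be needed before such a comparison is meaningful.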

Getting Started

All MARC records are provided in MARC/XML format. For Java applications, they can be parsed with the MARC4J API. The FRBR files are provided in RDF/XML format and have been generated using the Jena API, so they can be parsed back by Jena to get the triples as POJOs. The concepts used in the FRBR gold files come from the standard vocabularies RDA and FRBRer. We also provide a mapping file in RDF/XML in which each concept used in the datasets can be mapped to another vocabulary. This file was also generated using the Jena API and can be parsed in the same way.
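
For readers who do not work in Java, MARC/XML can also be read with any XML parser. The following is a minimal sketch in Python using only the standard library. It assumes that the files use the MARC21 "slim" namespace (http://www.loc.gov/MARC21/slim). The sample record and the helper read_records are hypothetical and written for this illustration; they are not taken from the datasets.

```python
import xml.etree.ElementTree as ET

# Namespace of the MARC21 "slim" XML schema (assumed to be the one used by the files).
MARC_URI = "http://www.loc.gov/MARC21/slim"
MARC_NS = {"marc": MARC_URI}

# Hypothetical record, written by hand for this sketch.
SAMPLE = """<collection xmlns="http://www.loc.gov/MARC21/slim">
  <record>
    <leader>00000nam a2200000 a 4500</leader>
    <controlfield tag="001">demo-0001</controlfield>
    <datafield tag="100" ind1="1" ind2=" ">
      <subfield code="a">Austen, Jane</subfield>
    </datafield>
    <datafield tag="245" ind1="1" ind2="0">
      <subfield code="a">Pride and prejudice</subfield>
    </datafield>
  </record>
</collection>
"""


def read_records(xml_text):
    """Return one dict per <record> with its leader, control fields and data fields."""
    root = ET.fromstring(xml_text)
    records = []
    # iter() also matches the root itself, so a file holding a single
    # <record> without a <collection> wrapper is handled as well.
    for rec in root.iter("{%s}record" % MARC_URI):
        control = {
            cf.get("tag"): cf.text
            for cf in rec.findall("marc:controlfield", MARC_NS)
        }
        data = []
        for df in rec.findall("marc:datafield", MARC_NS):
            subfields = [
                (sf.get("code"), sf.text)
                for sf in df.findall("marc:subfield", MARC_NS)
            ]
            data.append((df.get("tag"), subfields))
        records.append({
            "leader": rec.findtext("marc:leader", namespaces=MARC_NS),
            "control": control,
            "data": data,
        })
    return records


for record in read_records(SAMPLE):
    print(record["control"].get("001"), record["data"])
```

To read a downloaded file instead of the inline sample, its content can be passed to read_records after opening it in the usual way. Indicators (ind1, ind2) are ignored here to keep the sketch short.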

Additional resources

Additional resources are listed below:

The mappings file: bib-r.github.io/mappings.xml
The results of the different experiments done with BIB-R and three recent FRBRization tools: bib-r.github.io/experiments.pdf
The list of specification metrics: bib-r.github.io/specifications-metrics.txt

Publications

J. Decourselle et al.: BIB-R: A Benchmark for the Interpretation of Bibliographic Records. TPDL 2016, Hannover, Germany. [Bibtex], [PDF]

J. Decourselle et al.: Open Datasets for Evaluating the Interpretation of Bibliographic Records. JCDL 2016, Newark, NJ, USA. [Bibtex], [PDF]

@inproceedings{DecourselleJCDL16,
  author    = {Joffrey Decourselle and
               Fabien Duchateau and
               Trond Aalberg and
               Naimdjon Takhirov and
               Nicolas Lumineau},
  title     = {Open Datasets for Evaluating the Interpretation of Bibliographic Records},
  booktitle = {Proceedings of the 16th {ACM/IEEE-CS} on Joint Conference on Digital
               Libraries, {JCDL} 2016, Newark, NJ, USA, June 19 - 23, 2016},
  pages     = {253--254},
  year      = {2016}
}

Licence

These datasets are released under a CC BY-NC licence.

Acknowledgments

This work has been partially supported by the French agency ANRT (www.anrt.asso.fr), the company PROGILONE (www.progilone.com/), PHC Aurora funding (#34047VH) and CNRS PICS funding (#PICS06945).