Data publication platform of Vrije Universiteit Amsterdam

Arnoult, Sophie

VOC GM NER corpus

2022-12-13T12:21:01.601203 Open - freely retrievable

Corpus and training data for Named-Entity Recognition from the VOC General Letters. The corpus consists of a selection of letters from the Generale Missiven, a subset of the Overgebleven Brieven en Papieren corpus of the United East India Company (VOC). These letters were reports sent by governors-general and administrators of the VOC to the board, from locations where the VOC was active (Indonesia and other parts of Asia, as well as South Africa). The letters in the current corpus were edited and digitized by the Huygens Institute of Netherlands History between 1960 and 2007 as part of the Rijks Geschiedkundige Publicatiën (RGP) series. In this edition, the letters were partly transcribed and partly summarized.

The data in the current package consist of a selection of these letters, spread over time, with the original text and the modern additions (notes and passage summaries) extracted into separate documents to allow for training on either the historical text or the modern additions. The entities identified in the data are persons, locations, organisations (mainly the VOC itself) and ships. These are complemented by forms derived from location or religion names.

The 'corpus' folder contains files for the historical text and modern notes of each letter, in CoNLL 2002 format, taking paragraphs or separate notes as units for segmentation. The 'datasplit_all_standard' folder contains training, validation and test data for the 'standard' NER experiment on all the data referred to in the companion publication, splitting sequences longer than 256 subtokens.

For more information, see Arnoult et al., 2021. Batavia asked for advice. Pretrained language models for Named Entity Recognition in historical texts. In Proceedings of the 5th Joint SIGHUM Workshop on Computational Linguistics for Cultural Heritage, Social Sciences, Humanities and Literature.

Code, intermediate data and more information on the collection process can be found on the cltl/voc-missives package on Zenodo and GitHub.
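As a rough illustration of how files in CoNLL 2002 format might be read and how long sequences might be split, here is a minimal Python sketch. It is not the code from the cltl/voc-missives package. The column layout (token first, tag last), the tag names in the sample, and the helpers `read_conll`, `split_long` and `count_subtokens` are assumptions made for this example. A real subtoken count would come from the tokenizer of the language model being used.

```python
from typing import Callable, Iterable, Iterator, List, Tuple

TaggedSequence = List[Tuple[str, str]]  # (token, tag) pairs


def read_conll(lines: Iterable[str]) -> Iterator[TaggedSequence]:
    """Yield one tagged sequence per blank-line-separated block."""
    current: TaggedSequence = []
    for raw in lines:
        line = raw.strip()
        if not line or line.startswith("-DOCSTART-"):
            if current:
                yield current
                current = []
            continue
        fields = line.split()
        # Assumption: the token is the first column and the tag the last.
        current.append((fields[0], fields[-1]))
    if current:
        yield current


def split_long(
    sequence: TaggedSequence,
    count_subtokens: Callable[[str], int],
    max_len: int = 256,
) -> Iterator[TaggedSequence]:
    """Greedily cut a sequence into chunks of at most max_len subtokens.

    Note: this naive cut can fall inside an entity; a real pipeline
    may prefer to cut at entity or sentence boundaries.
    """
    chunk: TaggedSequence = []
    length = 0
    for token, tag in sequence:
        n = count_subtokens(token)
        if chunk and length + n > max_len:
            yield chunk
            chunk, length = [], 0
        chunk.append((token, tag))
        length += n
    if chunk:
        yield chunk


# Hypothetical sample; the tag names are illustrative only.
sample = """Batavia B-LOC
heeft O
geschreven O

De O
Compagnie B-ORG
""".splitlines()

sequences = list(read_conll(sample))
# Stand-in counter that treats every token as one subtoken.
chunks = list(split_long(sequences[0], lambda token: 1, max_len=2))
```

To read a file from the 'corpus' folder, an open file handle can be passed to `read_conll` in place of `sample`.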

NER, Digital Humanities, Early modern Dutch
