The Dutch Research Council (NWO) has allocated €3.8 million to the GLOBALISE project, which will digitize the historical archive of the Dutch East India Company (VOC).
The Computational Linguistics & Text Mining Lab (CLTL) research group, led by Professor of Computational Lexicology Piek Vossen, will provide the language technology needed to make the VOC's UNESCO-listed archive digitally accessible.
The archive of the Dutch East India Company (VOC), which is part of the UNESCO Memory of the World, is so immense that conducting research in it is currently enormously complex. It comprises roughly twenty-five million pages. “Thanks to the investment of 3.8 million euros from the NWO Large-Scale Infrastructure programme, we can change this,” Vossen says.
A team of data experts led by Vossen will collect and structure all relevant contextual data from the countless publications on the history of the VOC. This involves recognizing the names of persons, organizations, places and events, and then converting them into a so-called knowledge graph that models the world of that time. The team will also look at the perspectives that the Dutch VOC had on those events and persons.
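As a rough illustration of what converting recognized names and events into a knowledge graph can look like, the sketch below stores each event as a set of subject-relation-object triples. It is a minimal sketch and not the project's actual data model: the function names `build_graph` and `about`, the record layout, and the placeholder identifiers (`ship:ExampleShip`, `place:ExamplePort`) are all invented for this example.

```python
# Minimal sketch: a knowledge graph as a set of (subject, relation, object) triples.
# The record layout and all identifiers below are hypothetical placeholders,
# not data from the VOC archive.

def build_graph(events):
    """Turn a list of event records into a set of triples."""
    graph = set()
    for event in events:
        event_id = event["id"]
        # Record what kind of event this is (trade, disaster, ...).
        graph.add((event_id, "type", event["type"]))
        # Link the event to every entity that takes part in it.
        for role, entity in event["participants"].items():
            graph.add((event_id, role, entity))
    return graph


def about(graph, entity):
    """Return all triples that mention the given entity, in sorted order."""
    return sorted(t for t in graph if entity in (t[0], t[2]))


events = [
    {
        "id": "event:1",
        "type": "trade",
        "participants": {"ship": "ship:ExampleShip", "place": "place:ExamplePort"},
    },
    {
        "id": "event:2",
        "type": "disaster",
        "participants": {"ship": "ship:ExampleShip"},
    },
]

graph = build_graph(events)
for triple in about(graph, "ship:ExampleShip"):
    print(triple)
```

Once events are stored this way, a question such as "in which events was this ship involved?" becomes a simple lookup over the triples.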
The technology used to digitize the VOC archive builds on the “reading machines” that Vossen previously developed in Biographynet, Newsreader and, most recently, CLARIAH-PLUS. “This is software trained with AI that recognizes expressions as names using examples labeled by historians. The software then looks up those names in a database and checks who or what it concerns. A name can refer not only to a person, but also to a place or a ship. Furthermore, many persons have the same name, so who is it in this text? The same applies to events. The AI software learns from examples to recognize the events in which those people, ships and places are involved from the words and phrases in which they are mentioned. Think of conflicts, disasters, harvests, trade, batches of goods, prices, et cetera. This concerns old Dutch, so existing software must be adapted for these texts,” according to Vossen.
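The lookup step Vossen describes, deciding whether a name refers to a person, a place or a ship, can be sketched in a few lines. The software described above is trained on examples labeled by historians; the sketch below swaps that for a much simpler keyword-overlap heuristic, purely to show the shape of the problem. The `CANDIDATES` table, the cue words, the name "Voorbeeld" (Dutch for "example") and the function `link` are all invented for this illustration.

```python
import re

# Hypothetical lookup table: one name, two possible referents.
# The name, identifiers and cue words are invented for illustration.
CANDIDATES = {
    "Voorbeeld": [
        {"id": "ship:Voorbeeld", "cues": {"sailed", "cargo", "captain", "wrecked"}},
        {"id": "place:Voorbeeld", "cues": {"fort", "garrison", "harbour", "built"}},
    ],
}


def tokenize(text):
    """Lowercase the text and return its set of alphabetic words."""
    return set(re.findall(r"[a-z]+", text.lower()))


def link(name, sentence):
    """Pick the candidate whose cue words overlap most with the sentence.

    Returns None when the name is unknown or no cue word matches.
    """
    candidates = CANDIDATES.get(name, [])
    if not candidates:
        return None
    words = tokenize(sentence)
    best = max(candidates, key=lambda c: len(c["cues"] & words))
    if not best["cues"] & words:
        return None  # no evidence either way, so do not guess
    return best["id"]


print(link("Voorbeeld", "The Voorbeeld sailed with a cargo of pepper"))
print(link("Voorbeeld", "A garrison was stationed at fort Voorbeeld"))
```

A trained model would replace the hand-written cue words with patterns learned from the labeled examples, but the interface stays the same: a name and its context go in, and an identifier for the intended person, place or ship comes out.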
GLOBALISE project
The GLOBALISE project is a consortium consisting of Huygens ING, Vrije Universiteit Amsterdam, the National Archives of the Netherlands, the International Institute of Social History and the KNAW Humanities Cluster. The project, which runs from 2021 to 2026, aims to create a digital scientific infrastructure that will open up the most important series of VOC reports for advanced new research methods.