Periodic Reporting for period 2 - READ (Recognition and Enrichment of Archival Documents)

Summary

The history of Europe is preserved in its archives. Thousands of shelf-kilometres containing billions of documents provide a true picture of the everyday life (and struggles) of European citizens from the Middle Ages until today. However, these treasures are hard to access: even after digitising a complete archive, searching millions of pages for a specific word or phrase was not possible. This situation has changed dramatically. With the technology developed in the H2020 project READ (Recognition and Enrichment of Archival Documents), access to historical collections from archives and libraries is being revolutionised. The main input comes from cutting-edge research in Pattern Recognition, Computer Vision, Natural Language Processing and Digital Humanities. In particular, Handwritten Text Recognition and Keyword Spotting are key technologies in which European universities are at the forefront of research. These technologies are made available via the service platform “Transkribus”. It offers the world’s first implementation of a freely available Handwritten Text Recognition engine, capable of being trained on medieval handwriting found in codices in the same way as on the individual handwriting of famous persons of the 20th century. The main European scripts can be trained and recognised, as well as Hebrew, Arabic and Bangla.
The Virtual Research Environment “Transkribus” aims to provide benefits for all user groups involved in the “eco-system” of historical documents. Archives and libraries, as content holders, get the chance to enrich their documents on a large scale with full-text transcription and search. (Digital) humanities scholars are able to work intensively with historical documents in a sheltered and highly specialised environment. Computer scientists are supported with large-scale datasets and reference data. Finally, the public gains the benefits of access to digital archives. More than 9000 users are now registered on the Transkribus platform, contributing their documents, knowledge and engagement to its further development. Collaboration agreements have been concluded with more than 60 institutions from all over the world.

Work performed

Our work in Y1 and Y2 of the project focused on two main areas:
First, we carried out a range of activities to make the project and the technology known to our four target groups. This started with a three-day conference combining the public kick-off meeting of the project with a convention meeting of the co:op project. More than 150 people from over 20 countries took part in the conference. Videos of the presentations are online and are an important resource for dissemination activities. Reactions to the conference were highly positive and opened the door to many archives and research groups. Dissemination activities continued on several channels: for example, more than 20 workshops were organised by several groups in the project and held in a number of countries (Austria, France, Germany, Netherlands, Finland, Denmark, Norway, Italy, Switzerland, United Kingdom, Spain). Hundreds of people took part in these workshops and became familiar with the expert tool of the Transkribus platform.
Based on the overwhelming interest of archives and research groups in the project, we were able to conclude more than 60 Memoranda of Understanding (MoUs) with institutions. These MoUs provide an excellent framework for cooperation. The institutions include the Hessian State Archive (Germany), the Archivio Storico Ricordi (Italy), the Huygens Institute for the History of the Netherlands (Netherlands), the Alfred Escher Foundation (Switzerland) and the Linnean Society (United Kingdom), to mention just a few. As a result, we can now count more than 9000 registered users on the Transkribus platform, representing archivists, librarians, researchers, scholars and public users (family historians) from all over Europe and abroad.
Our second focus was the implementation of the Transkribus platform, integrating a number of tools developed by the research groups in the project. Special attention was given to defining interfaces and data exchange formats, to setting up application servers for easy deployment of the individual tools (which run on different operating systems and are written in different programming languages), and to tackling the challenge of storing and processing millions of image files. As a highlight of Y1, the award-winning Handwritten Text Recognition engine from the CITlab team of the University of Rostock was implemented in the Transkribus platform. In Y2 major progress was made, so that today Transkribus is able to offer the complete workflow for a text recognition project, including the training of neural networks as well as keyword spotting.

Final results

After two years of work we can report several breakthroughs achieved during the project. (1) Significant drop in error rates for handwritten text recognition. Compared to baseline results from the start of the project, an improvement of 50% or more was achieved by the leading teams in this domain, such as PRHLT from the Technical University of Valencia and the CITlab team from the University of Rostock. (2) A real breakthrough took place in the layout analysis domain. Here it turned out that creating a large set of training data for a scientific competition, combined with the incorporation of machine learning methods, led to a dramatic improvement of this underlying technology. The best results from the ICDAR 2017 cBAD competition have even been exceeded by the teams of the READ project. (3) Major progress was made in making the technology available to archives, libraries, humanities scholars and family historians via the Transkribus platform. The whole workflow of a text recognition project is covered and can be used by any registered user. (4) Significant progress was also made in terms of innovation. A specific device (ScanTent) and an app (DocScan) were created as prototype applications and will be marketed in Y3 of the project.

Website & more info

More info: http://read.transkribus.eu/.