
Report

Teaser, summary, work performed and final results

Periodic Reporting for period 1 - NewsEye (NewsEye: A Digital Investigator for Historical Newspapers)

Teaser

Newspapers collect information about cultural, political and social events in a more detailed way than any other public record. Since their beginnings in the 17th century, they have recorded billions of events, stories and names, in almost every language, every country and...

Summary

Newspapers collect information about cultural, political and social events in a more detailed way than any other public record. Since their beginnings in the 17th century, they have recorded billions of events, stories and names, in almost every language, every country and every day. Newspapers have always been an important medium for the dissemination of public and political opinions, literary works, essays and art. This thematic wealth places them at center stage for anyone interested in European cultural heritage.

In recent decades, tens of millions of newspaper pages from European libraries have been digitized and made available online, and national libraries will intensify their digitization efforts in the coming years. There is strong demand for access to historical newspapers. At this very moment, probably thousands of European citizens are accessing digitized versions of historical newspapers through digital library services. While the broad public shows a general interest in this historical and cultural resource, it is of crucial importance to many humanities scholars.

The NewsEye project involves national libraries, humanities and social science research groups, and computer science research groups. It addresses a number of challenges whose resolution is expected to yield significant scientific advances in several directions:
- in text recognition, text analysis, natural language processing, computational creativity and natural language generation, with regard to historical newspapers but also more universally,
- in digital newspaper research, addressing a number of editorial issues like OCR and article separation,
- in digital humanities, with respect to huge amounts of text material, availability of useful tools and possibilities of searching and browsing,
- in history, in terms of analyzing historical assets with new methods across different language corpora.

Work performed

During its first 12 months of existence, NewsEye has reached two important milestones.

The first 6 months paved the way for the remainder of the project and validated its initial step, allowing the process of incremental and iterative feedback and improvements to begin. According to plan, these first 6 months led to the first data sets being made available and accessible within the project, thus allowing partners to work on real project data. This first milestone was identified internally as the "data" milestone.

The end of the first year led to state-of-the-art versions of all the project's tools, which from now until the end of the project are accessible to the digital humanities (DH) partners for experiments and feedback. The produced outputs allowed:
- updating data models, data collection and preservation and describing principles for data generation;
- producing first versions for layout analysis, automatic text recognition, article separation, named-entity (NE) recognition and linking, stance detection, analysis of content in a given context and the personal research assistant’s investigator and reporter;
- constructing an exhaustive analysis of existing tools for the contextualised case studies for academic use;
- finalising the tool for querying the subsequently enriched data sets now integrated into the NewsEye demonstrator (see below);
- extending and updating the project’s external and internal communication, dissemination and exploitation, with the launch of the project website and, last but not least, the first release of the NewsEye demonstrator, the central access point for all users.
According to plan, these first 12 months of work produced first usable versions of all tools, now accessible through the NewsEye demonstrator. This meets the criteria for the milestone internally defined as “Prototype” and allows the timely expansion of collaboration and the development of research contributions and innovations.

At the end of year 1, NewsEye is on track to reach its end-of-project targets.

Final results

Building on one of the largest and most significant digital collections of cultural heritage in Europe, NewsEye has the core objective of delivering innovative tools and services that will significantly improve the way historical newspapers can be accessed, explored and analyzed, ensuring widespread use and broad impact. The project aims to create a valuable, inexpensive, and immediately useful NewsEye toolbox for assisting users of all types. The toolbox will be composed of four main layers, each providing advanced techniques and tools for:
- Text Recognition and Article Separation, aiming to extract the layout of newspapers (e.g., articles and graphical regions) from digitized newspapers and to transform the content to textual format, providing full articles through automatic layout analysis, text recognition and article separation.
- Semantic Text Enrichment, aiming to enhance the utility of the newspaper collections by enriching the texts with higher-level semantic annotation using named-entity recognition. Extracted named entities will be linked to external references (such as Wikipedia) across languages, with the goal of supporting multilingual analysis. This layer will also provide keyword and event detection to support pattern discovery in textual content.
- Dynamic Text Analysis, aiming to provide tools that exploit the enriched data for more elaborate analysis of user-selected newspaper content, supporting interactive queries to discover different viewpoints, sub-topics or trends concerning the selected topic, named entity, newspaper, timeframe or other category, so as to provide insights into the newspaper collection in a contextualized and comparative manner.
- Intelligent analysis and reporting (“Personal Research Assistant”), aiming to provide an alternative, “intelligent” interface to the other tools and the data, carrying out iterative cycles of analysis and reporting to the user in natural language. The user should be able to authorize the Personal Research Assistant to investigate a given topic (or time window, newspaper, etc.) on the user’s behalf, and the Assistant will report back on findings which it assesses as potentially interesting for the user, together with a rationale for how they were found and why they might be interesting, all in natural language and in a transparent manner so the findings can be understood and verified by the user. Given the European context, we will be able not only to analyze newspapers written in multiple languages but also to report on the findings in multiple languages; to this end, the Assistant will use multilingual natural language generation (NLG) to produce textual descriptions of the results obtained by the Investigator. In NewsEye, special focus will be on French, German, Finnish and Swedish (as in the newspaper collections), and English as the common project language.

The NewsEye consortium further involves experts whose role is to provide (i) additional technical expertise in the above-mentioned aspects, (ii) access to and enrichment of digitized newspapers, (iii) insight and experience in using historical newspapers as a rich cultural heritage resource for understanding developments in society, economy and politics, (iv) use cases that address important humanities research desiderata and yield experience and feedback to guide iterative development of the NewsEye toolbox, and (v) strong dissemination and viable paths towards wider adoption and sustainability of the developed tools.

Website & more info

More info: https://www.newseye.eu/.