Report

Teaser, summary, work performed and final results

Periodic Reporting for period 1 - KOPAR (Knowledge Publishing, Acquisition and Representation)

Teaser

Scientific publications aggregate data by embedding it within a persuasive narrative. This aggregation is federated: authors reference external sources, analyse data elsewhere and summarize everything in the document. The KOPAR project addresses the problem of...

Summary

Scientific publications aggregate data by embedding it within a persuasive narrative. This aggregation is federated: authors reference external sources, analyse data elsewhere and summarize everything in the document. The KOPAR project addresses the problem of supporting such aggregation in a document that is born semantic, that is, a document whose entities are characterized so that machines can process them meaningfully. Conceptual elements such as Research Objects, nanopublications, references, experimental protocols and data are to be logically assembled within the document as self-describing, machine-processable elements.

The prototype illustrating how this research is put into practice was developed in collaboration with ZPID (the psychology information center for the German-speaking countries). It delivers a simple framework through which functional features over characterized entities are made available to the end user. For instance, authors are recognized by software, and information about them from other sources, e.g. CrossRef, is aggregated and presented to the end user, enhancing the reading experience and making extensive use of resources available on the web. Recognized entities in the content may in turn be used by other resources on the web (bidirectional reuse). The KOPAR approach is simple: it enables documents to be born semantic and interoperable, so that they can reuse external resources, be reused themselves, and offer enhanced functionality over meaningful entities in the content.

An inquiring society understands web content as interactive content; the experience laypeople have with web content is rarely static. Web content is also interoperable, a property that web developers understand and use extensively. These two characteristics are not found in scientific content, although such content is interactive in nature: data can be presented in many ways, and it can be enhanced with the many information sources readily available on the web. Interoperable content helps researchers tailor functional features to make a case that is valid across several audiences. For instance, a graphic that can easily be customized may be used to explain a phenomenon to non-experts. Being able to build on reusable scientific content also lets web developers outside science create functional features for the audience they are targeting. Furthermore, semantics in web content makes it simpler for researchers to find information, because their queries can be more specific.

Work performed

We started by organizing a Dagstuhl meeting, “Digital Scholarship and Open Science in Psychology and Behavioural Sciences”, at which we discussed open science issues specific to these domains. The work continued by addressing some of the identified problems, in particular that of using existing semantic, NLP and data infrastructure in psychology within the framework provided by ZPID. We have built the semantic layer for PsychOpen, an open access publisher in psychology. Our approach uses linked open data (LOD) technology and addresses four important aspects of the publication workflow: queryability, discoverability, interoperability and user experience.

To address the first three aspects we have developed SE4OJS, which generates semantically annotated documents from the JATS/XML that comes from the OJS. We are adding semantics to the PsychOpen content and publishing it as LOD in RDF format; this semantic enrichment adds context to the publication metadata and creates an anchor to the content. The datasets we are building for each article include i) the RDF description of article metadata, e.g. title, keywords, authors and editors; ii) structural information such as sections, section types, paragraphs and in-text citations; and iii) RDF for content and annotations.

To enhance the user experience we are bringing the semantics to the user interface. We have extended LENS, a JavaScript toolkit for scientific publications; because our RDF content has semantically characterized entities, we have adapted LENS to expand on these entities. This enhancement delivers an experience similar to that of "apps" on mobile devices: characterized entities are processed by "apps", which deliver an enriched user experience over the LENS interface. We have also added annotation capabilities to LENS by adapting hypothes.is.
This annotation toolkit was also extended to support real-time annotation with ontologies, thus facilitating conceptual navigation of the paper. Within a single environment we support annotation of the PDF as well as of the corresponding JATS/XML manifestation of the same document. We structure the annotations using the Annotation Ontology. The KOPAR approach makes it possible to build concept-based queries; we are delivering a flexible, reusable and adaptable set of tools for metadata enrichment, semantic processing and enhancement of scientific documents in psychology.
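As an illustration only, the step of turning JATS/XML article metadata into RDF can be sketched as follows. This is not SE4OJS itself: the JATS fragment, the `http://example.org/article/` namespace, the function names and the choice of Dublin Core terms are all assumptions made for the example, since this report does not describe the actual KOPAR data model.

```python
import xml.etree.ElementTree as ET

# Hypothetical JATS fragment; real OJS exports carry many more elements.
JATS = """<article>
  <front>
    <article-meta>
      <article-id pub-id-type="doi">10.0000/example.1</article-id>
      <title-group><article-title>An Example Study</article-title></title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <name><surname>Doe</surname><given-names>Jane</given-names></name>
        </contrib>
      </contrib-group>
      <kwd-group><kwd>memory</kwd><kwd>attention</kwd></kwd-group>
    </article-meta>
  </front>
</article>"""

DCT = "http://purl.org/dc/terms/"      # Dublin Core terms (assumed vocabulary)
BASE = "http://example.org/article/"   # placeholder namespace, not KOPAR's


def literal(text):
    """Render a plain N-Triples literal (escapes backslashes and quotes only)."""
    escaped = text.replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'


def jats_to_ntriples(xml_text):
    """Return N-Triples lines for the title, authors and keywords of one article."""
    meta = ET.fromstring(xml_text).find("front/article-meta")
    doi = meta.findtext("article-id[@pub-id-type='doi']")
    subject = f"<{BASE}{doi}>"
    title = meta.findtext("title-group/article-title")
    triples = [(subject, f"<{DCT}title>", literal(title))]
    for name in meta.iterfind("contrib-group/contrib[@contrib-type='author']/name"):
        full = f"{name.findtext('given-names')} {name.findtext('surname')}"
        triples.append((subject, f"<{DCT}creator>", literal(full)))
    for kwd in meta.iterfind("kwd-group/kwd"):
        triples.append((subject, f"<{DCT}subject>", literal(kwd.text)))
    return [f"{s} {p} {o} ." for s, p, o in triples]


for line in jats_to_ntriples(JATS):
    print(line)
```

Each printed line is one subject-predicate-object statement. A fuller pipeline would presumably also emit the structural information (sections, paragraphs, in-text citations) and the annotations mentioned above, and would link authors and keywords to identifiers rather than plain literals.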

A similar approach was applied to the open access full-content subset of PubMed Central (PMC), the largest digital library in the biomedical domain. A framework for generating RDF for PMC was developed. Use cases illustrating how to make practical use of semantics were also defined and implemented. A significant portion of the biomedical literature is represented in a manner that makes it difficult for consumers to find or aggregate content through a computational query. We have built a semantic, linked data version of the open-access subset of PubMed Central, enhanced with specialized annotation pipelines that use existing infrastructure from the National Center for Biomedical Ontology. We expose our models, services, software and datasets. We illustrate the utility of our system with several use cases and make it possible to run SPARQL queries over the resulting datasets.
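To give a sense of what a concept-based SPARQL query over such a dataset might look like, here is a minimal sketch. The endpoint URL, the concept IRI and the use of `dct:subject` and `dct:title` to link articles to concepts are assumptions for illustration; the vocabulary actually used in the KOPAR datasets is not specified in this report. The code only builds the query and the request URL (a `query` parameter on a GET request, as in the SPARQL 1.1 Protocol) and does not contact any server.

```python
from urllib.parse import urlencode

ENDPOINT = "http://example.org/sparql"  # hypothetical endpoint, not KOPAR's


def build_concept_query(concept_iri, limit=10):
    """Build a SELECT query for articles tagged with one ontology concept."""
    # Reject characters that may not appear inside a SPARQL IRI reference.
    if any(ch in concept_iri for ch in '<>"{}|^`\\ '):
        raise ValueError("not a valid IRI for a SPARQL query")
    return (
        "PREFIX dct: <http://purl.org/dc/terms/>\n"
        "SELECT ?article ?title WHERE {\n"
        f"  ?article dct:subject <{concept_iri}> ;\n"
        "           dct:title ?title .\n"
        "}\n"
        f"LIMIT {int(limit)}"
    )


def sparql_get_url(endpoint, query):
    """Encode the query as the `query` parameter of a GET request."""
    return f"{endpoint}?{urlencode({'query': query})}"


query = build_concept_query("http://example.org/concept/memory", limit=5)
print(query)
print(sparql_get_url(ENDPOINT, query))
```

Querying by concept IRI rather than by keyword string is what makes the query specific: two articles that use different words for the same concept would still match, provided both were annotated with the same ontology term.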

Final results

The KOPAR project has delivered the first semantic and linked data infrastructure in psychology. Using semantics and named entity recognition to generate linked open data had not previously been done in psychology. The KOPAR project improved, and built upon, existing work delivering linked data for biomedical publications. It delivers new infrastructure that is compatible with existing publication workflows and has been proven in two scenarios: that of ZPID and PsychOpen, and that of PMC. Industry, publishers in particular, may reuse the infrastructure built during this project or build their own semantic and linked data infrastructures. Either way, the KOPAR project proved the feasibility of having linked data as part of the publication workflow. The project also made clear what this infrastructure could be used for in terms of end-user interactions, and it made it easier to understand the importance of interoperable content in the scientific domain as an added value in an industry that should undergo profound changes and that has strategic importance for the European economy.

Website & more info

More info: http://kopar.oeg-upm.net/.