Report

Teaser, summary, work performed and final results

Periodic Reporting for period 1 - ERC-EuropePMC-2-2014 (Extracting funding statements from full text research articles in the life sciences)

Summary

Funding agencies need to identify the outcomes of research in order to assess the impact of funding across different themes and, in the case of the European Research Council (ERC), also across different disciplines and geographical areas. In the life sciences, research articles are the core currency of research assessment, so identifying the articles that have been supported by a given funding agency, and through a particular grant or funding stream, is vitally important. Currently, the only ways to identify papers in Europe PMC that have been funded by the ERC are:
(1) through metadata associated with the article, usually because the PI has used the grant-linking tool in Europe PMC, or a similar tool, after publication of the article;
(2) through metadata received from OpenAIRE, based on FP7-based funding;
(3) through free-text search of the full text, which is error prone because of noise from the full text content of the article, and which does not scale to searching for specific grant IDs.
Relying on the metadata-based methods (1) and (2) alone grossly underestimates the number of articles that can be attributed to the ERC.

The objectives of the project can be described as follows:

Strategic objectives of the project:

(1) To support the ERC’s visibility as a funder of excellent research by enabling the showcasing of ERC supported research results. Once funding statements are identified they will be displayed prominently as distinct metadata on Europe PubMed Central abstracts and incorporated into searchable fields and programmatic web services.
(2) To support the visibility of ERC grantees (the fact that they have obtained an ERC grant, which is a label of excellence).
(3) To facilitate the analysis of the impact of ERC funding. Unstructured and hard-to-find funding information in full text articles in Europe PubMed Central will be surfaced and made available for straightforward searching and filtering.

Operational objectives of the project (supporting the achievement of the strategic objectives):

(1) To accurately identify statements in full text articles that attribute ERC funding schemes and grants.
(2) To integrate ERC-based funding statements into the Europe PMC interfaces and search.

Work performed

To improve the recall of articles attributed to the ERC within the Europe PMC database, we have developed a new text-mining module for the Europe PMC text mining pipeline. This module accurately identifies statements in full text articles that attribute both ERC funding schemes and grants. To achieve this objective, we completed the following tasks:
(1) Development of the algorithm that identifies ERC funding statements including specific grant IDs. This algorithm, given sentences from Acknowledgements-type sections of articles, first identifies funding IDs using pattern matching, and then validates those IDs based on contextual information (such as the occurrence of the phrase “European Research Council”) within each sentence.
(2) Analysis of the scope and quality of the outputs, and iteration to improve the algorithm and to allow its potential extension to other Europe PMC funders.
(3) Integration of the resulting European Research Council Funding Statements Extraction Algorithm into the full-scale Europe PMC public services, so that the algorithm runs daily on all new full text content entering Europe PMC and the results are available in the public interfaces via simple search terms.
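
The two-step approach described in task (1), pattern matching followed by contextual validation, can be illustrated with a minimal Python sketch. This is not the project's implementation: the 6-digit grant ID pattern, the single context phrase, and the function name `extract_erc_grant_ids` are assumptions made for illustration only.

```python
import re

# Assumption: a candidate grant ID is a standalone 6-digit number.
# The report does not state the actual patterns used by the algorithm.
GRANT_ID_PATTERN = re.compile(r"(?<!\d)\d{6}(?!\d)")

# Context cue named in the report; a real system would likely use more cues.
CONTEXT_PATTERN = re.compile(r"European Research Council", re.IGNORECASE)


def extract_erc_grant_ids(sentences):
    """Return (grant_id, sentence) pairs for sentences that pass both steps."""
    results = []
    for sentence in sentences:
        # Step 1: find candidate funding IDs by pattern matching.
        candidates = GRANT_ID_PATTERN.findall(sentence)
        if not candidates:
            continue
        # Step 2: keep the candidates only if the same sentence
        # contains the contextual phrase that validates them.
        if CONTEXT_PATTERN.search(sentence):
            results.extend((grant_id, sentence) for grant_id in candidates)
    return results


if __name__ == "__main__":
    acknowledgements = [
        "This work was supported by the European Research Council (grant 123456).",
        "We thank the Wellcome Trust (grant 654321).",
        "Samples were collected in 2014.",
    ]
    for grant_id, sentence in extract_erc_grant_ids(acknowledgements):
        print(grant_id, "<-", sentence)
```

In this sketch, validation happens per sentence, matching the description above; an ID mentioned in a sentence without the context phrase is discarded even if a neighbouring sentence names the ERC.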

Final results

(1) Prior to the integration of the developed algorithm, just over 2,000 full text articles in Europe PMC could be retrieved using the following query: http://europepmc.org/search?query=%28GRANT_AGENCY:%22European+Research+Council%22%29+AND+IN_EPMC:Y (this can easily be constructed via the Advanced Search page). Once integration was complete, i.e. after the algorithm described in D1.1 had been applied to all the full text content in Europe PMC, this number had risen to 4,724 full text articles (as of 14th September 2015).
(2) We expect the number of articles attributed to the ERC to rise in the future, both through the collection of attributions by the more traditional methods and, now, through text mining of the full text articles that arrive on a daily basis. This dataset may provide insight into trends in how researchers attribute their funding sources.
(3) The source code for the algorithm developed to extract ERC funding statements is available on GitHub (https://github.com/jeekim/EuropePMC-Identifier-Extractor).
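
For readers who prefer to build the search link in result (1) programmatically rather than through the Advanced Search page, the following Python sketch reproduces that URL from its parts. The helper name `build_search_url` is hypothetical; the base URL and the `GRANT_AGENCY` and `IN_EPMC` fields are taken from the query quoted above.

```python
from urllib.parse import quote_plus

# Base URL as quoted in result (1) of this report.
SEARCH_BASE = "http://europepmc.org/search"


def build_search_url(agency):
    """Build the full-text search URL for articles attributed to a funder."""
    # Query syntax copied from the report: agency name in quotes,
    # restricted to articles with full text in Europe PMC.
    query = f'(GRANT_AGENCY:"{agency}") AND IN_EPMC:Y'
    # Keep ':' unescaped so the result matches the URL shown in the report.
    return f"{SEARCH_BASE}?query={quote_plus(query, safe=':')}"


if __name__ == "__main__":
    print(build_search_url("European Research Council"))
```

The same helper could, in principle, be pointed at another funder name, in line with the planned extension to other Europe PMC funders, provided that name matches how the agency is recorded in Europe PMC.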

Website & more info

More info: http://europepmc.org.