
Report

Teaser, summary, work performed and final results

Periodic Reporting for period 2 - GeCo (Data-Driven Genomic Computing)

Summary

Genomic Computing (GeCo) is a new data-driven basic science for the management of sequence data. GeCo research is based on a simple driving principle: data should express high-level properties of DNA regions and samples, and high-level data management languages should express biological questions with simple, powerful, orthogonal abstractions. The essence of this research is to rediscover the simplicity of the driving principles of data-driven computing, which have not been exploited by recent developments in bioinformatics. Although this idea is very simple, putting it into action is far from trivial, as it requires a radical change of the dominant approach. Following these principles, it is possible to build a progressive revolution of genomic computing towards important outcomes, such as integrated access to large repositories of sequences and the building of an Internet of genomic computing services providing Google-like processing and search, bringing potentially huge advantages to modern biological and clinical research.

Following these principles, the GeCo project is working towards the following outcomes:

1. Developing and exploiting a new core model for processed genomic data. There is a need for a simple data model centered on the notion of "experimental sample", including both genomic information (with a region-based organization) and metadata (generic properties of the sample, including biological and clinical properties and provenance). Such a model makes genomic data comparable across heterogeneous formats, encompassing the diversity of data formats that have been developed in the past. The model was developed between the submission and the start of the project, and was published in the journal Methods, Dec. 2016, "Modeling and interoperability of heterogeneous genomic big data for integrative processing and querying".

2. Developing new abstractions for querying and processing genomic data, by means of a declarative and abstract query language rich in high-level operations, with the objective of enabling a more powerful and at the same time simpler formulation of biological questions than the state of the art. This evolves the work cited in the GeCo Description of Action (published in the journal Bioinformatics, Feb. 2015, "GenoMetric Query Language: A novel approach to large-scale genomic data management"). Several GeCo publications describe the implementation and use of GMQL.

3. Bringing genomic computing to the cloud, within highly parallel, high-performance environments. By using new domain-specific optimization techniques, computational complexity is pushed to the underlying computing environment, producing optimized execution that is decoupled from the declarative specifications. We target open-source cloud computing environments that benefit from wide developer communities, so that our domain-specific work will leverage the general progress of cloud computing. Several GeCo publications describe technologies and optimization methods that apply to GMQL deployment on the cloud.

4. Providing an integrated repository of open data, available for secondary data use, in accordance with our obligations on ethical issues as discussed in the Description of Action. Unified access requires breaking barriers which depend both on data semantics and data access, so this work requires both ontological data integration and new query interaction protocols. The conceptual model of the repository metadata was presented at the International Conference on Entity-Relationship Approach, Nov. 2017, "Conceptual modeling for genomics: Building an integrated repository of open data". The current publicly available repository of open data includes 28 datasets from 8 sources and about 240K samples.

5. During GeCo, we contribute to basic science not only in computer science but also from an interdisciplinary point of view (targeting advances in biology and medical science), as we participate in studies aimed at solving biological or clinical problems.
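
The sample-centered data model of item 1 can be illustrated with a minimal Python sketch. This is not the published GDM definition or the GMQL API: the class names (Region, Sample), the helper functions, the coordinate convention and the toy values are all assumptions made for illustration. It only shows the idea that each sample pairs region data with metadata, so that samples from heterogeneous formats can be selected by metadata and compared by region.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Region:
    """One genomic region: coordinates plus format-specific values (hypothetical)."""
    chrom: str
    start: int          # assumed convention: 0-based, inclusive
    stop: int           # assumed convention: exclusive
    strand: str = "*"
    values: tuple = ()  # e.g. a peak score; the schema varies per dataset


@dataclass
class Sample:
    """An experimental sample: its regions plus attribute-value metadata."""
    sample_id: str
    regions: list = field(default_factory=list)
    metadata: dict = field(default_factory=dict)


def select_by_metadata(samples, **conditions):
    """Keep the samples whose metadata match every attribute=value condition."""
    return [
        s for s in samples
        if all(s.metadata.get(k) == v for k, v in conditions.items())
    ]


def overlaps(a, b):
    """True when two regions share at least one base on the same chromosome."""
    return a.chrom == b.chrom and a.start < b.stop and b.start < a.stop


# Two toy samples with made-up content, standing in for different source formats.
peaks = Sample(
    "s1",
    [Region("chr1", 100, 200, values=(7.5,)), Region("chr2", 50, 80, values=(3.1,))],
    {"assay": "ChIP-seq", "source": "example-A"},
)
mutations = Sample(
    "s2",
    [Region("chr1", 150, 151), Region("chr3", 10, 11)],
    {"assay": "WGS", "source": "example-B"},
)

# Metadata-driven selection, then a region-based comparison across samples.
chip = select_by_metadata([peaks, mutations], assay="ChIP-seq")
hits = [(p, m) for p in chip[0].regions for m in mutations.regions if overlaps(p, m)]
print(len(chip), len(hits))
```

Because both samples share one representation, the selection and the overlap test need no knowledge of the original file formats.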

Work performed

The major achievements of GeCo during the first 36-month period essentially confirm the project planning. They concern WP1 (closed at M18), WP2 (closed at M30), WP3 (ongoing), WP4 (ongoing), WP5 (ongoing) and WP7 (ongoing). They include the full deployment of GMQL Version 2.0 (R1.2) and the associated GDM model (R1.1), together with the first version of the open data repository (R4.1) and the protocol for remote data access (R5.1). Collectively, these results form a significant suite of technological platforms supporting biologists in the tertiary analysis of big genomic datasets. The GMQL prototype is open for public use through the current deployment at the CINECA supercomputing site; the system is currently supported by a Web-based interface, two libraries for use in R and Python, and a connector to the data flow language FireCloud, supported by the Broad Institute (Cambridge, MA), in the form of a “featured workspace” (announced on the FireCloud blog in June 2018) and ported to the new Terra big data management system for genomics. In addition, we have delivered the first version of a repository of open data (R4.1) for use with GMQL; it includes 28 datasets from 8 sources and about 240K samples.

Final results

1. Providing the required abstractions and technological solutions for improving the cooperation of research or clinical networks (i.e., the members of the same research project or international consortium) through federated database solutions, in which each center keeps ownership of its data, and queries move to remote nodes and are executed locally, thus bringing genomic processing to the data.

2. Providing unified access to the new repositories of processed NGS data which are being created by worldwide consortia. Unified access requires breaking barriers which depend both on data semantics and data access, so this work requires both ontological integration and new interaction protocols. Currently, metadata-driven access is supported at each individual repository through specific interfaces; this must be generalized and amplified, and augmented by providing search methods. We also aim at providing user-friendly search interfaces on top of integrated repositories.

3. Promoting the evolution of knowledge sources into an Internet of Genomes, i.e. an ecosystem of interconnected repositories made available to the scientific community. The dream is to offer improved points of access to genomic knowledge available world-wide, by leveraging new services, including metadata indexing and domain-specific crawlers; the long-term vision for the future of data-driven genomics - initiated by projects such as GeCo - is to develop Google-like systems supporting keyword-based and user-friendly region-based queries for finding genome data of interest available world-wide.
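
The federated execution pattern of item 1 (queries move to the nodes, data stay where they are) can be sketched in a few lines of Python. This is an in-process simulation under stated assumptions, not the GeCo remote access protocol, which is not described here: the Node class, the federated_run function and the toy samples are hypothetical.

```python
from collections import Counter


class Node:
    """A hypothetical research center that keeps ownership of its samples."""

    def __init__(self, name, samples):
        self.name = name
        self._samples = samples  # raw data never leave the node

    def execute(self, query):
        """Run the query locally and return only its aggregated result."""
        return query(self._samples)


def count_by_assay(samples):
    """Example query: number of local samples per assay type."""
    return Counter(s["assay"] for s in samples)


def federated_run(nodes, query):
    """Send the same query to every node and merge the partial results."""
    total = Counter()
    for node in nodes:
        total += node.execute(query)
    return total


# Made-up nodes and samples, for illustration only.
nodes = [
    Node("center-A", [{"assay": "ChIP-seq"}, {"assay": "RNA-seq"}]),
    Node("center-B", [{"assay": "ChIP-seq"}]),
]
result = federated_run(nodes, count_by_assay)
print(dict(result))
```

Only the per-node counts cross node boundaries; in a real deployment the call to execute would be a network request, and the merge step would depend on the kind of query.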

Website & more info

More info: http://www.bioinformatics.deib.polimi.it/geco/.