Report

Teaser, summary, work performed and final results

Periodic Reporting for period 2 - LineageDiscovery (Laying the Biological, Computational and Architectural Foundations for Human Cell Lineage Discovery)

Teaser

Our LineageDiscovery study addresses fundamental open questions in human biology and medicine that relate to the human cell lineage tree. An organismal cell lineage tree is a rooted, labelled binary tree where...

Summary

What is the problem/issue being addressed
Our LineageDiscovery study addresses fundamental open questions in human biology and medicine that relate to the human cell lineage tree. An organismal cell lineage tree is a rooted, labelled binary tree in which nodes represent the cells of the organism, edges represent progeny relations and labels capture cell state. The cell lineage tree of an adult human has about 100 trillion nodes. Our study will investigate the structure, dynamics and variance of the human cell lineage in development, adulthood and ageing, during disease progression, and in response to therapy.
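The definition above can be made concrete with a small data structure. The following is a minimal sketch in Python, not the project's own representation; the class name `Cell`, the helper functions and the example state labels are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Cell:
    """A node of a cell lineage tree: one cell, labelled with its state."""
    name: str
    state: str  # label capturing cell state (hypothetical example values)
    children: List["Cell"] = field(default_factory=list)

    def divide(self, left: "Cell", right: "Cell") -> None:
        """Record a division: edges run from a cell to its two daughters."""
        if self.children:
            raise ValueError(f"{self.name} has already divided")
        self.children = [left, right]


def leaves(cell: Cell) -> List[Cell]:
    """Return the cells below `cell` that have not divided (the leaves)."""
    if not cell.children:
        return [cell]
    return [leaf for child in cell.children for leaf in leaves(child)]


def count_nodes(cell: Cell) -> int:
    """Count every cell in the tree rooted at `cell`, the root included."""
    return 1 + sum(count_nodes(child) for child in cell.children)


# A toy tree: the root divides once, and one daughter divides again.
zygote = Cell("zygote", "totipotent")
a, b = Cell("A", "progenitor"), Cell("B", "progenitor")
zygote.divide(a, b)
a.divide(Cell("A1", "neuron"), Cell("A2", "glia"))
```

In this toy tree the root is the zygote, the leaves are the cells that could be sampled, and every internal node records a past division.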
Why is it important for society?
Despite decades of research, we still do not know how a cancer metastasizes and thereby becomes lethal. Hypotheses abound, but a definite answer has not emerged, in part due to a lack of adequate methods. A human cell lineage tree that captures cancer history, heterogeneity and topography at cellular resolution has never before been attempted, and it promises ground-breaking insights into cancer biology and therapy. New knowledge on the scale provided by human cell lineage analysis will have a profound impact on our understanding of developmental biology, the landscape of immune system maturation, and stem cell dynamics. For example, which cancer cells give rise to metastases? Is relapse after chemotherapy caused by ordinary tumor cells escaping chemotherapy stochastically, or by cancer stem cells that escape chemotherapy due to their slow division rate? Do beta cells, oocytes, neurons or heart muscle cells renew during adulthood? Do neural progenitor cells produce any brain cell type, or do specialized progenitors each produce one type of brain cell? Moreover, unraveling the dynamics of diseased cells, which depend on the specific cellular microenvironment and on stochastic events, through their cell lineage tree can help in selecting the appropriate treatment, thus facilitating the advancement of personalized medicine.
What are the overall objectives?
The overall objective of LineageDiscovery is to lay the biological, computational and architectural foundations for an envisioned large-scale human cell lineage discovery project, and to establish its feasibility and value via collaborative proof-of-concept cell lineage discovery experiments built on these foundations, ideally leading to its launch.
We plan to achieve this overall objective by focusing on three Research Objectives:

Research Objective 1: Develop a prototype efficient cell lineage discovery workflow. We aim to develop a prototype biological-computational cell lineage discovery workflow. The biological component of the workflow starts with human samples (e.g. blood samples, biopsies, autopsies), isolates individual cells, and (a) applies single-cell genomics to produce single-cell data. The computational component converts this data into knowledge: initially just (b) knowledge of the lineage trees of the sampled cells, but as trees accumulate, also (c) knowledge of the underlying cellular state dynamics that led to the formation of these trees. It will include: algorithms for single-cell genomics data analysis; algorithms for calling single-cell somatic mutations; algorithms for cell lineage tree reconstruction utilizing somatic mutations; algorithms for inferring internal node labels (cell state) from leaf labels in cell lineage trees; and algorithms for discovering cellular state dynamics in fully-labelled cell lineage trees. It will also include a meta-system for workflow validation and optimization that supports: methods and algorithms that measure the signal loss and noise introduced by the biological component and their effect on the computational component; independent testing of the computational component; and assessment of competing computational components such as signal analysis algorithms and cell lineage discovery algorithms.
Research Objective 2: Design, implement and deploy a prototype scalable architecture for collaborative human cell lineage discovery. We
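Among the algorithms listed under Research Objective 1 is the inference of internal node labels (cell state) from leaf labels. The report does not say which method is used. As an illustration only, the sketch below applies Fitch's small-parsimony algorithm, a classical technique for this kind of problem. The tree encoding (nested 2-tuples of leaf names), the function names and the state labels are assumptions made for the example.

```python
def candidate_sets(node, leaf_state, sets):
    """Bottom-up pass: compute the set of parsimonious states for each node."""
    if isinstance(node, str):  # a leaf, identified by its cell name
        sets[node] = {leaf_state[node]}
    else:
        left, right = node
        candidate_sets(left, leaf_state, sets)
        candidate_sets(right, leaf_state, sets)
        common = sets[left] & sets[right]
        # Fitch rule: intersection if non-empty, otherwise union.
        sets[node] = common if common else sets[left] | sets[right]
    return sets


def assign(node, sets, parent_state=None, labels=None):
    """Top-down pass: keep the parent's state when allowed, else pick one."""
    labels = {} if labels is None else labels
    options = sets[node]
    # min() is only a deterministic tie-break; it has no biological meaning.
    state = parent_state if parent_state in options else min(options)
    labels[node] = state
    if not isinstance(node, str):
        for child in node:
            assign(child, sets, state, labels)
    return labels


def infer_internal_labels(tree, leaf_state):
    """Return a dict mapping every node (leaf name or subtree) to a state."""
    return assign(tree, candidate_sets(tree, leaf_state, {}))


# Hypothetical example: four sampled cells with observed states.
tree = (("c1", "c2"), ("c3", "c4"))
leaf_state = {"c1": "T", "c2": "T", "c3": "B", "c4": "T"}
labels = infer_internal_labels(tree, leaf_state)
```

Here the ancestor of c3 and c4 is labelled "T", which explains the observed leaves with a single state change on the branch to c3.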

Work performed

The overall objective of LineageDiscovery is to lay the biological, computational and architectural foundations for an envisioned large-scale human cell lineage discovery project, and to establish its feasibility and value via collaborative proof-of-concept cell lineage discovery experiments built on these foundations, ideally leading to its launch.
We plan to achieve this overall objective by focusing on three research objectives. The first addresses the biological and computational foundations of effective human cell lineage discovery. The second addresses the architectural foundations of a scalable human cell lineage discovery project. The third addresses the feasibility and value of human cell lineage discovery (FIGURE 1).

The technological achievements for the research objectives are detailed below:
Research Objective 1: Develop a prototype efficient cell lineage discovery workflow.
We have developed a cell lineage workflow based on duplex molecular inversion probes (MIPs). It is composed of the following steps: (a) Design of duplex MIP precursors: desired targets are selected from our cell lineage database and precursors are designed; (b) Duplex MIP preparation: the precursors are synthesized on a microarray, collected and amplified by PCR as a pool. The PCR product is then digested to remove the universal adaptors (red and green), and the digested product is purified and diluted to obtain active duplex MIPs; (c) Duplex MIPs and template DNA are mixed together, the targeting arms (blue and yellow) anneal to the regions flanking the targets, and the MIPs are then circularized by gap filling with DNA polymerase and ligase. Linear DNA, including excess MIPs and template DNA, is digested by exonucleases, and an Illumina sequencing library is generated for each sample separately by adding adaptors and barcodes using PCR. Libraries are pooled and sequenced on an Illumina NGS platform, and the raw reads are then analyzed to detect mutations. This mutation information is then used to infer the cell lineage tree (FIGURE 2, FIGURE 3, FIGURE 4).
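The last step, going from per-cell mutation calls to a tree, can be illustrated with a toy example. The sketch below is not the project's reconstruction algorithm. It assumes each cell is summarized by one integer genotype per targeted locus (for example a repeat count), with `None` marking a locus that was not recovered, and it groups cells by greedy average-linkage clustering. All names and values are hypothetical.

```python
from itertools import combinations


def distance(g1, g2):
    """Mean absolute difference over loci observed in both cells."""
    shared = [(a, b) for a, b in zip(g1, g2) if a is not None and b is not None]
    if not shared:
        return float("inf")  # no shared loci: treat the cells as maximally distant
    return sum(abs(a - b) for a, b in shared) / len(shared)


def reconstruct(genotypes):
    """Greedy average-linkage clustering; returns nested tuples of cell names."""
    # Each cluster is a pair: (subtree built so far, names of its member cells).
    clusters = [(name, [name]) for name in genotypes]

    def linkage(c1, c2):
        pairs = [(x, y) for x in c1[1] for y in c2[1]]
        total = sum(distance(genotypes[x], genotypes[y]) for x, y in pairs)
        return total / len(pairs)

    while len(clusters) > 1:
        # Merge the two clusters whose members are closest on average.
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda ij: linkage(clusters[ij[0]], clusters[ij[1]]),
        )
        merged = ((clusters[i][0], clusters[j][0]), clusters[i][1] + clusters[j][1])
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return clusters[0][0]


# Hypothetical genotypes at four loci for four cells.
genotypes = {
    "c1": [10, 12, 8, None],
    "c2": [10, 12, 9, 15],
    "c3": [13, 12, 5, 15],
    "c4": [13, 11, 5, 14],
}
tree = reconstruct(genotypes)
```

Cells that share more somatic mutations end up as closer relatives in the resulting tree. Real data would also need explicit models of allelic dropout and amplification noise, which this sketch ignores.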

Research Objective 2: Design, implement and deploy a prototype scalable architecture for collaborative human cell lineage discovery.
An end-to-end automated system for the analysis of single-cell (SC) DNA, targeting 14K microsatellite (MS) loci, has been implemented and deployed on the Weizmann server farm (FIGURE 5).

Outline of the Workflow Database and Management System: (a) Sampling: documentation of sampling from patient to DNA, paired with a user interface for viewing, searching and documenting sampling components; (b) Targeted enrichment: documentation of target selection and probe design; (c) Analysis: the steps from NGS raw data to tree, across multiple tools, versions and parameters, paired with the Dask distributed package for running on computing clusters.
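Tracking an analysis "across multiple tools, versions and parameters" implies that each run can be identified by the exact chain of steps that produced it. The sketch below shows one generic way to do this, independent of the scheduler used. It is an assumption made for illustration, not the schema of the project's database, and the tool names, versions and parameters are invented.

```python
import hashlib
import json
from dataclasses import asdict, dataclass, field


@dataclass(frozen=True)
class AnalysisStep:
    """One step on the path from raw reads to tree (hypothetical record)."""
    tool: str
    version: str
    params: dict = field(default_factory=dict)


def run_id(raw_data_id, steps):
    """Derive a stable identifier from the input and the full step chain."""
    payload = {"input": raw_data_id, "steps": [asdict(s) for s in steps]}
    # sort_keys makes the serialization independent of dict insertion order.
    canonical = json.dumps(payload, sort_keys=True)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:16]


# Invented example pipeline: names and versions are placeholders.
pipeline = [
    AnalysisStep("read_merger", "1.2", {"min_overlap": 10}),
    AnalysisStep("genotyper", "0.9", {"min_reads": 30}),
    AnalysisStep("tree_builder", "2.0", {"method": "average_linkage"}),
]
first = run_id("sample-001", pipeline)
```

Two runs receive the same identifier only if the input, every tool version and every parameter agree, so trees produced under different settings are never confused.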

eSTGt: A programming and simulation environment for population dynamics. We have previously presented a formal language for describing population dynamics based on environment-dependent Stochastic Tree Grammars (eSTG). The language captures in broad terms the effect of the changing environment while abstracting away the details of interactions among individuals. An eSTG program consists of a set of stochastic tree grammar transition rules that are context-free. Transition rule probabilities and rates, however, can depend on global parameters such as population size, generation count and elapsed time. In addition, each individual may have an internal state, which can change during transitions. Our work presents eSTGt (eSTG tool), an eSTG programming and simulation environment. When executing a program, the tool generates the corresponding lineage trees as well as the internal state values, which can then be analyzed either through the tool’s GUI or using MATLAB’s command-line environment. The presented tool allows researchers to use existing biological knowledge to model the dynamics of a developmental process and to analyze its behavior throughout its history. Simulated lineage trees can be used t
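The idea of context-free rules whose probabilities depend on global parameters can be sketched in a few lines. The code below is written in Python for illustration and does not use eSTGt syntax or its API. The two cell types, the `capacity` parameter and the rule probabilities are assumptions invented for the example.

```python
import random


def rules(symbol, env):
    """Return [(probability, offspring symbols)] for one individual.

    Context-free: the choice depends only on the individual's own symbol,
    while the probabilities may depend on global environment parameters.
    """
    if symbol == "S":  # hypothetical self-renewing cell type
        p_divide = max(0.0, 1.0 - env["population"] / env["capacity"])
        return [(p_divide, ["S", "S"]), (1.0 - p_divide, ["D"])]
    return []  # hypothetical differentiated type "D": no further transitions


def simulate(root_symbol, generations, capacity, seed=0):
    """Grow a lineage tree generation by generation; return (root, leaves)."""
    rng = random.Random(seed)
    root = {"symbol": root_symbol, "children": []}
    frontier = [root]
    for generation in range(generations):
        # Global parameters are sampled once, at the start of each generation.
        env = {"population": len(frontier), "generation": generation,
               "capacity": capacity}
        next_frontier = []
        for node in frontier:
            options = rules(node["symbol"], env)
            if not options:  # terminal symbol: the cell persists unchanged
                next_frontier.append(node)
                continue
            weights = [p for p, _ in options]
            _, offspring = rng.choices(options, weights=weights)[0]
            node["children"] = [{"symbol": s, "children": []} for s in offspring]
            next_frontier.extend(node["children"])
        frontier = next_frontier
    return root, frontier


root, living = simulate("S", generations=6, capacity=50, seed=42)
```

As the population approaches `capacity`, division becomes less likely and differentiation more likely, so the environment shapes the tree without any rule referring to a neighbouring individual.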

Final results

With respect to the objectives shown in FIGURE 1, note that we have already achieved RO1’s 2nd Cycle Objective of 50K targets. We are well on our way to achieving RO2’s 2nd Cycle Objective of “Workflow deployed at several Single-Cell Genomics Centers worldwide”, with researchers at both the University of Regensburg and the Broad Institute working to deploy our workflow.

Website & more info

More info: http://www.weizmann.ac.il/math/Shapiro/.