To date, the most successful machine learning techniques for the analysis of complex interdisciplinary data rely on large amounts of vectorial measurements as input to a statistical system.
Domain expert knowledge is often used only in data preprocessing, and the subsequently trained technique appears as a black box that is difficult to interpret or judge and rarely allows insight into the underlying natural process.
However, in many bio-medical applications the underlying biological process is complex and the number of measurements is limited by the cost and inconvenience for the patient.
The main aim of this project is the formulation of a generalised framework for learning in the space of probabilistic models representing the complicated underlying natural processes with potentially very few measurements.
The project combines the expertise of the Fellow (Dr. Bunte) in task-driven similarity learning and dimensionality reduction with the expertise of the Host Coordinator Prof. Tino at The University of Birmingham (UoB) in probabilistic modelling, dynamical systems and model-based learning. 
The UoB and all participants (University of Sheffield, Warwick and the company Diurnal Ltd) provide further bio-medical and modelling expertise. 
We started the computational investigation of the complex system of Adrenal Steroidogenesis. 
Inborn disorders of steroidogenesis constitute a complex of rare congenital conditions resulting from inherited mutations in genes encoding enzymes that catalyse steps in steroid hormone production. 
These include the variants of congenital adrenal hyperplasia (CAH), which manifest with a combination of adrenal insufficiency (AI) and disordered sex development (DSD). 
These are caused by mutations in steroidogenic enzymes, such as 21-hydroxylase (CYP21A2), 17-hydroxylase (CYP17A1), P450 oxidoreductase (POR), and 3β-hydroxysteroid dehydrogenase type 2 (HSD3B2).
In addition, there are disorders that manifest only with DSD, caused by mutations in 17β-hydroxysteroid dehydrogenase type 3 (HSD17B3), 5α-reductase type 2 (SRD5A2) or cytochrome b5 (CYB5A) (Fig. 1).
Inborn steroidogenic disorders primarily present in the paediatric population and need to be diagnosed as early as possible to avoid high mortality if lifesaving glucocorticoid therapy for AI is delayed, and to facilitate gender allocation and surgical planning in patients with DSD. 
Providing early diagnosis is difficult and existing diagnostic tests in the first days of life carry high false positive rates, are difficult to interpret and miss rarer enzyme deficiencies. 
In the initial stage of the project we analysed a collection of urinary steroid metabolomics data.
The approach of comprehensive profiling by gas chromatography-mass spectrometry (GC-MS; Fig. 2) was pioneered in the lab of our collaborator Prof. Arlt.
The rich data collection comprises urine GC-MS data from 829 healthy controls (305 under 1 year of age) and 118 genetically confirmed CAH and DSD patients, including one of the largest existing cohorts of patients with P450 oxidoreductase (POR) deficiency.
Based on these steroid fingerprints we developed an interpretable machine learning algorithm for the computer-aided diagnosis of inborn steroidogenic disorders.
We had to overcome challenges commonly faced in clinical practice when analysing metabolomics data, such as incomplete or inconsistent data sets, variation in the urine collection method (e.g. spot vs. 24-hour) and variation in age or sex.
Trained on the three largest disease cohorts (CYP21A2, POR and SRD5A2 deficiencies) and healthy controls, our method achieves a sensitivity and specificity of 97% in distinguishing patients with a condition from healthy controls.
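For readers less familiar with the two figures quoted above, the following is a minimal sketch of how sensitivity and specificity are computed from binary predictions. The function name and the toy labels are illustrative assumptions; they are not the project's code or data, and the numbers they produce are unrelated to the reported 97%.

```python
from typing import Sequence, Tuple


def sensitivity_specificity(
    y_true: Sequence[int], y_pred: Sequence[int]
) -> Tuple[float, float]:
    """Return (sensitivity, specificity) for binary labels.

    Label convention (an assumption for this sketch):
    1 = condition present, 0 = healthy control.
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # patients detected
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # patients missed
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # controls cleared
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # controls flagged
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("need at least one patient and one control")
    sensitivity = tp / (tp + fn)  # fraction of patients correctly detected
    specificity = tn / (tn + fp)  # fraction of controls correctly cleared
    return sensitivity, specificity


# Hypothetical toy labels: 4 patients followed by 6 healthy controls.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 0, 1]
sens, spec = sensitivity_specificity(y_true, y_pred)
print(f"sensitivity={sens:.2f}, specificity={spec:.2f}")
```

Sensitivity reflects how many patients would be missed, whereas specificity reflects how many healthy newborns would be flagged unnecessarily, so both are needed to judge a diagnostic test.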
On an individual level, the Research Fellow, Dr. Bunte, received important training that prepared her well for her current tenure-track Rosalind Franklin Fellow position at the University of Groningen in the Netherlands.
She collected de
A detailed explanation of the work performed in the period covered by the report can be found in Sections 1.1 to 1.2 of the Report Core.
Our proposed development has already shown promising results for data with very heterogeneous measurements that are missing systematically rather than at random.
Furthermore, the proposed computer-aided diagnosis system, trained on the three largest disease cohorts (CYP21A2, POR and SRD5A2 deficiencies) and healthy controls, achieves a sensitivity and specificity of 97% in distinguishing patients with a condition from healthy controls.
In addition, we are preparing a journal paper that combines principles from Applied Mathematics, Engineering and Computer Science for the clustering of pharmacokinetic models.
We expect that further development based on these principles will improve treatment titration and monitoring, enabling more personalised treatment for individual patients.
More info: http://www.cs.rug.nl/.