Periodic Reporting for period 2 - ExCAPE (Exascale Compound Activity Prediction Engine)

Summary

Computing power has increased greatly over the last few decades. Despite this increase, there are various applications whose requirements exceed the computing power of standard machines. To tackle this, specialised machines, called high performance computers (HPC), are periodically constructed with the best available technology to provide a very large amount of concentrated compute power and give the best possible answers for such demanding applications. The next generation of HPC, expected sometime after 2020, is called Exascale.
Traditional users of HPC have done simulations of one type or another. In the last decade, a new breed of HPC user has appeared, those concerned with Big Data. Doing simulations is mostly about large amounts of computation to observe the behaviour of a sophisticated model with few parameters. Big Data problems, by contrast, deal with less sophisticated models but with many more parameters, and try to choose the parameters by analysing large amounts of data. Folk wisdom in this field states that the ability to capture and analyse more data is more valuable than making more sophisticated models, and this works well when data is cheap and easy to get. However, there are problems for which the data is very expensive to collect. In this case it becomes important to use more sophisticated models to squeeze as much knowledge as possible out of the expensive data. Such problems sit at the juncture of HPC and Big Data: they have large data sets to analyse, yet should exploit more sophisticated models through computation to make the most of the available data.
The ExCAPE project is about how to tackle such problems. The core of the project is the mathematics and software involved, and how they perform on HPC machines. However, to advance the state of the art it helps to have a concrete problem to tackle. For this we take the chemogenomics problem: predicting the activity of compounds in the drug discovery phase of medical research, hence the project name, Exascale Compound Activity Prediction Engine. Chemogenomics is a machine learning problem.
The objectives of the project are to find methods and systems that can tackle large and complex machine learning problems, such as chemogenomics. This will require algorithms and software that make efficient use of the latest HPC machines. Creating these, along with preparing the data to give the system something to work on, is the main work of the project.
The project team is composed of 9 top research institutions and companies from 8 different countries in Europe, working together to advance the state of technology.
The final conclusions of the project are that it is possible to improve the results for this type of problem by using more sophisticated machine learning techniques that require large amounts of compute power, and furthermore that some of these techniques can be adapted to work more efficiently on HPC machines. We have shown these results through large scale tests using over four million core hours, compute power equivalent to roughly 35 years of run time on a modern server.
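As a back-of-the-envelope check of that analogy (a sketch under our own assumption of roughly 13 cores per server, a figure not stated in the report), the core-hour total converts to server-years as follows:

```python
# Back-of-the-envelope check of the "35 years on a modern server" analogy.
# The core count is our own assumption, not a figure from the report.
core_hours = 4_000_000
cores_per_server = 13                 # assumed size of a "modern server"
hours_per_year = 24 * 365
years = core_hours / (cores_per_server * hours_per_year)
print(f"{years:.1f} server-years")    # roughly 35
```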

Work performed

The project has now finished. During the project, we did research on algorithms and software, and produced cleaned-up data sets used to test the system and to make chemogenomics predictions from. The majority of the software and data sets have been released publicly as Open Source and Open Data. How the activities are linked is shown in the diagram.
The algorithmic work has focused on four areas: deep learning, matrix factorisation, conformal prediction and clustering. We have created new types of deep learning networks that are easier to train, designed networks for the chemogenomics problem, and shown how to chop linear and non-linear matrix factorisation problems into smaller pieces. We have shown how to combine different models with conformal prediction to make more accurate predictions. We have also picked a fast approach to clustering and improved its speed further. We have shown that different activation functions can help with training deep learning networks and that the networks can be trained using different ways of describing the data; that matrix factorisation is roughly as accurate and about twice as fast when chopped up into smaller pieces; and that conformal prediction runs well on reasonable size machines.
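To make the matrix factorisation idea concrete, here is a minimal sketch (not the project's code; all names, sizes and the toy data are assumptions) of alternating least squares on a compound-by-target activity matrix, with the compound-side update split into independent row blocks; this kind of chopping into smaller pieces is what lets the work spread over many nodes.

```python
# Minimal sketch (not project code): alternating least squares (ALS)
# factorisation of a compound x target activity matrix, with the
# compound-side update split into independent row blocks.
import numpy as np

rng = np.random.default_rng(0)
n_compounds, n_targets, rank = 200, 50, 8            # toy sizes (assumed)
U_true = rng.standard_normal((n_compounds, rank))
V_true = rng.standard_normal((n_targets, rank))
Y = U_true @ V_true.T + 0.1 * rng.standard_normal((n_compounds, n_targets))

U = 0.1 * rng.standard_normal((n_compounds, rank))   # compound factors
V = 0.1 * rng.standard_normal((n_targets, rank))     # target factors
lam = 0.1                                            # ridge regulariser
block = 50                                           # compounds per block

for it in range(20):
    # Compound-side update: each row block depends only on V, so the blocks
    # could be solved on different nodes and gathered afterwards.
    G = V.T @ V + lam * np.eye(rank)
    for start in range(0, n_compounds, block):
        rows = slice(start, start + block)
        U[rows] = np.linalg.solve(G, V.T @ Y[rows].T).T
    # Target-side update, using the freshly updated U.
    H = U.T @ U + lam * np.eye(rank)
    V = np.linalg.solve(H, U.T @ Y).T

rmse = np.sqrt(np.mean((Y - U @ V.T) ** 2))
print(f"reconstruction RMSE after block ALS: {rmse:.3f}")
```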
The work on software has focused on how to run many programs at the same time on a very large HPC machine, how to easily make different variants of matrix factorisation, and how to write programs that efficiently deal with the sort of sparse data that is analysed in chemogenomics. The main result is that we now have software for each of these three tasks. In addition, we have done a lot of analysis of the algorithms to understand how to make them run efficiently and on which pieces of computer hardware they work well. The software is being used by the company partners, and has been demonstrated to various groups of possible users.
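The sparsity point can be illustrated with a toy sketch (using SciPy rather than the project's own software; the sizes and density below are assumptions): chemogenomics data records activity for only a small fraction of all compound-target pairs, so storing only the measured entries saves orders of magnitude in memory.

```python
# Toy illustration (not the project's software): keep only the measured
# compound-target activities in a compressed sparse row (CSR) matrix.
import numpy as np
from scipy import sparse

rng = np.random.default_rng(1)
n_compounds, n_targets = 100_000, 1_000
n_measured = 500_000                  # ~0.5% of all pairs actually measured

rows = rng.integers(0, n_compounds, n_measured)
cols = rng.integers(0, n_targets, n_measured)
vals = rng.standard_normal(n_measured)   # stand-in activity values

Y = sparse.csr_matrix((vals, (rows, cols)), shape=(n_compounds, n_targets))

sparse_bytes = Y.data.nbytes + Y.indices.nbytes + Y.indptr.nbytes
dense_bytes = n_compounds * n_targets * 8
print(f"sparse: {sparse_bytes:,} bytes  vs  dense: {dense_bytes:,} bytes")

# Sparse-aware operations touch only the stored entries, e.g. projecting
# every compound onto a low-rank target representation:
V = rng.standard_normal((n_targets, 16))
proj = Y @ V
print(proj.shape)                     # (100000, 16)
```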
The work on data has consisted of preparing the data fed to the software to test how well it (and the algorithms it implements) works. Tests have been done against the best current competitors, and we have run some very large scale experiments to show where the new techniques give better results, and to start to understand why. The data we prepared was also used for a competition between universities to see who could come up with good machine learning ideas.

Final results

The progress in algorithms comprises more easily trainable neural networks, more scalable approaches to matrix factorisation, a way to combine the outputs of different methods using conformal prediction, and a faster implementation of clustering. In software, the advances are more efficient means of running machine learning algorithms on HPC that are faster than the nearest competitor, a toolkit that allows novel combinations of the building blocks of matrix factorisation, and fast deep learning software. For the data, the main contributions are the public release of a data set that is similar in size and form to that seen in industry, so that other groups can benchmark their algorithms on problems of this size and difficulty, and the preparation of various other data sets used within the project.
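For readers unfamiliar with conformal prediction, the sketch below shows the basic inductive recipe on a toy regression problem (an illustration under our own assumptions, not the project's implementation or its model-combination scheme): calibrate the absolute residuals of a base model on held-out data, then turn each new point prediction into an interval with a chosen error rate.

```python
# Minimal inductive conformal regression sketch (illustration only):
# calibrate absolute residuals of a base model, then wrap each new point
# prediction in an interval with coverage of roughly 1 - alpha.
import numpy as np

rng = np.random.default_rng(2)
n, d = 600, 5
X = rng.standard_normal((n, d))
w_true = rng.standard_normal(d)
y = X @ w_true + 0.3 * rng.standard_normal(n)

# Split the data: fit the base model, calibrate the residuals, then test.
X_fit, y_fit = X[:300], y[:300]
X_cal, y_cal = X[300:500], y[300:500]
X_new, y_new = X[500:], y[500:]

# Base model: ordinary least squares.
w, *_ = np.linalg.lstsq(X_fit, y_fit, rcond=None)

# Nonconformity scores: absolute residuals on the calibration set.
scores = np.abs(y_cal - X_cal @ w)

alpha = 0.1                                        # target ~90% coverage
k = int(np.ceil((1 - alpha) * (len(scores) + 1)))  # conformal quantile index
q = np.sort(scores)[min(k, len(scores)) - 1]

lo, hi = X_new @ w - q, X_new @ w + q
coverage = np.mean((y_new >= lo) & (y_new <= hi))
print(f"interval half-width {q:.3f}, empirical coverage {coverage:.2f}")
```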
The main result at the end of the project is benchmarked improvements in accuracy and speed for building chemogenomics models on HPC machines, as a combined result of the improvements in algorithms and software. Other results include further data set preparation, improvements to the software to make it easier to use, and an understanding of how well the different machine learning techniques scale on different types of large HPC machine.
The potential socio-economic impact of the project is large. The pharmaceutical industry is instrumental in tackling serious chronic diseases, whose importance is growing as the population in Europe ages. Improving how it discovers treatments for diseases such as Alzheimer's and diabetes would have a huge impact. In addition, the industry is a major employer in the EU and a major contributor to its exports. The machine learning methods and systems developed in the project also have applications in e-commerce, fraud detection and manufacturing, all of which benefit society. Finally, further reinforcing European technology and HPC capabilities will have beneficial knock-on effects for scientific advance and job creation.

Website & more info

More info: http://excape-h2020.eu/.