Periodic Reporting for period 2 - BIGCODE (Learning from Big Code: Probabilistic Models, Analysis and Synthesis)

Teaser

The project addresses the general challenge of building more reliable, efficient and secure software. This is an important challenge, as billions of euros are spent yearly on software defects. The unique angle of this project, and its objective, is to learn machine learning models from large amounts of code (called Big Code).

Summary

The project addresses the general challenge of building more reliable, efficient and secure software. This is an important challenge, as billions of euros are spent yearly on software defects. The unique angle of this project, and its objective, is to learn machine learning models from large amounts of code (called Big Code) that capture the history and experience of how existing programs (billions of lines of code) have been built over time. These models can then be used to guide the creation of new programs that would not be feasible otherwise, as well as to debug and understand existing programs (for example, by finding critical security errors). Technically, the project sits at the intersection of several areas of computer science: machine learning, program synthesis and program analysis. This interdisciplinary nature enables the creation of a new sub-area of computer science (learning from code, similar to areas where one learns from videos or from natural language) and, as a result, leads to new groundbreaking techniques and systems.
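To make the idea of learning from Big Code concrete, here is a minimal sketch (not one of the project's actual models): a bigram model learned over a tiny, invented token corpus, used to suggest the next likely code token. Real models of code are far richer, but they share this basic principle of estimating statistics from existing programs.

```python
from collections import Counter, defaultdict

# Tiny illustrative token corpus; real Big Code corpora span
# billions of lines drawn from existing programs.
corpus = [
    ["for", "i", "in", "range", "(", "n", ")", ":"],
    ["for", "j", "in", "range", "(", "m", ")", ":"],
    ["if", "i", "in", "seen", ":"],
]

# Count how often each token follows each other token.
bigrams = defaultdict(Counter)
for tokens in corpus:
    for prev, cur in zip(tokens, tokens[1:]):
        bigrams[prev][cur] += 1

def suggest(prev_token):
    """Return the most frequent successor of prev_token, or None."""
    counts = bigrams.get(prev_token)
    return counts.most_common(1)[0][0] if counts else None

print(suggest("in"))  # -> "range" (its most frequent successor here)
```

The same counting idea, scaled up and combined with program analysis, is what lets models learned from code guide the writing of new code.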

Work performed

So far, the project has achieved a number of milestones:

- we have created a new way to program mobile user interfaces by leveraging Big Code and synthesis techniques. This work, and our current research on the topic, has the potential to change the way UI programming is done: the idea is to simply provide a picture of the interface we want our mobile application to have, and then have the machine generate the actual code that produces this picture.

- we have also built advanced, explainable models of code which are faster to train than standard deep learning methods. These models are particularly useful for source code, but we have also found applications in natural language processing. Building interpretable models is a major challenge (see the latest EU regulations), and our work addresses it in the context of programs. Given the substantial interest from academia and industry, we are continuing to improve these models and plan to release the full code publicly.

- we have also built a number of impactful Big Code-based systems, such as https://debin.ai and http://apk-deguard.com/ , which use graphical models learned over Big Code to de-obfuscate programs. Both systems are widely used today by companies and academics.
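The actual systems use structured prediction (graphical models) over program elements; as a deliberately simplified stand-in, the sketch below predicts a plausible original name for an obfuscated variable by voting over the functions it is passed to. The training pairs are invented for illustration and are not the systems' real features.

```python
from collections import Counter, defaultdict

# Invented (name, usage-context) training pairs: each variable name is
# associated with the functions it was observed being passed to.
training = [
    ("i", ("range", "len")),
    ("i", ("range",)),
    ("path", ("open", "exists")),
    ("path", ("open",)),
]

# For each usage context, count which names were seen with it.
votes = defaultdict(Counter)
for name, contexts in training:
    for ctx in contexts:
        votes[ctx][name] += 1

def predict_name(contexts):
    """Vote over the variable's usage contexts; return the best name."""
    tally = Counter()
    for ctx in contexts:
        tally.update(votes.get(ctx, Counter()))
    return tally.most_common(1)[0][0] if tally else "unknown"

print(predict_name(("open", "exists")))  # -> "path"
```

The real graphical models additionally capture dependencies between predictions (e.g., related variables constraining each other's names), which is what makes them effective on stripped or obfuscated binaries and apps.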

- finally, we have investigated the robustness of deep learning, a problem that reaches beyond code. We are now applying the techniques we developed to train robust models of code and other classifiers, all of which are of practical use.
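As a flavour of what robustness research studies, here is a minimal sketch of the fast gradient sign method (FGSM), a standard way to construct adversarial inputs; the toy linear model and epsilon below are illustrative and not taken from the project.

```python
import math

# Toy linear classifier: score(x) = w . x + b, label +1 if score > 0.
w = [2.0, -1.0]
b = 0.0

def score(x):
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def fgsm(x, y, eps):
    """Perturb x by eps per coordinate to shrink the margin.

    For a linear model with label y in {-1, +1}, the gradient of the
    margin y * score(x) w.r.t. x is y * w, so stepping against its sign
    reduces the margin maximally within an L-infinity ball of radius eps.
    """
    return [xi - eps * y * math.copysign(1.0, wi) for xi, wi in zip(x, w)]

x = [1.0, 1.0]            # clean example with label +1
print(score(x))            # -> 1.0  (correctly classified)
x_adv = fgsm(x, +1, 0.6)
print(score(x_adv))        # -> -0.8 (tiny perturbation flips the label)
```

Training robust models means making such perturbations ineffective, which matters in adversarial settings such as malware detection, where an attacker actively tries to fool the classifier.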

Final results

We are expecting to:

- develop techniques and systems for training robust models of code. This has not been done before, yet it is critical (e.g., for malware detection).
- develop an end-to-end system (with several new techniques) for generating complete Android layout code from an image (this includes deep learning methods for attribute prediction).

Website & more info

More info: http://plml.ethz.ch.