Periodic Reporting for period 2 - STREAMLINE (STREAMLINE)

Teaser

\"STREAMLINE will address the competitive advantage needs of European online media businesses (EOMB) by delivering fast reactive analytics suitable in solving a wide array of problems, including addressing customerretention, personalised recommendation, and more broadly...

Summary

\"STREAMLINE will address the competitive advantage needs of European online media businesses (EOMB) by delivering fast reactive analytics suitable in solving a wide array of problems, including addressing customer
retention, personalised recommendation, and more broadly targeted services. STREAMLINE will develop cross sectorial analytics drawing on multi-source data originating from online media consumption, online games,
telecommunications services, and multilingual web content.

STREAMLINE partners face big and fast data challenges. They serve over 100 million users, offer services that produce billions of events yielding over 10 TB of data daily, and possess over a PB of data at rest. Their business use cases are representative of EOMB and cannot be handled efficiently and effectively by state-of-the-art technologies, as a consequence of system and human latencies.

System latency issues arise from the lack of appropriate stream-oriented analytics tools and, more importantly, from the added complexity, cost, and burden associated with jointly supporting analytics for both "data at rest" and "data in motion". The attached picture "Current Lambda architecture" illustrates the current state of the art.

Human latency results from the heterogeneity of existing tools and the low-level programming languages required for development, which involve an inordinate amount of system-specific boilerplate code (e.g., for Hadoop, Solr, Esper, Storm, and databases) and a plethora of scripts required to glue the systems together. Today this requires highly specific expertise in each of these areas.

All these factors make it difficult for European businesses to find enough people with the right competence to address the need for solutions. To unlock the potential of Big Data in Europe, we need to develop better methods and tools that are easy to use, even for non-experts.

STREAMLINE has formulated the following objectives, means of achievement, and measurable outcomes towards the goals of reducing complexity, enabling faster results, and supporting both "data at rest" and "data in motion" in a single system, in three innovative use cases of high business impact for the STREAMLINE partners.

Objective I: To research, design, and develop a massively scalable, robust, and efficient processing platform for data at rest and data in motion in a single system.
Objective II: To develop a high-accuracy, massively scalable, data stream-oriented machine learning library based on new algorithms and approximate data structures.
Objective III: To provide a unified interactive programming environment that is user-friendly and easy to deploy in the cloud, and to validate its success as measured by well-defined KPIs.
Objective IV: To implement a real-time contextualisation engine, enabling analytical and predictive models to take real-world context into account.
Objective V: To improve the industrial partners' business performance through the introduction of new services, improved performance, reduced costs, and business growth, as measured by well-defined KPIs, in a refactorable fashion that is easily reusable by other companies and industry domains.

STREAMLINE will build on the European open-source project Apache Flink and propose enhancements to achieve the stated objectives and reach maximal impact. The envisioned architecture is pictured in the attached "The architecture for Streamline".

Work performed

During the first period of the project it has been important for STREAMLINE to establish close cooperation with the Apache Flink community. The project is now firmly integrated in the FLIP (Flink Improvement Proposal) process, which enables it to put forward improvements that benefit the whole community.

Work in the research work packages (WP1-WP3) has addressed the two fundamental challenges in STREAMLINE: combining streaming and batch processing in the same platform, and ease of use. Work on the unified platform is ongoing, but early results are promising; one example is work on a Flink feature called side inputs, used to mix data in motion and data at rest, in the spirit of the sketch below. Ease of use has also been addressed in the first period: a very useful visualisation technique for streaming data has been successfully published and acknowledged, and setup and use have been greatly simplified by enhanced scripting and by integration with frameworks such as YARN and Hopsworks. Machine learning is of unmistakable importance today, and STREAMLINE has been adding strong capabilities for performing machine learning on streaming data.
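
As an illustration only, and not the project's own implementation, the following minimal sketch shows how the standard Flink DataStream API can already combine data in motion with data at rest by broadcasting a small reference dataset into a streaming job. The class name, the socket source, and the user/segment values are hypothetical placeholders.

    import org.apache.flink.api.common.state.MapStateDescriptor;
    import org.apache.flink.api.java.tuple.Tuple2;
    import org.apache.flink.streaming.api.datastream.BroadcastStream;
    import org.apache.flink.streaming.api.datastream.DataStream;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
    import org.apache.flink.streaming.api.functions.co.BroadcastProcessFunction;
    import org.apache.flink.util.Collector;

    public class EnrichStreamWithReferenceData {

        public static void main(String[] args) throws Exception {
            StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

            // "Data in motion": a stream of user events, read from a socket for simplicity.
            DataStream<String> events = env.socketTextStream("localhost", 9999);

            // "Data at rest": a small reference dataset (userId -> segment), hard-coded here;
            // in practice it could be loaded from a file or database.
            DataStream<Tuple2<String, String>> reference = env.fromElements(
                    Tuple2.of("user-1", "premium"),
                    Tuple2.of("user-2", "trial"));

            // Broadcast the reference data to every parallel instance of the enrichment operator.
            MapStateDescriptor<String, String> refDescriptor =
                    new MapStateDescriptor<>("referenceData", String.class, String.class);
            BroadcastStream<Tuple2<String, String>> broadcastRef = reference.broadcast(refDescriptor);

            // Enrich each incoming event with the segment looked up in the broadcast state.
            DataStream<String> enriched = events
                    .connect(broadcastRef)
                    .process(new BroadcastProcessFunction<String, Tuple2<String, String>, String>() {

                        @Override
                        public void processElement(String event, ReadOnlyContext ctx,
                                                   Collector<String> out) throws Exception {
                            // Events are assumed to be plain user ids in this illustration.
                            String segment = ctx.getBroadcastState(refDescriptor).get(event);
                            out.collect(event + " -> " + (segment == null ? "unknown" : segment));
                        }

                        @Override
                        public void processBroadcastElement(Tuple2<String, String> ref, Context ctx,
                                                            Collector<String> out) throws Exception {
                            ctx.getBroadcastState(refDescriptor).put(ref.f0, ref.f1);
                        }
                    });

            enriched.print();
            env.execute("Enrich data in motion with data at rest");
        }
    }

The side-input work mentioned above aims to make this kind of mixing of bounded and unbounded inputs simpler and more general than such hand-written enrichment jobs.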

All of this work has been validated through the four industrial use cases: Rovio for gaming, NMusic for music recommendation, Altice for analysing IPTV data, and IMR for contextualisation.

Final results

The work on a platform unifying stream processing and batch processing is now extending the types of advanced analytics that are possible. Equally important is the work on ease of use, as well as on making the platform robust with more advanced fault tolerance than today. This makes it possible for European enterprises to build competitive offerings using Big Data and analytics. Examples of companies other than our use-case partners that are using the results in Apache Flink include Zalando, King.com (Candy Crush), Bouygues Telecom, Ericsson, and many more.

In the longer term, the impact will also be a stronger supply of European data scientists and non-data scientists, thanks to an easier-to-use system with more advanced features that enables applications fitting the needs of European industry.

Website & more info

More info: http://h2020-streamline-project.eu/.