Periodic Reporting for period 1 - ALIGNED (Aligned, Quality-centric Software and Data Engineering)

Summary

ALIGNED is developing tools and methods for parallel software and data engineering of web-scale information systems. The complexity, volatility and variable quality of data available through the Internet far exceeds the capacity of traditional software engineering to produce applications which can master it. Our goal is to narrow this gap – to produce innovative tools and methods which reduce the complexity and cost of creating and maintaining high-quality datasets and the software applications which use them. We are producing technological foundations to enable actors in the European software industry to develop a new generation of low-cost agile services which leverage the vast amount of data available on the web.

In our approach, the categorisation and description of data is considered to be a primary activity from which software tools can be automatically generated if quality controls can be maintained. Linked Data is used as a unifying technical foundation which allows not only the domain data to be described, but also the process of tool use and tool integration. This enables continuous improvement and ongoing customisation of the software and data-model in close tandem.

Recent years have seen an explosion of interest in statistical analysis of large datasets in a wide variety of application domains. Yet the tools to apply such analysis remain basic, expensive and difficult to use. The ALIGNED project includes four use cases covering diverse domains. Seshat involves teams of social scientists building datasets which describe historical societies, while DBpedia, the hub of the web-of-data, is building a high-quality general purpose encyclopaedia of structured data. The other two use cases, Wolters Kluwer’s Jurion legal information system and the Semantic Web Company’s PoolParty product, cover commercial situations: a software user and a software developer respectively.

The concrete objective of ALIGNED is to demonstrate that the tools and methods that we develop can be integrated into the workflows of all four of these use cases in such a way that they produce measurable improvements in productivity, quality and agility. Achieving this across our diverse use cases will demonstrate the general utility and significance of our innovations.

Work performed

The major technical results are new and extended software tools; new data-models, ontologies and datasets; scientific publications, technical reports and standards; and integrations of our technology with production systems to support our use cases.

ALIGNED has extended Oxford’s Model Catalogue and Semantic Booster tools to allow them to integrate seamlessly with linked data technologies. Leipzig’s RDFUnit has been extended to provide a new suite of quality control measures and has been integrated with the popular JUnit testing framework. The Semantic Web Company have extended their PoolParty platform with a new consistency module and developed a new unified governance tool to integrate administration across their software and data teams. Trinity College Dublin have developed several new services for their Dacura platform, including the Dacura Quality Service, which offers real-time validation of complex data changes, and have extended their SUMMR tool to support integration with RDFUnit. All of these are available as open source software through our website.
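As a rough illustration of the kind of check a data-quality test suite performs, the sketch below runs a single constraint over a handful of triples and reports the violations. It is written in plain Python and does not use the API of RDFUnit, Dacura or any other ALIGNED tool; the names `Violation`, `require_property` and `run_tests`, the `ex:` identifiers and the sample values are all hypothetical and chosen only for this example.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

# A triple is (subject, predicate, object); plain strings stand in for IRIs.
Triple = Tuple[str, str, str]


@dataclass(frozen=True)
class Violation:
    subject: str
    message: str


# A test takes the whole dataset and returns the violations it finds.
QualityTest = Callable[[List[Triple]], List[Violation]]


def require_property(rdf_type: str, prop: str) -> QualityTest:
    """Build a test: every subject typed `rdf_type` needs at least one `prop`."""

    def test(triples: List[Triple]) -> List[Violation]:
        typed = {s for s, p, o in triples if p == "rdf:type" and o == rdf_type}
        having = {s for s, p, _ in triples if p == prop}
        # Sort so that the report order is deterministic.
        return [Violation(s, f"missing {prop}") for s in sorted(typed - having)]

    return test


def run_tests(triples: List[Triple], tests: List[QualityTest]) -> List[Violation]:
    """Run every test over the dataset and collect all violations."""
    violations: List[Violation] = []
    for test in tests:
        violations.extend(test(triples))
    return violations


# Made-up sample data (hypothetical identifiers and values).
DATA: List[Triple] = [
    ("ex:Rome", "rdf:type", "ex:Polity"),
    ("ex:Rome", "ex:population", "1000000"),
    ("ex:Carthage", "rdf:type", "ex:Polity"),
]

if __name__ == "__main__":
    suite = [require_property("ex:Polity", "ex:population")]
    for v in run_tests(DATA, suite):
        print(v.subject, v.message)
```

Expressing each constraint as a function that returns a list of violations means a suite of them can be run from an ordinary unit-testing framework, with an empty list counting as a pass.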

ALIGNED has created a suite of 11 new ontologies, describing domains ranging from temporal uncertainty to enterprise software development, and contributed to 4 other ontologies as part of standardisation efforts. We have also published two large datasets. All of these results have been made available with open source licenses. In several cases our results have been incorporated into evolving W3C standards – in particular the Data Quality Vocabulary (DQV) and Shapes Constraint Language (SHACL) incorporate the fruits of our labours in several places.

Our work has been scientifically validated through 18 distinct peer-reviewed publications, 10 of which were in respected journals or high-impact conferences. We expect this to improve further: our work on the Seshat use case has produced results which are being prepared for submission to both Science and Nature.

Considerable effort has also been expended on providing the infrastructure to support both internal and external communication and collaboration. In addition to organising 6 well-attended project meetings, we have established a website, a secure storage area, a wiki-based collaboration platform, mailing lists, online meeting software and dissemination channels on a variety of social media platforms (Facebook, Twitter, YouTube, SlideShare and Flickr). We have kept a steady stream of content flowing through these channels: 16 videos have already been published.

Most significantly, our results have been deployed in the production systems of all four of our use cases, including both commercial scenarios. Wolters Kluwer’s production Jurion system has been using ALIGNED technology (RDFUnit) since 2015, and the consistency validation system developed within ALIGNED is now a feature of PoolParty. In our other use cases, not only has our technology already been deployed in production systems, but we have also collected evidence which demonstrates significant improvements in data quality as a direct result of our innovations.

Final results

ALIGNED has contributed considerably to extending the scientific state of the art in a number of areas, including data quality analysis, semantics-based software engineering, transactional triple-stores and dataset curation workflows. Our impact has been greatly magnified through our ability to implement usable software tools based on our research and to put them together into platforms and toolchains which can be applied to multiple different use cases. We have already achieved significant impact beyond academia, with our tools and models deployed in the production systems of all four use cases. Furthermore, by providing our use-case partners with new tools and methods which enable them to have greater impact upon our society and economy, we expect our own impact to be magnified further. Millions of users – including Jurion and PoolParty customers and members of the public browsing the DBpedia and Seshat datasets – will have directly used our innovations by the end of the project.

To ensure that our results have impact after the project ends, our major focus continues to be validating our tools and methods by demonstrating concrete improvements in productivity, agility and quality in our use cases. Our tools deployed within Wolters Kluwer’s Jurion platform can spread from there to other WKD products or be adopted by other publishing companies, if there is strong evidence for their efficacy. By making all of our tools available as open source software, we have ensured that this technology transfer can be as smooth as possible.

However, although our focus is on practical results, dissemination remains essential. We have focused heavily on organising and presenting at industrial and community-focused workshops. We participate actively in the EC Cluster on Software Engineering for Services and Applications as well as in broader initiatives like the Big Data Value Association. We have organised a workshop at the influential SPLASH conference and are the main organisers of the SEMANTICS conferences.

Website & more info

More info: http://aligned-project.eu/.