Periodic Reporting for period 1 - FashionBrain (Understanding Europe’s Fashion Data Universe)

Summary

The primary goal of any retailer is to understand the customer and predict upcoming demands and trends. However, even a complete record of past purchases (and returns) is insufficient to fully understand how items from a product catalogue align with customers’ general tastes, lifestyle choices and aspirations. Additionally, from a business perspective, any efficiency gains in the logistics of supplier management, shipping and handling are minor compared to the gains one could obtain from a better understanding of customers’ personalities and habits.

In this project we consolidate and extend existing European technologies in the areas of database management, data mining, machine learning, image processing, information retrieval, and crowdsourcing to strengthen the position of European fashion retailers among their world-wide competitors. Improvements in the fashion industry value chain are expected as a result of project outcomes, such as novel online shopping experiences, the detection of influencers, and the prediction of upcoming fashion trends. Tangible outcomes include software, demonstrators, and novel algorithms for the future data-driven fashion industry.

Project objectives are:
- to develop a general schema to support cross-domain data integration activities;
- to design new techniques for entity extraction from text and images;
- to create new human-machine techniques for data integration;
- to develop a novel big data infrastructure to support scalable multi-source data federation;
- to design new crowdsourcing interfaces customised for fashion-related crowdsourcing tasks;
- to develop improved in-database text mining systems and new algorithms for image retrieval.

Work performed

Firstly, we worked on building domain knowledge:
- Analysed existing solutions (industrial and academic) for data integration to help us position our work. The output was a poster presented at the KDD ML4Fashion workshop.
- Conducted interviews with internal stakeholders to better understand available data, related challenges and use cases. This helped to identify a number of research challenges pertinent to FashionBrain.
- Developed a fashion taxonomy, which aggregates various sources, such as the Fashwell taxonomy (a unified resource comprising 20 fashion retailers) complemented with publicly available sources, such as Google Product (demo: https://fashionbrain-project.eu/fashion-taxonomy/).
- Produced software requirements for time series analysis and developed a Probabilistic RNN for sequential data with missing values (code: https://github.com/zalandoresearch/probrnn).

Secondly, we have worked on semantic data integration from three different perspectives:
- Developed techniques for entity extraction from text and images. An important outcome is FashionNLP: a natural language processing tool for fashion-related text (code: https://github.com/eXascaleInfolab/fashionNLP).
- Provided foundations of semantic data integration for data extraction from images, releasing a library of deep learning models (code: https://github.com/zalandoresearch/fb-model-library).
- Provided initial solutions to store and share the FashionBrain taxonomy, common datasets, and extracted entities and links, as well as in-database methods and solutions (ongoing).

Thirdly, we have developed human computation and crowdsourcing tools to improve the quality of training data and perform annotation at scale:
- Improved crowdsourcing agreement measures, with a publication at HCOMP-17, a live demo (http://agreement-measure.sheffield.ac.uk) and open source code.
- Released the open source ModOp browser plugin to improve crowdsourcing interfaces (www.github.com/FashionBrainTeam/ModOp).
- Analysed the vulnerabilities of crowdsourcing interfaces and related potential biases, with a publication at HCOMP-18 (winning the Best Paper Award).
- Performed a study on perceived bias in crowdsourcing, with a publication at SIGIR 2018 and a publicly available dataset.
- Presented a poster paper on rating systems in crowdsourcing at HCOMP-16.
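
For orientation, the kind of quantity an agreement measure computes can be sketched with a standard baseline, Fleiss' kappa. This is only a minimal illustration and is not the improved measure published at HCOMP-17. The function name, the category labels and the toy judgements are assumptions made for this example.

```python
from collections import Counter


def fleiss_kappa(ratings, categories):
    """Fleiss' kappa for items that were each labelled by the same number of workers.

    ratings:    list of items, each a list of labels (one label per worker)
    categories: all labels a worker may choose from
    """
    n_items = len(ratings)
    n_raters = len(ratings[0])
    assert n_raters > 1, "need at least two workers per item"
    assert all(len(item) == n_raters for item in ratings), "unequal worker counts"

    counts = [Counter(item) for item in ratings]

    # Observed agreement: for each item, the share of worker pairs that agree.
    per_item = [
        (sum(c[cat] ** 2 for cat in categories) - n_raters)
        / (n_raters * (n_raters - 1))
        for c in counts
    ]
    observed = sum(per_item) / n_items

    # Expected agreement by chance, from the overall label proportions.
    proportions = [
        sum(c[cat] for c in counts) / (n_items * n_raters) for cat in categories
    ]
    expected = sum(p ** 2 for p in proportions)

    if expected == 1.0:
        # Only one label was ever used; kappa is undefined, and this sketch
        # returns 1.0 by convention.
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Hypothetical toy judgements: three workers label two product images.
toy_ratings = [["dress", "dress", "dress"], ["shoe", "shoe", "shoe"]]
toy_kappa = fleiss_kappa(toy_ratings, ["dress", "shoe"])
```

A value near 1 indicates agreement well above chance, a value near 0 indicates chance-level agreement, and negative values indicate systematic disagreement.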

Fourthly, we have developed In-Database-Mining and Deep Learning methods:
- Performed a study on integrating entity linkage into a main-memory database system (IDEL), integrated with MonetDB. A demo and a paper are available.
- Undertook work on Neural Paragraph Retrieval (SMART-MD). A demo and a paper are available.
- Realised an in-database machine learning approach in MonetDB. Two lightning talks, a poster and a YouTube video are available.

Fifthly, we have worked on social media streams:
- Developed a tool (RecovDB) for the recovery of missing values and implemented it within MonetDB.
- Created a method (TimeSVDvc) for the prediction of user preferences in fashion data (code: https://github.com/FashionBrainTeam/timesvd_vc.git D5.3).
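
To make the missing-value recovery task concrete, the sketch below fills gaps in a single time series by plain linear interpolation. This is a naive baseline chosen for illustration only. It is not the algorithm implemented in RecovDB, and the function name and the sample series are assumptions.

```python
def fill_missing_linear(series):
    """Return a copy of `series` with None gaps filled by linear interpolation.

    Gaps at the start or end are filled with the nearest known value.
    A series without any known value is returned unchanged.
    """
    filled = list(series)
    known = [i for i, value in enumerate(filled) if value is not None]
    if not known:
        return filled

    # Leading and trailing gaps: repeat the nearest observed value.
    for i in range(known[0]):
        filled[i] = filled[known[0]]
    for i in range(known[-1] + 1, len(filled)):
        filled[i] = filled[known[-1]]

    # Interior gaps: draw a straight line between the neighbouring observations.
    for left, right in zip(known, known[1:]):
        step = (filled[right] - filled[left]) / (right - left)
        for i in range(left + 1, right):
            filled[i] = filled[left] + step * (i - left)
    return filled


# Hypothetical daily counts with three missing readings.
recovered = fill_missing_linear([1.0, None, 3.0, None, None, 6.0])
```

Interpolating each series in isolation ignores correlations between related series, so it serves only as a point of comparison for a dedicated recovery tool.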

Finally, we have worked on text-to-product and image-to-product search:
- Developed an image to product entity linkage data model (demo: http://instashop.fashwell.com/fashionbrain_instances).
- Developed a state-of-the-art general NLP framework, FLAIR (code: https://github.com/zalandoresearch/flair).

Final results

Our work in neural language modelling for sequence labelling has produced new state-of-the-art results in a number of core natural language processing (NLP) tasks, including Named Entity Recognition (NER) and Part-of-Speech (PoS) tagging. The approach has been bundled into a framework called FLAIR, which is available as open source on GitHub as a Python library built upon the PyTorch deep learning framework. The library ships with state-of-the-art pre-trained models for a range of NLP tasks and includes options for training custom models.

The image entity linkage data model outperforms Google’s state-of-the-art result on the academic DeepFashion consumer-to-shop benchmark datasets: Google (Song et al., 2017) 39.2%, Fashwell 40.1%. These achievements significantly improve on the quality of existing technologies. Based on results from the FashionBrain project, Fashwell have been able to increase their service quality (accuracy) by around 20% across all of their offerings.

We benchmarked a number of existing Open Information Extraction tools; a key finding was that none of these tools works well on idiosyncratic data (recall and precision are often much lower than 40%). We also investigated tools based on Deep Neural Networks (DNNs) for the task of relation extraction, and tools for paragraph retrieval to identify paragraphs rich in factual information for the fashion domain.
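
As a reminder of how the two reported metrics are defined, the sketch below scores a set of extracted relation triples against a gold standard using exact matching. The triples are invented for illustration and are not taken from the project's benchmark. The function name and the exact-match criterion are likewise assumptions.

```python
def precision_recall(extracted, gold):
    """Precision and recall of extracted triples against gold triples.

    A triple counts as correct only if it matches a gold triple exactly.
    """
    extracted, gold = set(extracted), set(gold)
    true_positives = len(extracted & gold)
    precision = true_positives / len(extracted) if extracted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


# Hypothetical (subject, relation, object) triples for a product description.
gold_triples = {
    ("jacket", "made_of", "denim"),
    ("jacket", "has_colour", "blue"),
    ("jacket", "has_fit", "slim"),
    ("jacket", "has_closure", "buttons"),
    ("jacket", "has_season", "spring"),
}
extracted_triples = {
    ("jacket", "made_of", "denim"),       # correct
    ("jacket", "is", "perfect"),          # not in the gold standard
    ("look", "goes_with", "sneakers"),    # not in the gold standard
    ("jacket", "has_colour", "navy"),     # near miss, counted as wrong
}

precision, recall = precision_recall(extracted_triples, gold_triples)
```

With one correct triple out of four extracted and five expected, both scores in this toy case fall well below the 40% level mentioned above.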

Since the winter term 17/18, Beuth have been running an English-language Data Science class (Masters) with 22 students (http://data-science.berlin/). In the last two periods there have been more than 300 applicants. FashionBrain-related topics, such as Deep Learning and entity linkage, are part of the curriculum. Students have also investigated how to understand deep learning on fashion images (mostly trousers and shoes from the Zalando MNIST dataset) with layerwise relevance propagation (see: https://prof.beuth-hochschule.de/loeser/teaching/archive/ws-1718/?L=0). Moreover, students from the course have founded a company that uses text mining methods to predict the next bestsellers: http://www.qualifiction.info/ (German only).

Our work on Machine Learning solutions for crowdsourcing, presented at HCOMP-18, an internationally leading conference on Human Computation, won the Best Paper Award, and we received an invitation to publish the paper in the Journal of Artificial Intelligence Research (paper forthcoming).

Website & more info

More info: https://fashionbrain-project.eu/.