
Periodic Reporting for period 2 - SEQCLAS (A Sequence Classification Framework for Human Language Technology)

Teaser

The goal of this project is to develop a unifying framework of new methods for sequence classification and thus to push the state of the art in automatic speech recognition and statistical machine translation. The unifying principle in these human language technology (HLT)...

Summary

The goal of this project is to develop a unifying framework of new methods for sequence classification and thus to push the state of the art in automatic speech recognition and statistical machine translation. The unifying principle in these human language technology (HLT) tasks is that the system has to handle a sequence of input data and to generate an associated sequence of output symbols (i.e. words or letters) in a natural language like English or Arabic. For speech recognition, the input is the acoustic waveform or the sequence of feature vectors after feature extraction, and the task is to generate a correct transcription of the spoken word sequence. For machine translation, the input sequence is the sequence of words (or letters) in a source language, and the output sequence to be generated is a well-formed sequence of words (or letters) in the target language. There is a third HLT task that is very similar to speech recognition and that we will therefore occasionally consider: the recognition of text images, i.e. of printed or handwritten text.

Speech recognition, machine translation and text image recognition are key techniques that are needed in a large number of everyday situations:

* information access and management:
Due to the progress in information technology, huge amounts of unstructured speech and text data are now available on the World Wide Web and in the archives of companies, organizations and people. This data may exist in three forms:
1. speech data in audio and video documents;
2. digital text as in books, newspapers, patents, word-processor documents and e-mails;
3. image text (printed or handwritten) in scanned books and documents.
In addition, this data may exist in various languages, which is important for the multi-lingual situation in Europe. The big challenge is to provide much more powerful methods for the automatic processing of this unstructured data and for content-based access, e.g. by speech recognition, machine translation and text image recognition.

* human-machine and human-human communication using speech:
applications that can support humans in communication (customer help lines, call centers, e-commerce, human-human communication using speech translation, question-answering systems, ...).

After four decades of research, some techniques for sequence classification are now sufficiently developed to be deployed in practical applications (e.g. services like Google Translate, Amazon's Alexa, Apple's Siri and similar services for speech or text input). Yet the major intellectual challenge is still to improve the basic technologies in order to better meet the needs of real-world applications and ultimately to achieve (or even exceed) human performance.

Work performed

(01-Aug-2016 to 31-Jan-2019)

Despite the huge progress made in the field, the specific aspects of sequence classification have not been addressed adequately in past research. Instead of developing specific solutions for each particular task independently of the others, our approach is to identify fundamental problems in sequence classification that show up across the three HLT tasks and to come up with a novel unifying framework. In agreement with the proposal, the work carried out was organized into five tasks:

* Task 1: a theoretical framework for sequence classification:
In principle, the starting point for virtually all successful approaches to sequence classification is the Bayes decision rule for optimum performance. In practice, however, many simplifications and approximations are used, and it is not clear how they affect the final performance. We have taken first steps towards developing a theoretical framework around the performance criterion that gives answers to questions like: How does the system performance depend on the language model? How does the rule of selecting the most likely sequence affect the optimality of the Bayes decision rule and thus the system performance?
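
For reference, the decision rule in question can be stated compactly. In the notation common in the statistical approach to speech recognition, with input feature sequence x_1^T and output word sequence w_1^N, the rule combines a language model p(w_1^N) with an acoustic (or translation) model p(x_1^T | w_1^N); in LaTeX notation:

    \hat{w}_1^N = \operatorname*{argmax}_{N,\, w_1^N} \big\{\, p(w_1^N) \cdot p(x_1^T \mid w_1^N) \,\big\}

This is the textbook form; practical systems scale the model probabilities, replace sums over hidden alignments by maxima, and prune the search space, and the theoretical work in this task asks how such deviations affect the performance of the exact rule.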

* Task 2: consistent modeling:
Sequence classification requires probabilistic models whose parameters are learned from the training data. In this task, we put emphasis on the requirement that these models should be exactly the same in training and testing, which for historical reasons is not always the case, as in phrase-based machine translation. The most important result of this task is a framework that allows a direct combination of deep learning with the alignment concepts introduced for generative hidden Markov models. This framework of neural hidden Markov models has been applied both in speech recognition and in machine translation. The initial versions of these systems show results competitive with state-of-the-art methods and will be further extended. In addition, we have worked on neural-network-based feature extraction directly from the speech waveform, which in the future could improve on traditional spectral analysis.
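
To illustrate how deep learning and HMM alignment concepts can be combined, the sketch below runs a plain HMM forward recursion in which the emission scores come from a neural layer (here untrained and randomly initialized). All names, shapes and the toy network are illustrative assumptions, not the project's actual models:

    import numpy as np

    def neural_emissions(features, n_states, rng):
        # Stand-in for a trained network: maps each input frame to
        # log-probabilities over HMM states (here a random linear layer).
        W = rng.standard_normal((features.shape[1], n_states))
        logits = features @ W
        return logits - np.logaddexp.reduce(logits, axis=1, keepdims=True)

    def forward_log_prob(log_emis, log_trans, log_init):
        # Standard HMM forward recursion in log space, summing over all
        # alignment paths instead of committing to one best alignment.
        T = log_emis.shape[0]
        alpha = log_init + log_emis[0]
        for t in range(1, T):
            alpha = log_emis[t] + np.logaddexp.reduce(alpha[:, None] + log_trans, axis=0)
        return np.logaddexp.reduce(alpha)  # log p(x_1^T), all paths summed

    rng = np.random.default_rng(0)
    feats = rng.standard_normal((20, 8))           # 20 frames, 8-dim features
    S = 4                                          # number of HMM states
    log_trans = np.log(np.full((S, S), 1.0 / S))   # uniform transition model
    log_init = np.log(np.full(S, 1.0 / S))
    print(forward_log_prob(neural_emissions(feats, S, rng), log_trans, log_init))

The point of the construction is that the alignment between input frames and output states is not fixed in advance: the forward recursion sums over all alignment paths, so the neural emission model can be trained without pre-computed alignments.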

* Task 3: performance-aware training and modeling:
What ultimately matters for a sequence classification system is its performance. Most activities in this task have been concerned with language models and acoustic models. In language modelling, we have achieved significant improvements by using refinements of recurrent neural networks. For acoustic modelling, we have refined sequence discriminative training, which resulted in significant improvements in recognition accuracy.
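
As a concrete illustration of sequence discriminative training, the sketch below computes the MMI (maximum mutual information) criterion for one utterance: the joint log-probability of the reference transcription minus the log-sum over competing hypotheses. Real systems compute the denominator over lattices or lattice-free; the n-best simplification and the toy scores here are illustrative assumptions:

    import math

    def mmi_criterion(ref_score, competitor_scores):
        # Scores are joint log-probabilities log p(X|W) + log p(W), i.e.
        # acoustic model score plus (scaled) language model score.
        scores = [ref_score] + competitor_scores  # denominator: ref + competitors
        m = max(scores)
        log_denom = m + math.log(sum(math.exp(s - m) for s in scores))
        return ref_score - log_denom              # log posterior of the reference

    # Toy example: the reference competes with two alternative hypotheses.
    print(mmi_criterion(-10.2, [-11.5, -13.0]))

Maximizing this criterion raises the posterior probability of the correct word sequence relative to its competitors, which ties the training objective more closely to recognition accuracy than frame-level likelihood does.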

* Task 4: unsupervised training:
In the conventional approaches to machine translation, the systems undergo a supervised type of training, i.e. the system is given a huge set of sentence pairs in the source and target languages. In contrast, in unsupervised training, no explicit sentence pairs are available, only monolingual data in each of the two languages, and the questions are: Can we train a system without explicit sentence pairs? How well does it work? Along these lines, we have achieved first results for bidirectional English-German translation without using any parallel data for training. The training of the system was based on monolingual data only, in combination with cross-lingual word embeddings and iterative back-translation. This system ranked first among the unsupervised translation systems in the WMT 2018 evaluation. We have also studied various methods to speed up unsupervised training.
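
One ingredient named above, cross-lingual word embeddings, can be obtained by mapping two independently trained monolingual embedding spaces onto each other with an orthogonal transformation. The sketch below uses the closed-form Procrustes solution on synthetic data; the dimensions and the availability of seed translation pairs are illustrative assumptions (fully unsupervised variants induce the seed dictionary automatically):

    import numpy as np

    rng = np.random.default_rng(0)
    d, n_pairs = 50, 200                   # embedding dim, seed dictionary size
    X = rng.standard_normal((n_pairs, d))  # source-language embeddings (seed pairs)
    true_W = np.linalg.qr(rng.standard_normal((d, d)))[0]
    Y = X @ true_W                         # corresponding target-language embeddings

    # Orthogonal Procrustes: minimize ||X W - Y||_F over orthogonal W.
    # Closed form: W = U V^T, where U S V^T is the SVD of X^T Y.
    U, _, Vt = np.linalg.svd(X.T @ Y)
    W = U @ Vt

    print(np.allclose(X @ W, Y))           # True: the mapping is recovered

Once source and target embeddings live in a shared space, candidate word translations can be induced by nearest-neighbour search, which provides the starting point for the iterative back-translation loop.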

* Task 5: research prototype systems:
The methods and algorithms developed have been integrated into the team's high-performance systems and have been evaluated on public databases and international benchmarks:
* for speech recognition: CHiME, Switchboard, LibriSpeech, AMI;
* for machine translation: WMT 2017 and 2018; IWSLT 2017 and 2018.

Final results

So far, the project has produced many interesting results that go beyond the state of the art and have been presented at scientific conferences and workshops.
There are four directions that we consider to be the most important ones:
* a theoretical framework and bounds on classification error;
* direct neural hidden Markov models for machine translation;
* direct neural hidden Markov models for speech recognition;
* ANN-based feature extraction from the speech signal.