Opendata, web and dolomites

Report

Teaser, summary, work performed and final results

Periodic Reporting for period 2 - FASTPARSE (Fast Natural Language Parsing for Large-Scale NLP)

Teaser

The popularization of information technology and the Internet has resulted in an unprecedented growth in the scale at which individuals and institutions generate, communicate and access information. In this context, the effective leveraging of the vast amounts of available...

Summary

The popularization of information technology and the Internet has resulted in an unprecedented growth in the scale at which individuals and institutions generate, communicate and access information. In this context, the effective leveraging of the vast amounts of available data to discover and address people\'s needs is a fundamental problem of modern societies.

Since most of this circulating information is in the form of written or spoken human language, natural language processing (NLP) technologies are a key asset for this crucial goal. NLP can be used to break language barriers (machine translation), find required information (search engines, question answering), monitor public opinion (opinion mining), or digest large amounts of unstructured text into more convenient forms (information extraction, summarization), among other applications.

These and other NLP technologies rely on accurate syntactic parsing to extract or analyze the meaning of sentences. Unfortunately, current state-of-the-art parsing algorithms have high computational costs, processing less than a hundred sentences per second on standard hardware. While this is acceptable for working on small sets of documents, it is clearly prohibitive for large-scale processing, and thus constitutes a major roadblock for the widespread application of NLP.

The goal of this project is to eliminate this bottleneck by developing fast parsers that are suitable for web-scale processing. To do so, FASTPARSE will improve the speed of parsers on several fronts: by avoiding redundant calculations through the reuse of intermediate results from previous sentences; by applying a cognitively-inspired model to compress and recode linguistic information; and by exploiting regularities in human language to find patterns that the parsers can take for granted, avoiding their explicit calculation. The joint application of these techniques will result in much faster parsers that can power all kinds of web-scale NLP applications.

Work performed

So far, we have developed and published a wide range of research results on fast syntactic parsing models, as well as research studying underlying properties of human language that are crucial to build efficient, cognitively-inspired syntactic parsers. These results include:

- Quantitative studies of crossing dependencies (a syntactic phenomenon that makes language more difficult to parse for computers) in different languages, and their origin; as well as studies on the length and direction of syntactic dependencies (properties that also affects difficulty and speed of parsing) and their influence in parsing with different models.

- Novel techniques for greedy transition-based parsing (the fastest family of parsing models currently available), including new dynamic oracles to increase the accuracy of greedy algorithms without affecting their speed, non-monotonic parsing techniques, semantic parsing models, and algorithms that reduce the length of transition sequences needed to parse a sentence.

- New architectures that make exact inference for non-projective transition-based parsing (a highly accurate and flexible method for parsing) practically feasible for the first time. Previously, the computational complexity requirements of known algorithms for this purpose made these techniques too slow to be applicable in practice.

- A new method to implement constituent parsing as sequence labeling, a standard machine learning task that is usually considered simpler and faster to implement, taking advantage of the bounded depth of constituent trees in real natural language usage, which has produced the fastest parsing speeds reported for English and other languages by a wide margin. We have also applied a generalization of the same method to dependency parsing, again breaking reported speed records by a large margin, and used multitask learning so that constituent and dependency parsing can benefit from each other for the first time.

- A technique for speeding up parsers using memorized previous results.

- A cognitively-inspired model based on splitting sentences into chunks to obtain faster parsers.

- A variant of the previously most accurate dependency parser (a parser based on a recent neural network architecture called pointer networks) which uses a different algorithm to guide processing, increasing its accuracy even more while making it twice as fast.

- A parser, also based on pointer networks, for languages with discontinuous constituents (such as German) whose accuracy is the best reported to date by a wide margin, while being very fast.

Final results

During this period, we have substantially advanced the state of the art in several fronts. Firstly, in terms of the speed of parsing algorithms, which is the main goal of the project: for example, we have presented greedy parsers that need a smaller number of actions than the existing alternatives to analyze a sentence; we have made dynamic programming for non-projective transition-based parsing feasible by presenting an actual implementation, while previous approaches were purely theoretical due to being too slow to be usable in practice; and we have introduced a novel parsing paradigm based on reformulating the problem as sequence labeling, yielding significantly better speeds than all previous existing methods; among other developments. Secondly, in terms of understanding of the underlying difficulties for parsing and of human language processing (by means of studies of quantitative properties of human syntax). Additionally, we have also advanced the state of the art accuracy both in dependency parsing and in phrase structure parsing with discontinuous constituents. In dependency parsing, our left-to-right pointer-network parsers achieved the highest accuracy in the standard parsing benchmarks of English and various other languages, while at the same time being twice as fast as its predecessors. In discontinuous constituent parsing, our model outperformed the previous state of the art in German (the commonly-used benchmark language for this task) by a wide margin of over 4 percentage points.

During the rest of the project, we expect to continue to improve the speed of syntactic parsers by combining knowledge from linguistics, cognitive science and computer science. We will finish developing the techniques to speed up parsers based on reuse of previous results and on chunking, which will provide further speed gains when added to the models developed until now. We will also refine and improve the technique of parsing as sequence labeling to further optimize its speed, extend it to discontinuous constituents and semantic parsing, and continue exploiting regularities in constituent trees to further reduce the search space of parsers and improve their efficiency.

Overall, we intend to continue achieving major gains in parsing speed, spanning different languages (including those especially challenging for computational processing, such as languages with high degree of non-projectivity, or with complex morphology) and parser types (including constituent and dependency parsers) that will make parsing feasible at the web scale without the need for massive computational resources, hence making natural language processing technologies more useful and widely accessible.

Website & more info

More info: http://fastparse.grupolys.org.