

Periodic Reporting for period 1 - ThReDS (A Theory of Reference for Distributional Semantics)

Summary

The overarching goal of ThReDS was to build a computer system that can 'refer', i.e. generate descriptions of concepts or entities in a way that uniquely identifies them for a human (e.g. Harry Potter = "the wizard with the round glasses", Hermione = "the best student at Hogwarts", etc.). Having such a system would let us understand better how humans communicate with each other, and help us build artificial agents that can converse with us.
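To make the notion of 'uniquely identifying' concrete: a classic way to select distinguishing properties is an incremental content-selection loop in the style of Dale and Reiter's Incremental Algorithm. The sketch below is my own simplified illustration with invented data, not the system built in the project:

```python
def refer(target, distractors, preferred_order):
    """Simplified referring-expression generation: walk through attributes
    in a fixed preference order and keep each one that rules out at least
    one distractor, until the target is uniquely identified."""
    description = {}
    remaining = list(distractors)
    for attr in preferred_order:
        if attr not in target:
            continue
        ruled_out = [d for d in remaining if d.get(attr) != target[attr]]
        if ruled_out:  # this property discriminates, so keep it
            description[attr] = target[attr]
            remaining = [d for d in remaining if d not in ruled_out]
        if not remaining:
            break
    return description

# Toy entities, each a bundle of properties (all values invented):
harry = {"type": "wizard", "glasses": "round", "hair": "black"}
hermione = {"type": "wizard", "glasses": None, "hair": "brown"}
ron = {"type": "wizard", "glasses": None, "hair": "red"}
print(refer(harry, [hermione, ron], ["type", "glasses", "hair"]))
# {'glasses': 'round'} -- "the one with the round glasses"
```

Note that this presupposes a structured, property-based representation of each entity -- exactly what raw distributional vectors lack.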

Before an artificial agent can talk, it needs to learn about the world, just as a child would do. In linguistic and computational terms, this means acquiring *representations* of the things the agent is exposed to. In the field of Distributional Semantics, such computational representations have traditionally been built from raw text data (sometimes enriched with visual information) and take the form of a 'vector', that is, a mathematical model of the way a particular word is used by human beings, as experienced by the agent. Such vectors can be found in many everyday applications like search engines, recommendation systems and conversational agents. So far, however, they have only been constructed for *concepts* (e.g. 'student', 'owl', 'broom') rather than individual entities ('Harry Potter', 'Hedwig', 'Harry's Nimbus 2000'). This is because current algorithms need considerable amounts of data to learn properly, and references to individual entities are much less frequent in raw text than generic occurrences of words. Further, those raw vector representations are not suitable to refer from, because they do not explicitly encapsulate the properties of the concept or individual that a human would use to identify them (e.g. 'wearing glasses' for Harry Potter). In order to make vectors compatible with so-called 'Referring Expression Generation' systems, that is, algorithms that can produce successful references to things in the world, a translation must be found to a more formal and structured representation of meaning, which in theoretical linguistics has its incarnation in 'Model-theoretic Semantics'.
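A minimal way to see what such a vector is: count, for every word, how often every other word occurs near it in a corpus. The sketch below is a toy count-based illustration (the project's own work builds on prediction-based, Word2Vec-style models; the corpus here is invented):

```python
from collections import Counter, defaultdict

def cooccurrence_vectors(sentences, window=2):
    """Build a count-based distributional vector for each word:
    each dimension counts how often another word appears within
    `window` positions of it."""
    vectors = defaultdict(Counter)
    for tokens in sentences:
        for i, word in enumerate(tokens):
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    vectors[word][tokens[j]] += 1
    return vectors

corpus = [
    "the wizard wore round glasses".split(),
    "the student read a book about glasses".split(),
]
vecs = cooccurrence_vectors(corpus)
# 'glasses' occurs next to 'round' once in this toy corpus
print(vecs["glasses"]["round"])
```

In practice such vectors are dense, low-dimensional and learned by a neural network, but the intuition -- a word's meaning as a summary of its contexts of use -- is the same.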

ThReDS tackled two challenges: a) the computational extraction of representations of entities from raw text, concentrating on the small-data issue; b) the theoretical account of how raw exposure to linguistic data (distributional semantic information) can shape the agent's representation of the world (their model-theoretic semantics).

Work performed

As part of the project, I have released a piece of software called Nonce2Vec, which allows a computer to learn the meaning of a new word extremely rapidly, from a single sentence of context. This ability to learn from very little data is called 'fast-mapping' in psycholinguistics and is a hallmark of human intelligence. Nonce2Vec is a first approximation of this remarkable human trait. The software is accompanied by a publication in the proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing (EMNLP 2017).
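The core intuition behind fast-mapping can be sketched in a few lines. Nonce2Vec itself initialises the unknown word from its sentence context and then refines that vector with a high, rapidly decaying learning rate inside a modified Word2Vec model; the toy code below shows only the additive initialisation step, with invented vectors:

```python
import numpy as np

# Toy pretrained vectors for known words (in a real setting, these would
# be Word2Vec embeddings; all names and values here are illustrative).
rng = np.random.default_rng(0)
background = {w: rng.normal(size=50) for w in
              ["a", "young", "wizard", "rode", "his", "broom", "fast"]}
stopwords = {"a", "his"}

def fast_map(context, space, stop):
    """Additive baseline for fast-mapping: initialise the new word's
    vector as the average of the informative known words around it."""
    informative = [space[w] for w in context if w in space and w not in stop]
    return np.mean(informative, axis=0)

# One sentence containing the unknown word 'quaffle':
sentence = "a young wizard rode his quaffle fast".split()
vec = fast_map(sentence, background, stopwords)
print(vec.shape)  # (50,)
```

The unknown word is skipped automatically because it has no entry in the background space; everything else in its sentence contributes to the first guess at its meaning.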

The hope is that Nonce2Vec will pave the way for extracting representations of single entities, as well as new concepts, from text. To give an example, if a computer were to simulate a human reading the Harry Potter series, it should be able to very quickly learn who Harry or Hermione are, and to follow their development throughout the text, modifying its representation of the characters as it reads. It should similarly be able to learn new concepts like 'quidditch'. I have started setting up an experiment to test the ability of the software to acquire quality representations of individuals. The experiment involves simulating the broad picture that a human might acquire of a person after reading a Wikipedia article about them. A balanced dataset of individuals has been produced, together with an experimental setup for eliciting individual properties from human subjects. This preliminary work will allow me to conduct a series of behavioural and computational experiments in the future.

Finally, the linguistic theory behind ThReDS has been laid out in a draft paper. The paper highlights how what is *said* about an entity or concept (i.e. the noisy, observable data that a human or computer might learn from) can be related to the *properties* of that entity/concept in the world (a 'cleaner', database-like representation). This translation from observable data to properties is essential to explain how humans acquire the formal models necessary to discriminate between individuals and refer to them accurately.
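As a toy illustration of that translation (my own simplification, not the draft paper's actual formalism): one can decide that a property holds of an entity whenever the entity's distributional vector is similar enough to a vector standing for the property. All vectors and the threshold below are invented:

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity between two vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def to_model(entity_vec, property_vecs, threshold=0.5):
    """Map a noisy distributional vector to a database-like
    representation: a property is taken to hold of the entity when
    the entity's vector is similar enough to the property's vector."""
    return {p: cosine(entity_vec, v) >= threshold
            for p, v in property_vecs.items()}

# Hand-made 3-d vectors, for illustration only:
harry = np.array([0.9, 0.8, 0.1])
props = {"wears_glasses": np.array([1.0, 0.7, 0.0]),
         "is_teacher":    np.array([0.0, 0.1, 1.0])}
print(to_model(harry, props))
# {'wears_glasses': True, 'is_teacher': False}
```

The output is exactly the kind of structured representation a Referring Expression Generation system needs, while the input is the kind of noisy vector distributional methods produce.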

Final results

The work performed in ThReDS is expected to have an impact on computational linguistics, as well as on neighbouring fields such as cognitive science and artificial intelligence (AI). First, I have released a neural network tool for linguistic fast-mapping (Nonce2Vec) which derives from a popular architecture for learning computational representations of meaning (Word2Vec), and can thus be fully integrated with the current state of the art in AI. Second, the project was one of very few endeavours to produce vector representations of singular entities -- a challenge for computational linguistics due to the small amount of data available for each entity in text corpora. Nonce2Vec will hopefully help tackle this challenge. Finally, on a more theoretical level, the project has produced an account of the relation between formal, property-like representations and distributional, raw textual data. The proposal is a novel contribution to the growing subfield of 'Formal Distributional Semantics', i.e. the research area dealing with the integration of logical and corpus-extracted meaning representations.

On a more general level, the contributions made by ThReDS are expected to help build more intelligent artificial agents and, crucially, to help us understand *how* and *what* they learn through their exposure to linguistic data.

Website & more info

More info: http://aurelieherbelot.net/threds/.