Opendata, web and dolomites


Non-sequence models for tokenization replacement

Project "NonSequeToR" data sheet

The following table provides information about the project.


Organization address
postcode: 80539


Coordinator country: Germany [DE]
Total cost: 2,500,000 €
EC max contribution: 2,500,000 € (100%)
Programme: H2020-EU.1.1. (EXCELLENT SCIENCE - European Research Council (ERC))
Call code: ERC-2016-ADG
Funding scheme: ERC-ADG
Starting year: 2017
Duration (year-month-day): from 2017-10-01 to 2022-09-30


Take a look at the project's partnership.

# participants  country  role  EC contrib. [€] 


 Project objective

Natural language processing (NLP) is concerned with computer-based processing of natural language, with applications such as human-machine interfaces and information access. The capabilities of NLP are currently severely limited compared to those of humans. NLP has high error rates for languages that differ from English (e.g., languages with higher morphological complexity like Czech) and for text genres that are not well edited (i.e., noisy) yet are of high economic importance, e.g., social media text.

NLP is based on machine learning, which requires as its basis a representation that reflects the underlying structure of the domain, in this case the structure of language. But the representations currently used are symbol-based: text is broken into surface forms by sequence models that implement tokenization heuristics and treat each surface form as a symbol or represent it as an embedding (a vector representation) of that symbol. These heuristics are arbitrary and error-prone, especially for non-English and noisy text, resulting in poor performance.
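
To make the symbol-based setup described above concrete, here is a minimal toy sketch. It is not taken from the project; the regular-expression heuristic, the function names and the example sentences are all assumptions chosen for illustration. It shows how a rule-based tokenizer maps each surface form to a symbol id, and how noisy text falls outside the known symbols.

```python
import re

def heuristic_tokenize(text):
    """Hypothetical heuristic: lowercase, then split into word-like runs
    and single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text.lower())

def build_vocab(sentences):
    """Assign every distinct surface form an integer symbol id.
    Id 0 is reserved for surface forms never seen in training."""
    vocab = {"<unk>": 0}
    for sentence in sentences:
        for token in heuristic_tokenize(sentence):
            vocab.setdefault(token, len(vocab))
    return vocab

def encode(text, vocab):
    """Replace each surface form with its symbol id (0 if unknown)."""
    return [vocab.get(token, 0) for token in heuristic_tokenize(text)]

# Illustrative "training" text and two inputs with the same meaning.
vocab = build_vocab(["see you tomorrow at the station"])
clean = encode("see you tomorrow", vocab)
noisy = encode("c u 2morrow", vocab)

print(clean)  # every surface form is a known symbol
print(noisy)  # every surface form collapses to the unknown symbol
```

In this sketch the well-edited input is fully covered by the vocabulary, while the social-media-style spelling of the same message is reduced to unknown symbols, so any embedding lookup keyed on those ids carries no information about it.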

Advances in deep learning now make it possible to take the embedding idea and liberate it from the limitations of symbolic tokenization. I have the interdisciplinary expertise in computational linguistics, computer science and deep learning required for this project and am thus in a unique position to design a radically new, robust and powerful non-symbolic text representation that captures all aspects of form and meaning that NLP needs for successful processing.

By creating a text representation for NLP that is not impeded by the limitations of symbol-based tokenization, the foundations are laid to take NLP applications like human-machine interaction, human-human communication supported by machine translation and information access to the next level.


List of deliverables.
Data Management Plan Open Research Data Pilot 2019-03-25 09:52:52

Take a look at the deliverables list in detail: detailed list of NonSequeToR deliverables.


List of publications.
year authors and title journal last update
2019 Timo Schick, Hinrich Schütze
Learning Semantic Representations for Novel Words: Leveraging Both Form and Context
DOI: 10.5282/ubm/epub.61859
2018 Philipp Dufter, Hinrich Schütze
A Stronger Baseline for Multilingual Word Embeddings
DOI: 10.5282/ubm/epub.61864
2019 Apostolos Kemos, Heike Adel, Hinrich Schütze
Neural Semi-Markov Conditional Random Fields for Robust Character-Based Part-of-Speech Tagging
DOI: 10.5282/ubm/epub.61846
2019 Timo Schick, Hinrich Schütze
Rare Words: A Major Problem for Contextualized Embeddings And How to Fix it by Attentive Mimicking
DOI: 10.5282/ubm/epub.61863
2018 Yadollah Yaghoobzadeh, Heike Adel, Hinrich Schuetze
Corpus-Level Fine-Grained Entity Typing
published pages: 835-862, ISSN: 1076-9757, DOI: 10.1613/jair.5601
Journal of Artificial Intelligence Research 61 2019-06-06
2019 Timo Schick, Hinrich Schütze
Attentive Mimicking: Better Word Embeddings by Attending to Informative Contexts
DOI: 10.5282/ubm/epub.61844
2018 Wenpeng Yin, Hinrich Schütze
Attentive Convolution: Equipping CNNs with RNN-style Attention Mechanisms
published pages: 687-702, ISSN: 2307-387X, DOI: 10.1162/tacl_a_00249
Transactions of the Association for Computational Linguistics 6 2019-06-06
2018 Nina Poerner, Masoud Jalili Sabet, Benjamin Roth, Hinrich Schütze
Aligning Very Small Parallel Corpora Using Cross-Lingual Word Embeddings and a Monogamy Objective
DOI: 10.5282/ubm/epub.61865
2019 Marco Cornolti, Paolo Ferragina, Massimiliano Ciaramita, Stefan Rüd, Hinrich Schütze
published pages: 1-42, ISSN: 1046-8188, DOI: 10.1145/3284102
ACM Transactions on Information Systems 37/1 2019-06-06


The information about "NONSEQUETOR" is provided by the European Opendata Portal: CORDIS opendata.
