
Periodic Reporting for period 1 - ARGUE_WEB (Probabilistic Argumentation on the Web)

Teaser

A review of the current literature on the extraction of arguments and their components revealed that no existing tools enable such extraction. Recent research on the classification of argumentative text using traditional...

Summary

A review of the current literature on the extraction of arguments and their components revealed that no existing tools enable such extraction. Recent research on the classification of argumentative text using traditional machine learning algorithms has focused on the recognition of arguing subjectivity (see the brief review of research below) and showed that it is possible to identify subjectivity in arguing text found in online debates (blog posts and editorials). The methods employed in these cases were similar to those used for sentiment recognition in text. Although research on arguing subjectivity represents significant progress in the recognition of arguments, it is still not possible to extract, represent and use natural language arguments in knowledge representation and reasoning (KRR) systems.

In order to overcome this obstacle, the Researcher was asked to investigate neural network deep learning algorithms for the extraction of such data. The lack of sufficiently large annotated datasets was a major problem in the experimental evaluation of different neural network architectures. The Researcher tried to overcome the problem by using NLP tools. Although NLP tools are widely used in traditional machine learning tasks, their accuracy is very low when used for the extraction of specific types of text. The main goal of this project can be defined as: to investigate algorithms and tools capable of identifying argumentative text in online textual resources. The term 'argumentative text' in this context refers to text intended to persuade the participants in a debate.

Work performed

\"The work covered so far falls under the following two categories:
(I) Natural Language and Linguistics for the identification of patterns in text.
(II) Neural Deep Learning algorithms for the classification of text (ongoing)

(I) In terms of Natural Language and Linguistics, during the period of my fellowship I became acquainted with, and have implemented, various methods falling under the following categories:

1. Extraction of data from various public political forums on the web and from the AraucariaDB files. In particular, I have used the Scrapy [22] library to extract data from the forums www.debatepolitics.com, www.politicsforum.ac.uk, and www.discoursedb.org; a minimal spider sketch is given below.
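The following is a minimal sketch of the kind of spider used, relying on Scrapy's standard Spider API; the CSS selectors and the div.post structure are illustrative assumptions, not the forums' actual markup.

import scrapy

class DebateSpider(scrapy.Spider):
    name = "debate_posts"
    start_urls = ["https://www.debatepolitics.com/"]  # one of the forums listed above

    def parse(self, response):
        # Collect the raw text of each post (the selectors are hypothetical)
        for post in response.css("div.post"):
            yield {"text": " ".join(post.css("::text").getall()).strip()}
        # Follow pagination links, if present
        next_page = response.css("a.next::attr(href)").get()
        if next_page is not None:
            yield response.follow(next_page, callback=self.parse)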

2. Clustering algorithms (e.g. K-Means) showed the difficulty of identifying categories when semantic relevance cannot be modeled adequately, as sketched below.
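As an illustration of that difficulty, here is a minimal sketch assuming scikit-learn; the documents and the cluster count are invented for the example.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

docs = [
    "The President argued that the bill was an excuse by the EU",
    "I believe the new policy is a terrible idea",
    "The committee met on Tuesday to review the agenda",
]
X = TfidfVectorizer(stop_words="english").fit_transform(docs)
# The clusters are driven by surface word overlap, so semantically related
# posts that share little vocabulary often land in different clusters.
print(KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X))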

3. Parsing techniques and grammars.
(a) I have implemented various parsing and lexicographic methods in order to extract the syntactic patterns involved in argumentative text and to assess their effectiveness for the automatic extraction of such text. For example, the following lexicographic pattern (via regular expressions) shows that even for very simple syntactic patterns, regular expressions can become quite complex:

president = \"\"The President of Cyprus argued that the bill was an excuse by the EU\"\"
toksTags = \';\'.join([\'/\'.join(list(t)) for t in nltk.pos_tag(president.split())])
arguer = re.compile(r\"\"(?Pthe/DT;[A-Za-z]+(/NN|/NNS|/NNP);((of|in)/IN;[A-Za-z]+/(NN|NNS|NNP);)?)(?P(argue/VBP;|argued/VBD;))\"\", re.I)
print(“RESULT:”, arguer.search(toksTags).groupdict())
> RESULT: {\'ARG_EVENT\': \'argued/VBD;\', \'ARGUER\': \'The/DT;President/NNP;of/IN;Cyprus/NNP;\'}

(b) The NLTK library also enables the representation of text in the form of Tree structures. The intention behind this labeling of textual data is that it can be used to query the data, but it can also be used as input to classification algorithms, like, for example, the recursive deep learning algorithm used for sentiment analysis. A minimal sketch follows.
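This sketch uses NLTK's Tree class; the bracketed parse string is hand-written for illustration.

from nltk import Tree

t = Tree.fromstring("(S (NP (DT The) (NNP President)) (VP (VBD argued)))")
t.pretty_print()
# Subtrees can then be queried by label, e.g. all noun phrases:
for subtree in t.subtrees(filter=lambda s: s.label() == "NP"):
    print(subtree.leaves())  # ['The', 'President']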

(c) Grammars and chunking: part-of-speech tags, although useful for determining the syntactic type of individual tokens, were not adequate to describe sequences of tokens that refer to a single syntactic category, e.g. Noun Phrases (NP) vs Nouns (NN, NNS) in Penn Treebank tags, but also tokens that jointly refer to the subject of a verb. Grammars have been defined via the NLTK library, e.g.

grammar1 = r"CLAUSE: {<NN.*><MD><VB><DT><JJ>*<NN.*>}"

and parsed via RegexpParser (a runnable sketch is given after the chunking example below). Available libraries (like Pattern) provided the ability to perform chunking on nested tagging structures; chunks are derivable as a by-product of Pattern's primary intended functionality. For example, suppose we have a subjective lexicon; we can then extract the subjective parts of Noun Phrases as follows (the example is rather simplistic and is used for illustration purposes):

from collections import namedtuple
from pattern.en import parse, Sentence

Annotation = namedtuple('Annotation', ['text', 'label'])
subjective_lexicon = ['bad', 'good', 'beautiful', 'ugly']

txt = "The ugly bill was an excuse by the EU"  # illustrative input
parsed = parse(txt, lemmata=True, relations=True, chunks=True)
s = Sentence(parsed)
for c in s.chunks:
    if c.type.startswith('NP'):
        # Adjectives inside the noun phrase that appear in the lexicon
        jj_s = [w.string for w in c if w.type == 'JJ']
        subj = [w for w in jj_s if w.lower() in subjective_lexicon]
        if subj:
            annot = Annotation(text=c.string, label='SUBJECTIVE')
            print(annot)
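And the RegexpParser sketch promised above, using grammar1 from (c); the POS-tagged sentence is hand-written for illustration.

from nltk import RegexpParser

grammar1 = r"CLAUSE: {<NN.*><MD><VB><DT><JJ>*<NN.*>}"
tagged = [("government", "NN"), ("should", "MD"), ("reject", "VB"),
          ("the", "DT"), ("controversial", "JJ"), ("bill", "NN")]
print(RegexpParser(grammar1).parse(tagged))
# (S (CLAUSE government/NN should/MD reject/VB the/DT controversial/JJ bill/NN))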

4. Deriving the entity types of textual components can be helpful in cases where components with the same syntactic structure but different entity types should be categorized differently. Examples of libraries used for entity recognition are StanfordNERTagger, ne_chunk (NLTK), and spaCy (v2); a minimal sketch follows.
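This is a minimal NER sketch with spaCy, assuming the small English model has been installed (python -m spacy download en_core_web_sm).

import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("The President of Cyprus argued that the bill was an excuse by the EU")
for ent in doc.ents:
    print(ent.text, ent.label_)
# e.g. Cyprus GPE, EU ORG (exact labels depend on the model version)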


5. Sentiment and Subjectivity are related to the notions of Opinion Mining and Arguing and can be used within other algorithms to determine the nature of the subjectivity involved. Valuable tools that provide information about this aspect…
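One candidate tool of this kind, already used for chunking above, is Pattern; the following is a minimal sketch of its sentiment scoring, offered as an assumption since the list of tools in the report is truncated at this point.

from pattern.en import sentiment

# sentiment() returns (polarity, subjectivity): polarity in [-1, 1],
# subjectivity in [0, 1]; the input sentence is invented for illustration.
polarity, subjectivity = sentiment("The new bill is a beautiful piece of legislation")
print(polarity, subjectivity)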

Final results

Various methods were investigated, but the work remained dispersed. Due to the technical limitations discussed earlier (e.g. the lack of sufficiently large annotated data), no publication has been made so far. Given more time, I am optimistic that the continuation of this research will produce results that will benefit my Institute.

Website & more info

More info: http://www.cs.ox.ac.uk/.