Script Acquisition: A Crowdsourcing and Text Mining Approach
Abstract / Overview
According to Grice’s (1975) theory of pragmatics, people tend to omit basic information
when participating in a conversation (or writing a narrative), on the assumption that the
omitted details are already known or can be inferred from commonsense knowledge by the
hearer (or reader). Writing and understanding texts make particular use of a specific
kind of commonsense knowledge, referred to as script knowledge. Schank and Abelson
(1977) proposed scripts as a model of human knowledge represented in memory that stores
frequent habitual activities, called scenarios (e.g. eating in a fast food restaurant),
and the different courses of action in those routines.
This thesis develops methods that provide a sound empirical basis for high-quality script
models. We work on three key areas related to script modeling: script knowledge acquisition, script induction, and script identification in text. We extend the existing repository
of script knowledge bases in two different ways. First, we crowdsource a corpus of 40
scenarios with 100 event sequence descriptions (ESDs) each, thus going beyond the size of
previous script collections. Second, we enrich the corpus with partial alignments of ESDs
produced by human annotators. The crowdsourced partial alignments are used as prior knowledge to guide the semi-supervised script-induction algorithm proposed in this dissertation.
We further present a semi-supervised clustering approach to induce script structure from
crowdsourced descriptions of event sequences by grouping event descriptions into paraphrase sets and inducing their temporal order. The proposed semi-supervised clustering
model better handles order variation in scripts and extends the script representation formalism
of Temporal Script Graphs by incorporating "arbitrary order" equivalence classes, which
allow for the flexible event order inherent in scripts.
In the third part of this dissertation, we introduce the task of scenario detection, in which
we identify references to scripts in narrative texts. We curate a benchmark dataset of annotated narrative texts, with segments labeled according to the scripts they instantiate. The
dataset is the first of its kind. Analysis of the annotations shows that scenario references in text can be identified with reasonable reliability. Subsequently, we propose a benchmark
model that automatically segments and identifies text fragments referring to given scenarios. The proposed model achieves promising results and therefore opens up research on
script parsing and wide-coverage script acquisition.