Building a Lexical Database for an Interactive Joke-Generator
Abstract.
As part of a project to construct an interactive program which will encourage children to play with language by
building jokes, we have developed a large lexical database, closely based on WordNet. As well as the standard
WordNet information about part of speech, synonymy, hyponymy, etc., we have added phonetic representations and
symbolic links allowing attachment of pictures. All information is represented in a relational database, allowing
powerful searches using SQL via a Java API. The lexicon provides a facility for labelling subsets of its entries with symbolic names, and we are working to incorporate some educationally relevant word lists as sublexicons. This should also
allow us to improve the familiarity ratings which the lexicon assigns to words.
1. Background
Children who have a disability (e.g. cerebral palsy, early
brain trauma) which affects their verbal communication
often develop their linguistic and interpersonal skills much
more slowly than comparable children without these
problems. One factor contributing to this slower
development may be lack of experience of normal, everyday
language use, particularly with the peer group (Donahue &
Bryan 1984). A child who is forced to communicate through
a voice output communication aid (a speech synthesiser
coupled to a suitably engineered text input device) cannot
participate fully in the banter, joking and word play which is
widespread in the conversation of young children.
The aim of the STANDUP project1 (System To Augment
Non-speakers’ Dialogue Using Puns) is to explore a way in
which language technology might help to alleviate this
situation, by providing a software language playground
through which a child can play with words and phrases in a
way which is exploratory, enjoyable and educational. To be
more precise, we are building interactive software which
allows children with language difficulties to explore words
and phrases by building simple puns through a specialised
user interface. The software contains a powerful riddle-generator
which the user controls through menus, options,
and the selection of words. We are about to evaluate the
overall system, by carrying out systematic trials in which
young children will be asked to carry out various tasks with
the software. Standard literacy tests will be used to see how
basic skills and use of the STANDUP system are related.
The feasibility of automating the construction of punning
riddles was demonstrated by the JAPE program, which
could form simple punning riddles (Binsted et al., 1997).
Some of JAPE's better examples were:
What is the difference between leaves and a car?
One you brush and rake, the other you rush and brake.
What do you call a strange market?
A bizarre bazaar.
JAPE was a first research prototype which was limited in
certain ways: it was not interactive (and hence had no real
user interface), it took a long time to produce jokes, and the
quality of the jokes (riddles) was often quite poor. We have
used essentially the same ideas as those used in JAPE to
build a system which is large-scale, fully engineered, robust,
fast enough for interactive use, and with a user interface
suitable for use by our target group (children with
communication disabilities).
A central part of this endeavour was the creation of a
suitable lexicon, since both the joke generator and the user
interface would be largely driven by information about
words (and simple phrases). This paper is about that aspect
of the work – how we defined our lexical requirements, the
existing resources available to us, how we combined some
of these resources into a lexical database, and the overall
facilities provided by the resulting lexicon. Our lexicon is
similar to existing lexicons, but it does have some features
which may be of interest to other potential users:
– all data is stored in relational database tables, accessible via SQL;
– lexical entries contain a variety of linguistic information: syntax, semantics, phonetics, orthography, English gloss;
– a large subset of the lexicon has facilities to attach pictorial images from a standard set;
– the pictorially linkable subset is organised into a simple concept hierarchy;
– various word-frequency information from corpora and educational literature is included.
2. Requirements
The requirements for the lexicon module came from two
sources: the needs of the riddle-generator, and the
requirements of users, in terms of both overall functionality
and specific user-interface facilities.
2.1 Joke generator requirements
Experience with the JAPE program, and some planned
improvements, led us to stipulate that the lexicon should:
i. allow lexical items to be compared for phonetic similarity and identity;
ii. associate a part of speech (POS) with each lexical item;
iii. include simple common noun compounds (e.g. door stop) and idiomatic phrases consisting of a noun and pre-modifier (e.g. red herring);
iv. distinguish different senses of a word/phrase;
v. include information about synonymy;
vi. include hyponymy/hypernymy information;
vii. include meronymy information if feasible.
2.2 User requirements
We followed a user-centred design methodology. This led
us to consult two interested groups: potential users, and
suitable experts (teachers, speech and language therapists).
After drafting some initial design ideas, we presented these
to our informants in a deliberately low-tech manner, involving
sketches and paper mock-ups of user-interface screens
(Manurung et al., 2005; O'Mara et al., 2004). This led to a
number of requirements for the system as a whole; it did not
make sense to ask our informants directly about the needs of
individual modules within the system, such as the lexicon.
We thus developed a specification for the system, including a suitable user interface, and tested the latter with users via a mock-up (with no real joke-generator or lexicon).
The specification for the entire STANDUP system,
particularly the user-interface, had consequences for the
functionality of the lexicon, as follows:
i. speech output should be available;
ii. when displaying a lexical item, a pictorial symbol should, if possible, accompany it, preferably from a standard symbol library used in augmentative and alternative communication (AAC);
iii. word-senses should be grouped into subject areas (topics) to facilitate access by the user;
iv. the topics should be clustered into a hierarchy;
v. it is desirable to allow restricting the available vocabulary to word sets available in the educational or AAC fields;
vi. it must be possible to avoid words deemed unsuitable for the target users (e.g. swear words, sexual terminology).
2.3 Practical considerations
General considerations of practicality, maintainability, etc. meant that data preparation (e.g. reformatting or editing) should be automated where feasible, so that new versions of the lexical resource can be prepared even if the quantities of data are large.
3. Existing Resources
3.1 WordNet
No single lexical database supported all these functions. The
JAPE program used WordNet (Fellbaum, 1998), which
fulfils most of the joke-generator's requirements: it has a
large number of entries (around 200,000), each word form is
associated with multiple senses, senses are grouped into sets
of synonyms and linked to hypernyms, and it is annotated
with word-sense frequency information (SemCor) derived
from a large corpus (Miller et al., 1993). Its use by JAPE
demonstrated that it provided the broad functionality needed
for creating riddles. It is also freely available. However, it
lacks phonetic data -- JAPE used phonetic identity (not
similarity), computed using various resources, including a
homophone list and the British English Example
Pronunciation dictionary. WordNet also lacks pictorial data,
and contains many words which are unsuitable for our target
users (mostly as a result of being highly obscure, non-
British, or archaic, rather than being socially unacceptable).
3.2 The disambiguation problem
A variety of other lexicons are available, mostly based on conventional dictionaries owned by publishers. All of these
provide fewer of the required facilities (for joke generation)
than does WordNet. Moreover, they tend to have two major
limitations: they are not freely available for incorporation
into our software (particularly as we hope to make our
system available at little or no cost), and useful information
(e.g. pictures, frequency data) is usually attached to word
forms (word strings as spelled in normal text) rather than to
word senses (distinct meanings). The latter was a serious
deficiency for us. If a word had two radically different
senses (for example, match meaning “a sporting event”, or
match meaning “a small stick for creating fire”), it would
not be appropriate to use the picture for one sense when
displaying the other sense. Also, one word-sense might be
very common but the other very obscure; for example, bus as a means of transport, or as “the topology of a network
whose components are connected by a busbar”. In such a
case, our joke generator needs to be able to make puns
which depend only on the familiar meaning, as the user is
unlikely to know of very arcane senses. Hence, any
frequency rating which is attached only to the word form
(e.g. bus), as, for example, in the COBUILD dictionary,
could be misleading. We considered various ways in which
statistical or text-matching methods could be used to
associate the attached information (from publishers’
dictionaries) with separate WordNet senses, but could not
find or devise one which seemed sufficiently reliable. The
SemCor frequencies within WordNet, on the other hand, are
attached to senses (“synsets”), which made them
immediately usable.
4. The STANDUP Lexicon
4.1 Overview
Using data from WordNet and other sources, we have built a
relational database, with tables containing fields for word-forms,
word-senses, phonetic representations, the subparts
of compound nouns, etc. There are also familiarity scores
and codes to link to pictorial images. The database also
contains various pre-cached tables of useful linguistic
relations, such as phonetic similarity and rhyming. Access
to the database from our main program (in Java) was
handled by connecting to a Postgres server, which could
respond to queries in SQL.
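As a minimal sketch of this arrangement (the table and column names below are illustrative, not the actual STANDUP schema), a Java client might query the lexicon over JDBC as follows, assuming the Postgres JDBC driver is on the classpath:

    import java.sql.*;

    /** Minimal sketch of querying the lexical database over JDBC.
        The table "lexeme" and its columns are illustrative only. */
    public class LexiconQuery {
        public static void main(String[] args) throws SQLException {
            // Connect to a local Postgres server holding the lexicon.
            try (Connection con = DriverManager.getConnection(
                     "jdbc:postgresql://localhost/standup", "user", "password");
                 PreparedStatement ps = con.prepareStatement(
                     "SELECT word_form, pos, gloss FROM lexeme WHERE word_form = ?")) {
                ps.setString(1, "match");
                try (ResultSet rs = ps.executeQuery()) {
                    while (rs.next()) {
                        // e.g. both senses of "match", each with its own gloss
                        System.out.printf("%s (%s): %s%n",
                            rs.getString("word_form"),
                            rs.getString("pos"),
                            rs.getString("gloss"));
                    }
                }
            }
        }
    }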
4.2 Phonetic forms
From the Unisyn text-to-speech dictionary2, we
constructed a table where an entry contains a word-form, a
unique ID, a part of speech (POS), and a phonetic sequence.
By comparing word-forms and POS data, nearly 100,000
WordNet entries (senses) were unambiguously allocated a
phonetic representation. Additionally, over 32,000 noun
word-forms in WordNet of the forms “X_Y” or “X-Y” (e.g.
“blind_alley”, “self-service”) were treated as compound
nouns, and phonetic representations for the parts were
unambiguously allocated using Unisyn (with POS for X, Y
inferred from their positions).
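The unambiguous-allocation step can be pictured as below; the data structures and the uniqueness test are ours for illustration, not the project's actual code:

    import java.util.*;

    /** Sketch of unambiguous allocation of Unisyn pronunciations:
        a WordNet entry receives a phonetic form only when exactly one
        Unisyn entry shares its word-form and POS. Illustrative only. */
    public class PhoneticAllocation {
        record Entry(String wordForm, String pos, String phonemes) {}

        static Map<String, String> allocate(List<Entry> unisyn,
                                            List<Entry> wordnet) {
            // Group Unisyn pronunciations by (word-form, POS) key.
            Map<String, List<String>> byKey = new HashMap<>();
            for (Entry e : unisyn) {
                byKey.computeIfAbsent(e.wordForm() + "/" + e.pos(),
                                      k -> new ArrayList<>()).add(e.phonemes());
            }
            // Allocate only where the match is unique.
            Map<String, String> allocated = new HashMap<>();
            for (Entry w : wordnet) {   // phonemes field unused for WordNet input
                String key = w.wordForm() + "/" + w.pos();
                List<String> matches = byKey.get(key);
                if (matches != null && matches.size() == 1) {
                    allocated.put(key, matches.get(0));
                }
            }
            return allocated;
        }
    }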
4.3 Phonetic similarity
Phonetic similarity (0 < s ≤ 1, 1 being identity) was
computed between pairs involving all the word forms used
as lexical head words, using a normalised minimum edit
distance (Jurafsky & Martin 2000, Chapter 5) between the
Unisyn phonetic representations, and pairs reaching a
threshold (s ≥ 0.75) were stored in the database, along with
the actual score. SQL queries could then be defined to select only those pairs exceeding some chosen threshold (which therefore had to be no lower than the stored baseline of 0.75).
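The computation can be sketched in Java as below; the unit edit costs and the normalisation by the longer sequence are plausible readings of the cited method, not the project's exact weighting scheme:

    import java.util.List;

    /** Sketch of normalised phonetic similarity via minimum edit distance
        (Jurafsky & Martin 2000, ch. 5). The phoneme symbols and unit
        costs used here are illustrative simplifications. */
    public class PhoneticSimilarity {
        /** Returns s in (0, 1], where 1 means identical phoneme sequences. */
        static double similarity(List<String> a, List<String> b) {
            int[][] d = new int[a.size() + 1][b.size() + 1];
            for (int i = 0; i <= a.size(); i++) d[i][0] = i;   // deletions
            for (int j = 0; j <= b.size(); j++) d[0][j] = j;   // insertions
            for (int i = 1; i <= a.size(); i++) {
                for (int j = 1; j <= b.size(); j++) {
                    int sub = a.get(i - 1).equals(b.get(j - 1)) ? 0 : 1;
                    d[i][j] = Math.min(d[i - 1][j - 1] + sub,
                              Math.min(d[i - 1][j] + 1, d[i][j - 1] + 1));
                }
            }
            // Normalise by the longer sequence so that identity gives 1.0.
            int maxLen = Math.max(a.size(), b.size());
            return maxLen == 0 ? 1.0
                               : 1.0 - (double) d[a.size()][b.size()] / maxLen;
        }

        public static void main(String[] args) {
            // Identical phoneme sequences (e.g. board/bored) score 1.0.
            System.out.println(similarity(List.of("b", "ao", "r", "d"),
                                          List.of("b", "ao", "r", "d")));
        }
    }

Only pairs whose score reached the 0.75 threshold were written to the database table.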
4.4 Other phonetic relations
Various relationships computable from the basic phonetic
forms were pre-computed and stored for faster access:
homophones, e.g. board and bored; rhymes (defined, roughly, as having phonetic forms which end identically from the last stressed syllable onwards), e.g. pub and rub; word forms which were phonetically prefixes of other words, e.g. axe and access; and spoonerism sequences (quadruples of lexemes A, B, C, D whose phonetic forms can be segmented into x, y, z, w such that A = xz, B = yw, C = yz, D = xw, with some syllabic constraints), e.g. burn, ache, urn, bake.
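As an illustration, the rough rhyme test might be coded like this (the stress-marking convention, a "1" suffix on stressed vowels, is a stand-in; Unisyn uses its own notation):

    import java.util.List;

    /** Sketch of the rhyme test: two phonetic forms rhyme (roughly) if
        they are identical from the last stressed vowel onwards. */
    public class RhymeCheck {
        static boolean rhymes(List<String> a, List<String> b) {
            List<String> tailA = tailFromLastStress(a);
            List<String> tailB = tailFromLastStress(b);
            return !tailA.isEmpty() && tailA.equals(tailB);
        }

        // Suffix starting at the last phoneme carrying primary stress.
        static List<String> tailFromLastStress(List<String> phones) {
            for (int i = phones.size() - 1; i >= 0; i--) {
                if (phones.get(i).endsWith("1")) {
                    return phones.subList(i, phones.size());
                }
            }
            return List.of();
        }

        public static void main(String[] args) {
            // "pub" /p ah1 b/ and "rub" /r ah1 b/ share the tail /ah1 b/.
            System.out.println(rhymes(List.of("p", "ah1", "b"),
                                      List.of("r", "ah1", "b"))); // true
        }
    }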
4.5 Frequency/familiarity ratings
As noted already, each lexeme has a SemCor frequency
value, taken directly from WordNet. We have also included
SemCor ratings in some of the other tables which we have
pre-computed to assist the joke-generator. However, this
rating has certain weaknesses for our purposes. It is based
on a sense-annotated version of the Brown corpus (Francis
& Kučera 1982), which contains texts published in the USA
in 1961. This means that the pattern of frequencies is not
highly reliable as a guide to familiarity for young British
children in 2006. For example, some common words, such
as baker, onion, and sleepy score 0 (i.e. do not appear in the
corpus), others (milk, nail), have very low scores (i.e. appear
very rarely in the corpus), whereas some more obscure
terms, such as stock, business, performance, vocational and
polynomial, are highly rated (frequent). We are therefore
treating SemCor scores as a provisional familiarity rating,
until we can devise and implement something better.
4.6 Pictures
In order to have pictures associated with lexemes, there
were two problems to solve: finding a suitable set of
electronic pictorial images, and ensuring that these images
were attached to appropriate senses. The Rebus set of symbols (small picture images), owned by Widgit Software Ltd3, is used in a number of proprietary programs in the general area of special needs and AAC. The symbols are intended to
depict the meanings of individual words, and can be used (in
the Widgit software) for tasks such as elucidating the
meanings of individual words within a text, or constructing
picture arrays for communication devices. Widgit granted
us permission to use the Rebus symbol set (which contains
over 10,000 items) in the STANDUP interactive software.
However, we still faced the disambiguation problem: the
symbols were linked not to word senses but to word forms.
In view of the demand from our users for picture support,
we decided to invest the effort in disambiguating the Rebus
symbol set by hand. As a result, approximately 7500
lexemes in our database have symbolic codes which allow
the direct attachment of Rebus pictures.
4.7 Labelled word sets
The software allows for any arbitrary set of lexemes to be
grouped together and given a mnemonic name, thereby
allowing subsets of the overall lexicon to be manipulated
separately. We have made use of this to impose prohibitions
on particular words. For our educational application, it was
important to be able to exclude certain words from
appearing in computer-generated jokes: swear words, racially offensive terms, etc. We therefore incorporated an
explicit list of words to be excluded from use by the joke
generator. This was done by looking in an electronic version
of the Shorter OED for entries which had “coarse slang” or
“racially offensive” in the relevant fields, then (by hand)
creating a STANDUP-style sublexicon containing only the
corresponding STANDUP lexical entries.
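Over the database, such a prohibition can be pictured as a subquery excluding everything in the named sublexicon (table, column and set names are again hypothetical):

    /** Sketch: select candidate lexemes while excluding a named word set.
        Schema names are hypothetical; the query would be issued through
        the JDBC connection shown in section 4.1. */
    public class ExcludeSublexicon {
        static final String SQL =
            "SELECT l.lexeme_id, l.word_form FROM lexeme l " +
            "WHERE l.lexeme_id NOT IN " +
            "  (SELECT s.lexeme_id FROM sublexicon s " +
            "   WHERE s.set_name = 'unsuitable')";
    }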
We are also looking into defining a set of preferred word
sets, based on various vocabularies from the educational
literature, for two reasons. Firstly, when evaluating the full
STANDUP system with users, it is useful to categorise the
lexemes used within jokes according to their level of
accessibility to children. Secondly, to increase the likelihood
that the joke-generator produces jokes comprehensible to
young children, words in these word-sets should be
preferred in searches for possible words/phrases. We are
currently planning how to integrate this with SemCor data to
give an improved measure of familiarity.
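Purely for illustration, a combined measure might damp the raw SemCor counts and boost senses found in a preferred word set; the weights below are arbitrary, and the project's actual measure was still being designed:

    /** Illustrative familiarity score combining SemCor frequency with
        membership of a preferred educational word set. The log damping
        and the boost of 2.0 are arbitrary choices for this sketch. */
    public class Familiarity {
        static double familiarity(int semcorCount, boolean inPreferredSet) {
            double base = Math.log1p(semcorCount);   // damp raw counts
            return inPreferredSet ? base + 2.0 : base;
        }
    }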
Once again, disambiguation by hand is required to create
these lexeme sets, as published word lists contain only word
forms, not specific senses. Fortunately, the sets are typically
fairly small – two or three thousand words.
4.8 Topic hierarchy
As noted earlier, we wanted users to be able to access
information via topics (subclasses of subject matter).
WordNet’s hypernym hierarchy is unsuitable for this
purpose, being a philosophical ontology rather than a
classification of a child’s everyday world into recognisable
categories. However, the Rebus pictorial symbols are
linked to “conceptcode” IDs defined by Widgit, and the
conceptcodes are clustered into topics. Once the WordNet
senses were linked to Widgit conceptcodes, this
automatically connected them both to the pictures and the
Widgit topic sets. The hand-disambiguation between word-senses and pictorial images mentioned earlier was carried out
using these concept-codes, thereby linking this subset of
WordNet senses to the Widgit topic hierarchy.
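Under this arrangement, retrieving the lexemes under a given topic reduces to a pair of joins through the conceptcode table (all names below are hypothetical):

    /** Sketch: lexemes reach topics via their Widgit conceptcodes.
        Table, column and topic names here are hypothetical. */
    public class TopicQuery {
        static final String SQL =
            "SELECT l.word_form FROM lexeme l " +
            "JOIN conceptcode c ON l.conceptcode_id = c.code_id " +
            "JOIN topic t ON c.topic_id = t.topic_id " +
            "WHERE t.name = 'animals'";   // issued as in section 4.1
    }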
5. Distribution
Distribution arrangements, for the full STANDUP system or
for the lexicon module, are not decided, but we intend to
make the software as freely available as possible; details
will be posted on the STANDUP website (see footnote 1).
Some of the annotations may be lodged with the Concept
Coding Framework4. Although Widgit have given
permission for their Rebus pictorial images to be used in the
full STANDUP system, no such arrangement has been made
for the lexicon on its own. However, a few thousand of the
commoner senses in the lexicon do contain connections
from WordNet senses to Widgit symbol identifiers, which means that a researcher who had legitimate access to the
Rebus images could attach them.
6. Conclusions
The development of the STANDUP lexicon is still in
progress at present (February 2006). We have a lexical
database, accessible from a Java API, which systematically
links phonetic, topic and pictorial information to a large
subset of the WordNet senses. It has around 130,000 word-senses,
all with phonetic information, and around 7500 are
linked to “conceptcodes” which allow the attachment
(subject to licensing) of pictorial symbols. This is at the
centre of the STANDUP interactive joke-generation system,
which allows users to browse through available types of
riddles, possible words and phrases, a hierarchy of topics,
and to request the generation of a riddle to meet certain
criteria. Although this is a specialised application, we hope
that the lexical resource will be of wider use.
Acknowledgements
The work reported here was supported by grants
GR/S15402/01 and GR/R83217/01 from the UK
Engineering and Physical Sciences Research Council. We
are grateful for the help of Widgit Software Ltd.