Natural Language Processing
In this practical, we'll run through a basic example of natural language processing.
We're going to parse a poem, looking for proper nouns, and then see if they are geocodable places. Along the way, we'll try out some of the nltk functionality.
First, let's get our poem as text in a Python program. We're going to use The Waste Land by T. S. Eliot. Using your lecture notes, can you download the file and trim it down to the body of the poem, which lies between the end of the list of CONTENTS and the start of the line "Line 415 aetherial] aethereal"?
Print the text, which isn't too long, to confirm this.
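One way to do the trimming is with plain string slicing. The sketch below uses a tiny stand-in string rather than the real downloaded file, and the marker strings are assumptions you should check against the actual text:

```python
def trim_poem(raw):
    """Return the text between the CONTENTS marker and the
    'Line 415' erratum line that follows the poem."""
    # Assumed markers -- verify these against the file you downloaded.
    start_marker = "CONTENTS"
    end_marker = "Line 415 aetherial] aethereal"
    start = raw.find(start_marker) + len(start_marker)
    end = raw.find(end_marker)
    return raw[start:end]

# Tiny stand-in for the real file, just to show the slicing:
sample = "header CONTENTS body of the poem Line 415 aetherial] aethereal tail"
print(trim_poem(sample))
```

In the real file the CONTENTS marker is followed by the list of section titles, so you may need to search past the list itself rather than just past the word; adjust the start position accordingly.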
Next in your program, tokenize the raw text and convert it into an nltk.Text
object. DON'T convert the text to lowercase. We're going to look for
proper nouns, and one way nltk spots these is by capitalisation; this is one of the few times you don't want lowercased text.
At this stage, try running the following analyses from the lecture:
20 most common words;
20 most common word lengths;
All the words over 10 letters long.
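If you want to check your results, the three analyses boil down to frequency counting and filtering. The sketch below uses a short stand-in token list (the opening words of the poem) and collections.Counter, which offers the same most_common() interface as nltk's FreqDist:

```python
from collections import Counter

# Stand-in token list; in the practical, `tokens` comes from
# tokenizing the full poem text.
tokens = ["April", "is", "the", "cruellest", "month", ",", "breeding",
          "Lilacs", "out", "of", "the", "dead", "land", ",", "mixing"]

# 20 most common words (FreqDist has the same most_common method)
word_freq = Counter(tokens)
print(word_freq.most_common(20))

# 20 most common word lengths
length_freq = Counter(len(w) for w in tokens)
print(length_freq.most_common(20))

# All the words over 10 letters long (none in this tiny sample)
long_words = sorted({w for w in tokens if len(w) > 10})
print(long_words)
```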
Next, run part-of-speech tagging across the text. DON'T use the universal tag list; instead just use:
tagged = nltk.pos_tag(text)
or equivalent. This will generate the fuller tag list, including NNP
, the tag for proper nouns. Again, print the tag list to check it. You'll notice that some
of the so-called proper nouns are no such thing. That's fine; it is usual to have a scattering of false positives – we'll
filter these out in a bit.
Next, pull out all those terms with the tag NNP
. There are a variety of ways you could do this: the chunking-grammar method in
the lecture notes is probably the easiest route, but you may be able to think of others; your tagged list is a list of tuples of word and tag.
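One of those other routes is a plain list comprehension over the tagged tuples. The tagged list below is a hand-made stand-in for what pos_tag returns (in the practical, you'd use your real tagged output); note the tuples are (word, tag), in that order:

```python
# Hand-made stand-in for the output of nltk.pos_tag on a line
# of the poem ("Unreal City, / Under the brown fog of a winter dawn").
tagged = [("Unreal", "NNP"), ("City", "NNP"), (",", ","),
          ("Under", "IN"), ("the", "DT"), ("brown", "JJ"),
          ("fog", "NN"), ("of", "IN"), ("a", "DT"),
          ("winter", "NN"), ("dawn", "NN")]

# Keep only the words whose tag is NNP.
proper_nouns = [word for word, tag in tagged if tag == "NNP"]
print(proper_nouns)  # ['Unreal', 'City']
```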
Finally, whatever shape you've currently got the list in, get it in a list called proper_nouns
. If you have your list in a subtree like in the lecture, you can convert this to
a string by casting it, and then chop out the relevant text thus:
for sentence in sentences:
    tree = cp.parse(sentence)
    for subtree in tree.subtrees():
        if subtree.label() == 'ProperNouns':
            st = str(subtree)     # e.g. "(ProperNouns London/NNP)"
            slash = st.find("/")
            st = st[13:slash]     # 13 = len("(ProperNouns ")
            proper_nouns.append(st)
This is a good time to also filter out some of the more dubious proper nouns (you might like to use the string methods isupper() and isalpha() to identify these).
Once you've got the proper nouns out and in a list of some sort, go on to the next step, where we'll see which ones are places by geocoding them.