Cornell notes : Natural Language Processing with Python - Chapter 2

Source : Natural Language Processing with Python – Analyzing Text with the Natural Language Toolkit by Steven Bird, Ewan Klein, and Edward Loper

Question Answer
What is a corpus ? a large body of linguistic data, a large structured collection of texts (plural : corpora)
Get the list of texts from a corpus
corpus.fileids()
Access a default corpus in nltk
        from nltk.corpus import gutenberg
        gutenberg.words('austen-emma.txt')
        
Get the raw content of the file
gutenberg.raw(fileid)
Divide the text into its sentences
gutenberg.sents('shakespeare-macbeth.txt')
Access the default web texts in nltk
from nltk.corpus import webtext
Access default chat conversations in nltk
        from nltk.corpus import nps_chat
        chatroom = nps_chat.posts('10-19-20s_706posts.xml')
        
What is stylistics ? the study of systematic differences between genres. Word counts might distinguish genres : the most frequent modal in the news genre is "will", while the most frequent modal in the romance genre is "could"
Brown corpus a convenient resource for studying systematic differences between genres
from nltk.corpus import brown
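The modal-counting idea can be sketched without the Brown data. This is a plain-Python illustration on a few invented sentences; with NLTK one would feed `brown.words(categories=genre)` into the same loop. The names `toy_corpus` and `modal_counts` are hypothetical.

```python
from collections import Counter

# Invented toy data standing in for brown.words(categories=genre).
toy_corpus = {
    "news": "the council will meet and the mayor will speak but may leave".split(),
    "romance": "she could not stay and he could not ask what might happen".split(),
}
modals = ["can", "could", "may", "might", "must", "will"]

def modal_counts(corpus, modals):
    """Count how often each modal verb occurs in each genre."""
    counts = {}
    for genre, words in corpus.items():
        freq = Counter(w.lower() for w in words)
        counts[genre] = {m: freq[m] for m in modals}
    return counts

counts = modal_counts(toy_corpus, modals)
for genre, row in counts.items():
    top = max(row, key=row.get)  # modal with the highest count in this genre
    print(genre, row, "most frequent:", top)
```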
Reuters Corpus for training and testing algorithms that automatically detect the topic of a document, categories in the Reuters corpus overlap with each other
from nltk.corpus import reuters
Inaugural Address Corpus temporal corpus : represent language use over time
from nltk.corpus import inaugural
Get the list of NLTK corpora http://nltk.org/data
Universal Declaration of Human Rights available in over 300 languages
from nltk.corpus import udhr
Get the categories of the corpus
corpus.categories()
Get the words of the whole corpus
corpus.words()
Loading your own Corpus
        from nltk.corpus import PlaintextCorpusReader
        corpus_root = '/usr/share/dict'
        wordlists = PlaintextCorpusReader(corpus_root, '.*')

        # OR

        from nltk.corpus import BracketParseCorpusReader
        corpus_root = r"C:\corpora\penntreebank\parsed\mrg\wsj"
        file_pattern = r".*/wsj_.*\.mrg"
        ptb = BracketParseCorpusReader(corpus_root, file_pattern)
        
What is a conditional frequency distribution ? A collection of frequency distributions, each one for a different "condition". A conditional frequency distribution needs to pair each event with a condition.
        # the pairs look like : [('news', 'The'), ('news', 'Fulton'), ('news', 'County'), ...]
        import nltk
        from nltk.corpus import brown
        cfd = nltk.ConditionalFreqDist(
            (genre, word)
            for genre in brown.categories()
            for word in brown.words(categories=genre))
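To see what the structure holds, here is a minimal plain-Python sketch of a conditional frequency distribution built from (condition, sample) pairs. It uses `collections` rather than NLTK, so it only mirrors the idea, not the library's implementation; `conditional_freq_dist` is a hypothetical helper name and the pairs are hand-written.

```python
from collections import Counter, defaultdict

def conditional_freq_dist(pairs):
    """Map each condition to a Counter of the samples seen with it."""
    cfd = defaultdict(Counter)
    for condition, sample in pairs:
        cfd[condition][sample] += 1
    return cfd

# Hand-written pairs in the (genre, word) shape.
pairs = [
    ("news", "The"), ("news", "Fulton"), ("news", "The"),
    ("romance", "They"), ("romance", "neither"), ("romance", "They"),
]
cfd = conditional_freq_dist(pairs)
print(sorted(cfd))                    # the conditions
print(cfd["news"]["The"])             # frequency of a sample under a condition
print(cfd["romance"].most_common(1))  # most frequent sample for a condition
```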
        
Build a table out of conditional frequency distributions
        # here cfd maps each udhr language to the lengths of its words
        cfd.tabulate(conditions=['English', 'German_Deutsch'],
                     samples=range(10), cumulative=True)
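With `cumulative=True`, each column shows the count of all samples up to and including that one. A plain-Python sketch of one such row, assuming word lengths as samples; the sentence is invented rather than taken from the udhr corpus, and `cumulative_row` is a hypothetical helper.

```python
from collections import Counter
from itertools import accumulate

def cumulative_row(words, samples):
    """Cumulative counts of word lengths over the given sample lengths."""
    freq = Counter(len(w) for w in words)
    return list(accumulate(freq[s] for s in samples))

# Invented sentence, not read from the udhr corpus.
words = "all human beings are born free and equal".split()
row = cumulative_row(words, range(10))
print(row)  # one value per word length 0..9
```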
What is a bigram ? word pair
Build a list of consecutive word pairs
list(nltk.bigrams(words))
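The same pairs can be produced in plain Python by zipping the list with itself shifted by one. This is only a sketch of what the call returns; the function below is a stand-in, not NLTK's code.

```python
def bigrams(words):
    """Return consecutive word pairs as a list of tuples."""
    return list(zip(words, words[1:]))

pairs = bigrams(["In", "the", "beginning", "God"])
print(pairs)
```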
Generate random text
        import nltk

        def generate_model(cfdist, word, num=15):
            for i in range(num):
                print(word, end=' ')
                word = cfdist[word].max()  # most likely word to follow the current word

        text = nltk.corpus.genesis.words('english-kjv.txt')
        bigrams = nltk.bigrams(text)
        cfd = nltk.ConditionalFreqDist(bigrams)
        generate_model(cfd, '<initial word>')
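The same greedy generation can be tried on a tiny text without downloading a corpus. This is a plain-Python sketch under stated assumptions: the sentence is invented, and `build_bigram_cfd` and `generate` are hypothetical helpers built on `collections`, not NLTK.

```python
from collections import Counter, defaultdict

def build_bigram_cfd(words):
    """Map each word to a Counter of the words that follow it."""
    cfd = defaultdict(Counter)
    for first, second in zip(words, words[1:]):
        cfd[first][second] += 1
    return cfd

def generate(cfd, word, num=5):
    """Repeatedly emit the current word, then jump to its most frequent follower."""
    out = []
    for _ in range(num):
        out.append(word)
        followers = cfd.get(word)
        if not followers:  # the word never appears before another word
            break
        word = followers.most_common(1)[0][0]
    return out

# Invented toy text.
text = "the cat sat on the mat and the cat sat on the rug".split()
cfd = build_bigram_cfd(text)
generated = generate(cfd, "the", num=6)
print(" ".join(generated))
```

Always picking the most likely follower quickly gets stuck in a loop, which is visible even on this toy text.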
        
Create a conditional frequency distribution from a list of pairs
cfdist = ConditionalFreqDist(pairs)
Get the conditions from the CFD
cfdist.conditions()
Get the frequency distribution for a given condition
cfdist[condition]
Get the frequency for the given sample for this condition
cfdist[condition][sample]
Get the tabulation limited to the specified samples and conditions
cfdist.tabulate(samples=samples, conditions=conditions)
Generate a graphical plot of the conditional frequency distribution limited to the specified samples and conditions
cfdist.plot(samples=samples, conditions=conditions)
Check if samples in cfdist1 occur less frequently than in cfdist2
cfdist1 < cfdist2
What is a lexicon ? a lexical resource, a collection of words and/or phrases along with associated information such as part of speech (lexical category) and sense definitions (gloss)
What is a lexical entry ? a headword (also known as a lemma) along with additional information such as the part of speech and the sense definition
What are homonyms ? Two distinct words having the same spelling
The Words Corpus a plain wordlist (the Unix /usr/share/dict/words file) ; use it to find unusual or misspelled words in a text corpus
nltk.corpus.words.words()
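Finding unusual words amounts to a set difference between the text's vocabulary and the wordlist. A minimal sketch, assuming a tiny invented wordlist in place of `nltk.corpus.words.words()`; `unusual_words` is an illustrative helper.

```python
def unusual_words(text_words, vocabulary):
    """Return the alphabetic words of the text that are missing from the vocabulary."""
    text_vocab = {w.lower() for w in text_words if w.isalpha()}
    known = {w.lower() for w in vocabulary}
    return sorted(text_vocab - known)

# Tiny invented wordlist standing in for nltk.corpus.words.words().
vocabulary = ["the", "quick", "brown", "fox", "jumps"]
text_words = ["The", "quikc", "brown", "fox", "jumps", ",", "zzxy"]
result = unusual_words(text_words, vocabulary)
print(result)
```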
Stopwords corpus of high-frequency words
from nltk.corpus import stopwords
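A typical use is to measure what fraction of a text is not made of stopwords. A sketch with an invented stopword list; with NLTK the list would come from `stopwords.words('english')`. `content_fraction` is a hypothetical helper name.

```python
def content_fraction(words, stopword_list):
    """Fraction of the words that are not in the stopword list."""
    stop = set(stopword_list)
    content = [w for w in words if w.lower() not in stop]
    return len(content) / len(words)

# Invented short list standing in for stopwords.words('english').
stopword_list = ["the", "of", "and", "a", "to", "in"]
words = "the history of the language and a guide to grammar".split()
fraction = content_fraction(words, stopword_list)
print(fraction)
```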
Names corpus corpus of 8,000 first names categorized by gender
names = nltk.corpus.names
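What is a phone ? a phonetic code, i.e. a distinct label for a contrastive sound
CMU Pronouncing Dictionary a pronouncing dictionary for US English : each entry pairs a word with its list of phones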
entries = nltk.corpus.cmudict.entries()
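Each entry is a (word, phones) pair, and a word with several pronunciations appears several times. The sketch below works on five entries copied by hand in that shape (the ones the book prints for "fir", "fire" and "firearm"), so it runs without the corpus; the variable names are illustrative.

```python
from collections import defaultdict

# Hand-copied sample in the (word, [phones]) shape of cmudict entries.
entries = [
    ("fir", ["F", "ER1"]),
    ("fire", ["F", "AY1", "ER0"]),
    ("fire", ["F", "AY1", "R"]),
    ("firearm", ["F", "AY1", "ER0", "AA2", "R", "M"]),
    ("firearm", ["F", "AY1", "R", "AA2", "R", "M"]),
]

# Group the alternative pronunciations under each word.
prons = defaultdict(list)
for word, phones in entries:
    prons[word].append(phones)
print(len(prons["fire"]))  # number of pronunciations listed for "fire"

# Words whose pronunciation starts with a given phone sequence.
starts = sorted({word for word, phones in entries if phones[:2] == ["F", "AY1"]})
print(starts)
```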
Swadesh wordlists comparative wordlist : 200 common words in several languages
from nltk.corpus import swadesh
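Pairs of cognate words turn directly into a small translation dictionary. A sketch with three hand-written pairs; with NLTK the pairs would come from calls such as `swadesh.entries(['fr', 'en'])`.

```python
# Hand-written pairs in the shape of swadesh.entries([...]) results.
fr2en = [("chien", "dog"), ("jeter", "throw")]
de2en = [("Hund", "dog")]

translate = dict(fr2en)          # French -> English
translate.update(dict(de2en))    # add German -> English
print(translate["chien"], translate["Hund"])
```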
Toolbox also called Shoebox, a collection of entries, where each entry is made up of one or more fields
from nltk.corpus import toolbox
WordNet a semantically-oriented dictionary of English
from nltk.corpus import wordnet as wn
What is a synset ? synonym set, a collection of synonymous words (or "lemmas")
wn.synsets('motorcar')
Get a list of synonyms for a given word
wn.synset('car.n.01').lemma_names()
Get a synset's verbose definition
synset.definition()
Get a synset's example sentences
synset.examples()
What is a lemma ? pairing of a synset with a word
Get all lemmas for a given word
wn.lemmas(word)
What are root synsets ? unique beginners, very general concepts
What is a hyponym ? a more specific concept, one level down in the hierarchy
synset.hyponyms()
What is a hypernym ? a more general concept, one level up in the hierarchy
        synset.hypernyms()
        synset.root_hypernyms() # the most general hypernyms
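Hyponyms and hypernyms are inverse relations over the same hierarchy. The sketch below uses an invented, single-parent toy hierarchy (real WordNet synsets can have several hypernyms) to show both directions and the path up to the root; all names are hypothetical.

```python
from collections import defaultdict

# Invented toy hierarchy: concept -> its hypernym.
hypernym_of = {
    "car": "motor_vehicle",
    "truck": "motor_vehicle",
    "motor_vehicle": "vehicle",
    "vehicle": "entity",
}

# Hyponyms are obtained by inverting the hypernym relation.
hyponyms_of = defaultdict(list)
for child, parent in hypernym_of.items():
    hyponyms_of[parent].append(child)

def hypernym_path(concept):
    """Walk up from a concept to the root of the toy hierarchy."""
    path = [concept]
    while path[-1] in hypernym_of:
        path.append(hypernym_of[path[-1]])
    return path

path = hypernym_path("car")
print(path)
print(sorted(hyponyms_of["motor_vehicle"]))
```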
        
What are lexical relations ? hypernyms and hyponyms, because they relate one synset to another
What are meronyms ? components of an item
        synset.part_meronyms() # tree => trunk, limb etc.
        synset.substance_meronyms() # tree => heartwood, sapwood
        
What are holonyms ? items of a component (things they are contained in)
synset.member_holonyms() # tree => forest
What is an entailment ? a relationship between verbs : walking entails stepping
synset.entailments()
Get the antonyms of a lemma (antonymy relates lemmas, not synsets)
wn.lemma('supply.n.02.supply').antonyms()
Get the lexical relations of a synset
dir(synset)
What is semantic similarity ? If two synsets share a very specific hypernym they must be closely related
synset1.lowest_common_hypernyms(synset2)
Get the hierarchical depth of a synset
synset.min_depth()
How to calculate a semantic similarity score ?
synset1.path_similarity(synset2) # score between 0 and 1, 1 if identical ; the book says -1 if no path, recent NLTK versions return None
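The path-based score is 1 / (shortest path length + 1), so closer concepts score higher. Below is a minimal sketch on an invented, single-parent toy hierarchy. It is not WordNet data and not NLTK's implementation, and every name in it is hypothetical.

```python
# Invented toy hierarchy: concept -> its hypernym ("entity" is the root).
hypernym_of = {
    "animal": "entity",
    "novel": "entity",
    "whale": "animal",
    "tortoise": "animal",
    "right_whale": "whale",
    "orca": "whale",
}

def path_to_root(concept):
    """List the concept followed by all of its hypernyms up to the root."""
    path = [concept]
    while path[-1] in hypernym_of:
        path.append(hypernym_of[path[-1]])
    return path

def depth(concept):
    """Number of steps from the concept up to the root."""
    return len(path_to_root(concept)) - 1

def lowest_common_hypernym(a, b):
    """First ancestor of a (starting at a itself) that is also an ancestor of b."""
    ancestors_b = set(path_to_root(b))
    for node in path_to_root(a):
        if node in ancestors_b:
            return node
    return None

def path_similarity(a, b):
    """1 / (steps between a and b through their lowest common hypernym + 1)."""
    lch = lowest_common_hypernym(a, b)
    if lch is None:
        return None
    distance = path_to_root(a).index(lch) + path_to_root(b).index(lch)
    return 1 / (distance + 1)

for other in ["right_whale", "orca", "tortoise", "novel"]:
    print(other, lowest_common_hypernym("right_whale", other),
          path_similarity("right_whale", other))
```

The more specific the shared hypernym (the deeper it sits), the shorter the path between the two concepts and the higher the score.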

I am Basile, a young software craftsman documenting his entrepreneurship journey. If you liked this article, you can follow my adventures in real time on Twitter. I'm always looking forward to meeting new people and learning from others !

My personal website : basilesamel.com