Related Readings
Concepts
- Token: A token is the technical name for a sequence of characters that we want to treat as a group.
- Word type: A word type is the form or spelling of the word independently of its specific occurrence in a text.
- What does this really mean? In "to be or not to be" there are six tokens but only four word types (to, be, or, not).
- Lexical diversity
- Frequency distribution
- Collocation: A collocation is a sequence of words that occur together unusually often.
- Lexical Resource
- lexical entry
- head word / lemma
- part of speech / lexical category
- sense definition / gloss
- WordNet
- senses and synonyms
- the wordnet hierarchy
- hyponyms
- hypernyms
- meronyms
- holonyms
- antonymy
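
To make the first few concepts concrete, here is a minimal sketch in plain Python (no NLTK needed). The sample sentence and variable names are made up for illustration, and the repeated-bigram count is only a rough stand-in for collocation detection, which would compare pair counts against what chance predicts.

```python
from collections import Counter

# Illustrative sample text (an assumption, not taken from any corpus).
text = "the cat sat on the mat and the dog sat on the log"

# Tokens: every occurrence counts. A whitespace split is enough for this toy text.
tokens = text.split()

# Word types: distinct spellings, regardless of how often each occurs.
types = set(tokens)

# Lexical diversity: number of types divided by number of tokens.
lexical_diversity = len(types) / len(tokens)

# Frequency distribution: how many times each type occurs.
freq = Counter(tokens)

# Adjacent word pairs; pairs that recur are only *candidates* for collocations.
bigrams = Counter(zip(tokens, tokens[1:]))
repeated_pairs = [pair for pair, count in bigrams.items() if count > 1]

print(len(tokens), len(types), round(lexical_diversity, 2))
print(freq.most_common(1))
print(repeated_pairs)
```

Here the text has 13 tokens but only 8 word types, because "the", "sat", and "on" repeat.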

- Universal Part-of-Speech Tagset

- Normalizing text
- Stemming and lemmatization
- Stemming usually refers to a crude heuristic process that chops off the ends of words in the hope of reducing related word forms to a common base form correctly most of the time, and often includes the removal of derivational affixes.
- Note that stemming is not a well-defined process, and we typically pick the stemmer that best suits the application we have in mind.
- Lemmatization usually refers to doing things properly with the use of a vocabulary and morphological analysis of words, normally aiming to remove inflectional endings only and to return the base or dictionary form of a word, which is known as the lemma.
- For example, if confronted with the token "saw", stemming might return just "s", whereas lemmatization would attempt to return either "see" or "saw" depending on whether the use of the token was as a verb or a noun.
- The two may also differ in that stemming most commonly collapses derivationally related words, whereas lemmatization commonly only collapses the different inflectional forms of a lemma.
- Stemmers
- The Porter stemmer
- The Lancaster stemmer
- Tokenizing text
- No single tokenization method works for every text, so for some inputs we may need to write our own tokenizer.
- Sentence segmentation
- Word segmentation
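
The contrast between stemming and lemmatization, and the idea of writing our own tokenizer, can be sketched with toy code. Everything below is hypothetical and deliberately simplistic: `tokenize`, `crude_stem`, `toy_lemmatize`, the suffix list, and the tiny lemma table are invented for illustration and are not how the Porter or Lancaster stemmers or any real lemmatizer work.

```python
import re

def tokenize(text):
    """Hand-written tokenizer: words (with internal apostrophes) or single punctuation marks."""
    return re.findall(r"\w+(?:'\w+)*|[^\w\s]", text.lower())

# Hypothetical, deliberately crude suffix list, checked in this order.
SUFFIXES = ("ing", "ed", "es", "s")

def crude_stem(token):
    """Chop off the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token

# Hypothetical mini-vocabulary keyed by (token, part of speech).
LEMMAS = {
    ("saw", "VERB"): "see",
    ("saw", "NOUN"): "saw",
    ("walked", "VERB"): "walk",
    ("cats", "NOUN"): "cat",
}

def toy_lemmatize(token, pos):
    """Look up the dictionary form; fall back to the token itself if unknown."""
    return LEMMAS.get((token, pos), token)

tokens = tokenize("She saw the cats. They walked home.")
print(tokens)
print([crude_stem(t) for t in tokens])
print(toy_lemmatize("saw", "VERB"), toy_lemmatize("saw", "NOUN"))
print(crude_stem("this"))  # the heuristic also mangles words it should leave alone
```

The stemmer needs no vocabulary but makes mistakes such as turning "this" into "thi", while the lemmatizer is only as good as its vocabulary and the part-of-speech information it is given.
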
List of Common Methods
- concordance
- similar
- common_contexts
- dispersion_plot
- FreqDist
- collocations
- from nltk.corpus import PlaintextCorpusReader
- ConditionalFreqDist.plot()
- ConditionalFreqDist.tabulate()
- nltk.corpus.words.words()
- nltk.corpus.stopwords.words("english")
- from nltk import word_tokenize
- nltk.Index()
- nltk.WordNetLemmatizer()
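
A few of these can be tried without downloading any corpus data. The sketch below assumes NLTK is installed and uses a made-up word list; it also includes the Porter and Lancaster stemmers from the concepts above. Items such as `word_tokenize`, the `stopwords` and `words` corpora, and `WordNetLemmatizer` are left out because they need data fetched with `nltk.download()` first.

```python
import nltk
from nltk.stem import LancasterStemmer, PorterStemmer

# Illustrative word list (an assumption, not from an NLTK corpus).
words = "the quick brown fox jumps over the lazy dog and the sleepy cat".split()

# FreqDist counts how often each sample occurs.
fdist = nltk.FreqDist(words)
print(fdist["the"], fdist.most_common(1))

# ConditionalFreqDist takes (condition, sample) pairs; here the condition is word length.
cfd = nltk.ConditionalFreqDist((len(w), w) for w in words)
cfd.tabulate(conditions=[3, 5], samples=["the", "fox", "quick", "brown"])

# nltk.Index maps each key to the list of values seen with that key.
by_initial = nltk.Index((w[0], w) for w in words)
print(by_initial["t"])

# The two stemmers use different rule sets, so their outputs can differ.
porter, lancaster = PorterStemmer(), LancasterStemmer()
for w in ["running", "cats", "maximum"]:
    print(w, porter.stem(w), lancaster.stem(w))
```

`ConditionalFreqDist.plot()` takes the same `conditions` and `samples` keywords as `tabulate()` but additionally requires matplotlib.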