To generate unigrams, bigrams, trigrams, or n-grams of any order, you can use Python's Natural Language Toolkit (NLTK), which makes it easy.

First steps

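Run this script once to download the Punkt tokenizer models, which word_tokenize relies on. Recent NLTK releases may ask for a different resource name (such as 'punkt_tab'); if word_tokenize raises a LookupError, download the resource named in the error message.

import nltk
nltk.download('punkt')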

Unigrams, bigrams and trigrams

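These functions give you the unigrams, bigrams, and trigrams of a sentence. Note that bigrams and trigrams return generators, so wrap them in list() to get actual lists:

from nltk import word_tokenize
from nltk import bigrams, trigrams

unigrams = word_tokenize("The quick brown fox jumps over the lazy dog")
# bigrams() and trigrams() return generators; list() materializes them.
# Separate names avoid shadowing the imported functions.
bigram_list = list(bigrams(unigrams))
trigram_list = list(trigrams(unigrams))

Hint for unigrams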

For simple unigrams, you can also split the string on whitespace with str.split().
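
As a rough illustration of that shortcut, the sketch below uses only Python's built-in str.split(). The second string is a made-up example, not part of the sentence used elsewhere in this post, and shows where the shortcut is weaker: punctuation stays attached to the word, whereas a tokenizer such as word_tokenize would normally split it off.

```python
sentence = "The quick brown fox jumps over the lazy dog"

# With no arguments, str.split() splits on any run of whitespace.
unigrams = sentence.split()
print(unigrams)

# Hypothetical second example: punctuation stays glued to the word.
punctuated = "Hello, world!".split()
print(punctuated)
```

For plain text without punctuation the two approaches give the same tokens, so the split is a reasonable choice when you want to avoid the tokenizer download.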

n-grams

To generate 4-grams (n = 4):

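from nltk import word_tokenize
from nltk.util import ngrams

unigrams = word_tokenize("The quick brown fox jumps over the lazy dog")
# A Python name cannot start with a digit, so the result is called four_grams.
# ngrams() returns a generator; list() materializes it.
four_grams = list(ngrams(unigrams, 4))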
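
To make it clearer what a list of n-grams contains, here is a minimal pure-Python sketch of the sliding-window idea. It is not NLTK's implementation: sliding_ngrams is a hypothetical helper written for this illustration, and it splits on whitespace instead of calling word_tokenize so that it runs without NLTK installed.

```python
def sliding_ngrams(tokens, n):
    """Return all n-grams of `tokens` as a list of tuples (illustrative helper)."""
    # Build n staggered views of the token list and zip them together.
    # zip stops at the shortest view, so only complete windows are produced.
    return list(zip(*(tokens[i:] for i in range(n))))


tokens = "The quick brown fox jumps over the lazy dog".split()
four_grams = sliding_ngrams(tokens, 4)

print(len(four_grams))  # 9 tokens give 9 - 4 + 1 = 6 four-grams
print(four_grams[0])    # ('The', 'quick', 'brown', 'fox')
```

A sequence of k tokens yields k - n + 1 n-grams, which is why the nine-word sentence produces six 4-grams.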

n-grams in a range

To generate n-grams of every order from m to n, use the function everygrams. Here m = 2 and n = 6, so it will generate 2-grams, 3-grams, 4-grams, 5-grams, and 6-grams.

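from nltk import word_tokenize
from nltk.util import everygrams

unigrams = word_tokenize("The quick brown fox jumps over the lazy dog")
# A Python name cannot start with a digit, so the result is called grams_2_to_6.
# everygrams() returns a generator; list() materializes it.
grams_2_to_6 = list(everygrams(unigrams, min_len=2, max_len=6))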
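
As a sanity check on how many n-grams a range produces, the following pure-Python sketch collects every order from 2 to 6 for the same sentence. It is only an approximation of the idea: ngram_range is a hypothetical helper, the tokens come from a whitespace split, and the order of the results may differ from what NLTK returns.

```python
def ngram_range(tokens, min_len, max_len):
    """Collect n-grams of every order from min_len to max_len (illustrative helper)."""
    result = []
    for n in range(min_len, max_len + 1):
        # Same sliding-window trick for each order n.
        result.extend(zip(*(tokens[i:] for i in range(n))))
    return result


tokens = "The quick brown fox jumps over the lazy dog".split()
grams = ngram_range(tokens, 2, 6)

print(len(grams))  # 8 + 7 + 6 + 5 + 4 = 30
```

Because the output grows quickly with the width of the range, it can be worth keeping the result as a generator when the input text is long.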

More information about NLTK can be found in its documentation.
