Generate Unigrams, Bigrams, Trigrams, n-grams etc in Python
To generate unigrams, bigrams, trigrams or n-grams, you can use Python's Natural Language Toolkit (NLTK), which makes this straightforward.
First steps
Run this script once to download the Punkt tokenizer models, which word_tokenize relies on:
import nltk
nltk.download('punkt')
Unigrams, bigrams and trigrams
These functions return the lists for each:
from nltk import word_tokenize, bigrams, trigrams
unigrams = word_tokenize("The quick brown fox jumps over the lazy dog")
# bigrams() and trigrams() return generators, so wrap them in list();
# use new variable names to avoid shadowing the imported functions
bigram_list = list(bigrams(unigrams))
trigram_list = list(trigrams(unigrams))
Hint for unigrams
For simple unigrams you can also just split the string on whitespace.
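For instance, a quick split-based approach needs no NLTK at all, though punctuation stays attached to the words:

```python
# Simple unigrams by splitting on whitespace.
# Note: this does NOT separate punctuation (e.g. "dog." stays one token).
sentence = "The quick brown fox jumps over the lazy dog"
unigrams = sentence.split()
print(unigrams)
# ['The', 'quick', 'brown', 'fox', 'jumps', 'over', 'the', 'lazy', 'dog']
```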
n-grams
To generate 4-grams (n = 4):
from nltk import word_tokenize, ngrams
unigrams = word_tokenize("The quick brown fox jumps over the lazy dog")
four_grams = list(ngrams(unigrams, 4))
n-grams in a range
To generate all n-grams from order m up to order n, use the method everygrams:
Here m=2 and n=6, so it generates 2-grams, 3-grams, 4-grams, 5-grams and 6-grams.
from nltk import word_tokenize, everygrams
unigrams = word_tokenize("The quick brown fox jumps over the lazy dog")
grams_2_to_6 = list(everygrams(unigrams, 2, 6))
Real-World Applications
N-grams are used in many NLP tasks:
- Language Detection: Identify the language of a text based on character or word n-gram patterns
- Spell Checking & Correction: Rank candidate corrections by how likely they are given the surrounding words
- Machine Translation: Predict probability of word sequences
- Sentiment Analysis: Use n-grams as features for classification
- Text Prediction & Auto-complete: Suggest next words based on bigrams or trigrams
- Plagiarism Detection: Compare n-gram patterns between documents
- Information Retrieval: Index and search documents using n-gram frequencies
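To give a taste of the character-level n-grams used for language detection, here is a minimal sketch in pure Python (the helper name char_ngrams is my own; real detectors compare these frequency profiles against per-language reference profiles):

```python
from collections import Counter

def char_ngrams(text, n):
    """Slide a window of size n over the raw string, one step at a time."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]

# Languages differ in their characteristic character trigrams
# (e.g. "the" and "ing" are very frequent in English).
trigram_freq = Counter(char_ngrams("the quick brown fox", 3))
print(trigram_freq.most_common(3))
```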
Working with N-gram Output
Let’s look at practical examples with actual output:
from nltk import word_tokenize, bigrams, trigrams
from collections import Counter
text = "The quick brown fox jumps over the lazy dog. The lazy dog was very lazy."
tokens = word_tokenize(text.lower())
# Get bigrams and their frequencies
bigram_list = list(bigrams(tokens))
bigram_freq = Counter(bigram_list)
print("Bigrams:", bigram_list)
print("\nMost common bigrams:")
for bigram, freq in bigram_freq.most_common(5):
    print(f"{bigram}: {freq}")
Output:
Bigrams: [('the', 'quick'), ('quick', 'brown'), ('brown', 'fox'), ...]
Most common bigrams:
('the', 'lazy'): 2
('lazy', 'dog'): 2
('the', 'quick'): 1
...
Custom N-gram Implementation
If you prefer not to use NLTK or need a lighter implementation:
def generate_ngrams(text, n):
    """Generate n-grams from text without NLTK"""
    words = text.lower().split()
    return [tuple(words[i:i+n]) for i in range(len(words)-n+1)]
text = "the quick brown fox jumps over the lazy dog"
# Generate different n-grams
unigrams = generate_ngrams(text, 1)
bigrams = generate_ngrams(text, 2)
trigrams = generate_ngrams(text, 3)
print("Bigrams:", bigrams)
# Output: [('the', 'quick'), ('quick', 'brown'), ('brown', 'fox'), ...]
Handling Edge Cases
When working with n-grams, consider these important points:
1. Punctuation & Tokenization
from nltk import word_tokenize
import string
text = "Hello, world! How are you?"
# Without proper tokenization, punctuation stays
print(text.lower().split()) # ['hello,', 'world!', 'how', ...]
# With NLTK tokenization
tokens = word_tokenize(text.lower())
print(tokens) # ['hello', ',', 'world', '!', 'how', ...]
# Remove punctuation if needed
clean_tokens = [w for w in tokens if w not in string.punctuation]
2. Stopwords
from nltk.corpus import stopwords
stop_words = set(stopwords.words('english'))
filtered_tokens = [w for w in tokens if w not in stop_words]
3. Minimum Frequency Filtering
from collections import Counter
bigram_freq = Counter(bigrams(tokens))
# Keep only bigrams that appear at least 2 times
frequent_bigrams = [bg for bg, count in bigram_freq.items() if count >= 2]
Performance Tips
- For large texts: Use generators instead of lists to save memory
bigrams_gen = bigrams(tokens)  # Returns a generator, not a list
- Use Counter efficiently:
# More efficient for frequency analysis
from collections import Counter
freq = Counter(bigrams(tokens))
- Vectorization with sklearn:
from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer(ngram_range=(1, 2))  # unigrams + bigrams
X = vectorizer.fit_transform([text])
Common Mistakes to Avoid
- Forgetting to tokenize properly: Using simple split() can leave punctuation attached
- Not handling case sensitivity: Always normalize text to lowercase for consistency
- Including stopwords: Often skew results; remove them unless analyzing function words
- Not considering corpus size: N-gram frequencies need sufficient text to be meaningful
- Ignoring low-frequency n-grams: They may be noise rather than patterns
Next Steps
- Explore language models using n-grams for text generation
- Try Markov chains for probabilistic text generation
- Experiment with character-level n-grams for tasks like language detection
- Use TF-IDF weighted n-grams for better feature extraction
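To preview the Markov-chain idea mentioned above, here is a minimal bigram-based text generator (a sketch only; the generate helper is my own, and real language models add smoothing and train on much larger corpora):

```python
import random
from collections import defaultdict

# Build a bigram transition table: word -> list of words that followed it.
corpus = "the quick brown fox jumps over the lazy dog".split()
transitions = defaultdict(list)
for w1, w2 in zip(corpus, corpus[1:]):
    transitions[w1].append(w2)

def generate(start, length):
    """Walk the bigram chain, picking a random successor at each step."""
    words = [start]
    for _ in range(length - 1):
        successors = transitions.get(words[-1])
        if not successors:  # dead end: nothing ever followed this word
            break
        words.append(random.choice(successors))
    return " ".join(words)

print(generate("the", 5))
```

With a corpus this small the chain mostly replays the original sentence; with more text the random walks start producing novel (if ungrammatical) sequences.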
More information about NLTK can be found in the official NLTK documentation.