43. Natural Language Toolkit#

43.1. Introduction to Natural Language Processing#

In this workbook we will cover, at a high level: text tokenization; text normalization (lowercasing, stemming); part-of-speech tagging; named entity recognition; sentiment analysis; topic modeling; and word embeddings.

Please execute these commands before proceeding:

#Download the NLTK models and corpora used in this workbook
import nltk
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
nltk.download('stopwords')
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
True
#Tokenization -- splitting text into word tokens, and paragraphs into sentences
from nltk.tokenize import sent_tokenize 
  
text = "Hello everyone. Welcome to Intro to Machine Learning Applications. We are now learning important basics of NLP."
sent_tokenize(text) 
['Hello everyone.',
 'Welcome to Intro to Machine Learning Applications.',
 'We are now learning important basics of NLP.']
import nltk.data 
  
german_tokenizer = nltk.data.load('tokenizers/punkt/PY3/german.pickle') 
  
text = 'Wie geht es Ihnen? Mir geht es gut.'  #German: 'How are you? I am fine.'
german_tokenizer.tokenize(text) 
from nltk.tokenize import word_tokenize 
  
text = "Hello everyone. Welcome to Intro to Machine Learning Applications. We are now learning important basics of NLP."
word_tokenize(text) 
from nltk.tokenize import TreebankWordTokenizer 
  
tokenizer = TreebankWordTokenizer() 
tokenizer.tokenize(text) 
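If you need full control over what counts as a token, NLTK also provides RegexpTokenizer. A minimal sketch, where the pattern (keep only alphanumeric runs, drop punctuation) is our choice rather than an NLTK default:

from nltk.tokenize import RegexpTokenizer

#Keep runs of word characters; punctuation is dropped entirely
regex_tokenizer = RegexpTokenizer(r'\w+')
regex_tokenizer.tokenize(text)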

43.1.1. n-grams vs tokens#

43.1.1.1. n-grams are contiguous sequences of n items in a sentence. n can be 1, 2, or any positive integer, although in practice we rarely use very large n because long n-grams seldom recur across different texts.#

43.1.1.2. Tokens, by contrast, carry no contiguity condition#

#Using pure Python

import re

def generate_ngrams(text, n):
    # Convert to lowercase
    text = text.lower()
    
    # Replace all non-alphanumeric characters with spaces
    text = re.sub(r'[^a-zA-Z0-9\s]', ' ', text)
    
    # Split the sentence into tokens, dropping empty tokens
    tokens = [token for token in text.split(" ") if token != ""]
    
    # Use zip to generate the n-grams:
    # concatenate the tokens into n-grams and return them
    ngrams = zip(*[tokens[i:] for i in range(n)])
    return [" ".join(ngram) for ngram in ngrams]

text = "Hello everyone. Welcome to Intro to Machine Learning Applications. We are now learning important basics of NLP."
print(text)
generate_ngrams(text, n=2)
#Using NLTK's ngrams utility

import re
from nltk.util import ngrams

text = text.lower()
text = re.sub(r'[^a-zA-Z0-9\s]', ' ', text)
tokens = [token for token in text.split(" ") if token != ""]
output = list(ngrams(tokens, 3))
print(output)
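A common next step is counting how often each n-gram occurs. A small sketch using collections.Counter from the standard library (variable names are ours):

from collections import Counter

#Count each trigram and show the most frequent ones
trigram_counts = Counter(output)
print(trigram_counts.most_common(5))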
#Text Normalization

#Lowercasing
text = "Hello everyone. Welcome to Intro to Machine Learning Applications. We are now learning important basics of NLP."
lowert = text.lower()
uppert = text.upper()

print(lowert)
print(uppert)
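For caseless matching, Python's built-in str.casefold() is a more aggressive lower(): it also folds characters such as the German ß.

#casefold() folds characters that lower() leaves unchanged
print('Straße'.lower())     # straße
print('Straße'.casefold())  # strasse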
#Text Normalization
#Stemming
#The Porter stemmer is a widely used stemming algorithm

from nltk.stem import PorterStemmer 
   
ps = PorterStemmer() 
  
#Choose some words to be stemmed 
words = ["hike", "hikes", "hiked", "hiking", "hikers", "hiker"] 
  
for w in words: 
    print(w, " : ", ps.stem(w)) 
from nltk.stem import PorterStemmer 
import re
   
ps = PorterStemmer() 
text = "Hello everyone. Welcome to Intro to Machine Learning Applications. We are now learning important basics of NLP."
print(text)


#Tokenize the words
text = re.sub(r'[^a-zA-Z0-9\s]', ' ', text)
tokens = [token for token in text.split(" ") if token != ""]

#Stem each token
tokens = [ps.stem(token) for token in tokens]

#Merge all the tokens to form a long text sequence 
text2 = ' '.join(tokens) 

print(text2)
Hello everyone. Welcome to Intro to Machine Learning Applications. We are now learning important basics of NLP.
hello everyon welcom to intro to machin learn applic We are now learn import basic of nlp
from nltk.stem.snowball import SnowballStemmer
import re
   
ss = SnowballStemmer("english")
text = "Hello everyone. Welcome to Intro to Machine Learning Applications. We are now learning important basics of NLP."
print(text)


#Tokenize the words
text = re.sub(r'[^a-zA-Z0-9\s]', ' ', text)
tokens = [token for token in text.split(" ") if token != ""]

#Stem each token
tokens = [ss.stem(token) for token in tokens]

#Merge all the tokens to form a long text sequence 
text2 = ' '.join(tokens) 

print(text2)
Hello everyone. Welcome to Intro to Machine Learning Applications. We are now learning important basics of NLP.
hello everyon welcom to intro to machin learn applic we are now learn import basic of nlp
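Note that stemming can produce non-words such as 'everyon' and 'applic'. Lemmatization maps words to dictionary forms instead; a minimal sketch with NLTK's WordNetLemmatizer, which needs the 'wordnet' resource downloaded first:

import nltk
from nltk.stem import WordNetLemmatizer

nltk.download('wordnet')  #one-time download of the WordNet data

wnl = WordNetLemmatizer()

#Without a POS hint the lemmatizer assumes nouns, so pass pos='v' for verbs
for w in ["hikes", "hiking", "hiked"]:
    print(w, " : ", wnl.lemmatize(w, pos='v'))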
#Stopwords removal 

from nltk.corpus import stopwords 
from nltk.tokenize import word_tokenize 

text = "Hello everyone. Welcome to Intro to Machine Learning Applications. We are now learning important basics of NLP."

stop_words = set(stopwords.words('english')) 
word_tokens = word_tokenize(text) 
  
#Keep only the tokens that are not stopwords
#(note: the NLTK stopword list is lowercase, so capitalized words such as 'We' pass through)
filtered_sentence = [w for w in word_tokens if w not in stop_words] 
  
print(word_tokens) 
print(filtered_sentence) 

text2 = ' '.join(filtered_sentence)
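If capitalized stopwords such as 'We' should also be removed, compare in lowercase; a one-line variant under that assumption:

#Case-insensitive stopword filtering: compare the lowercased token
filtered_ci = [w for w in word_tokens if w.lower() not in stop_words]
print(filtered_ci)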
#Part-of-Speech tagging

from nltk.tokenize import word_tokenize
from nltk.tag import pos_tag

text = 'GitHub is a development platform inspired by the way you work. From open source to business, you can host and review code, manage projects, and build software alongside 40 million developers.'

def preprocess(sent):
    #Tokenize, then tag each token with its part of speech
    sent = word_tokenize(sent)
    sent = pos_tag(sent)
    return sent

sent = preprocess(text)
print(sent)
[('GitHub', 'NNP'), ('is', 'VBZ'), ('a', 'DT'), ('development', 'NN'), ('platform', 'NN'), ('inspired', 'VBN'), ('by', 'IN'), ('the', 'DT'), ('way', 'NN'), ('you', 'PRP'), ('work', 'VBP'), ('.', '.'), ('From', 'IN'), ('open', 'JJ'), ('source', 'NN'), ('to', 'TO'), ('business', 'NN'), (',', ','), ('you', 'PRP'), ('can', 'MD'), ('host', 'VB'), ('and', 'CC'), ('review', 'VB'), ('code', 'NN'), (',', ','), ('manage', 'NN'), ('projects', 'NNS'), (',', ','), ('and', 'CC'), ('build', 'VB'), ('software', 'NN'), ('alongside', 'RB'), ('40', 'CD'), ('million', 'CD'), ('developers', 'NNS'), ('.', '.')]
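The tags above come from the Penn Treebank tagset. NLTK can print the definition of any tag (this requires the 'tagsets' resource):

import nltk

nltk.download('tagsets')       #one-time download of the tagset documentation
nltk.help.upenn_tagset('NNP')  #prints: noun, proper, singular ...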
#Named entity recognition

#spaCy is an NLP framework that is easy to use and supports neural network models
#If the model below is missing, install it first with: python -m spacy download en_core_web_sm

import en_core_web_sm
nlp = en_core_web_sm.load()

text = 'GitHub is a development platform inspired by the way you work. From open source to business, you can host and review code, manage projects, and build software alongside 40 million developers.'

doc = nlp(text)
print(doc.ents)
print([(X.text, X.label_) for X in doc.ents])
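spaCy also exposes entities token by token through IOB tags (B = beginning of an entity, I = inside, O = outside):

#Per-token view: each token's IOB flag and entity type
print([(X.text, X.ent_iob_, X.ent_type_) for X in doc])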
#Sentiment analysis
#Topic modeling
#Word embeddings
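These last three topics are left as placeholders here, but as a taste of sentiment analysis, here is a minimal sketch using NLTK's VADER analyzer; it needs the 'vader_lexicon' resource downloaded first:

import nltk
from nltk.sentiment.vader import SentimentIntensityAnalyzer

nltk.download('vader_lexicon')  #one-time download of the VADER lexicon

sia = SentimentIntensityAnalyzer()

#polarity_scores returns neg/neu/pos proportions and a normalized compound score
print(sia.polarity_scores("We are now learning important basics of NLP."))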

#Class exercise

43.1.1.3. 1. Read a file from its URL#

43.1.1.4. 2. Extract the text and tokenize it meaningfully into words.#

43.1.1.6. 4. Perform stemming using both the Porter and Snowball stemmers. Which one works better? Why?#

43.1.1.7. 5. Remove stopwords#

43.1.1.8. 6. Identify the top-10 unigrams based on their frequency.#

#Load the file first
!wget https://www.dropbox.com/s/o8lxi6yrezmt5em/reviews.txt
--2019-11-04 17:16:22--  https://www.dropbox.com/s/o8lxi6yrezmt5em/reviews.txt
Resolving www.dropbox.com (www.dropbox.com)... 162.125.9.1, 2620:100:601b:1::a27d:801
Connecting to www.dropbox.com (www.dropbox.com)|162.125.9.1|:443... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: /s/raw/o8lxi6yrezmt5em/reviews.txt [following]
--2019-11-04 17:16:23--  https://www.dropbox.com/s/raw/o8lxi6yrezmt5em/reviews.txt
Reusing existing connection to www.dropbox.com:443.
HTTP request sent, awaiting response... 302 Found
Location: https://ucb753980f94c903b140fb69cb47.dl.dropboxusercontent.com/cd/0/inline/AruGnazr2R1e797TKXdu6chwkg102fB893qSsoT5EeI2_mAFsj2rCinxKGPdm-HpQjOZqWQ21tvsPDpyA7PBxc7QxoDCWKG45GDwN1gZw3C7RlMLoxb8D9NG9IqmJ25IXJc/file# [following]
--2019-11-04 17:16:23--  https://ucb753980f94c903b140fb69cb47.dl.dropboxusercontent.com/cd/0/inline/AruGnazr2R1e797TKXdu6chwkg102fB893qSsoT5EeI2_mAFsj2rCinxKGPdm-HpQjOZqWQ21tvsPDpyA7PBxc7QxoDCWKG45GDwN1gZw3C7RlMLoxb8D9NG9IqmJ25IXJc/file
Resolving ucb753980f94c903b140fb69cb47.dl.dropboxusercontent.com (ucb753980f94c903b140fb69cb47.dl.dropboxusercontent.com)... 162.125.9.6, 2620:100:601f:6::a27d:906
Connecting to ucb753980f94c903b140fb69cb47.dl.dropboxusercontent.com (ucb753980f94c903b140fb69cb47.dl.dropboxusercontent.com)|162.125.9.6|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 3851 (3.8K) [text/plain]
Saving to: ‘reviews.txt’

reviews.txt         100%[===================>]   3.76K  --.-KB/s    in 0s      

2019-11-04 17:16:24 (328 MB/s) - ‘reviews.txt’ saved [3851/3851]
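As a hedged starting point for exercise steps 2 and 6, assuming reviews.txt is plain English text:

from collections import Counter
from nltk.tokenize import word_tokenize

#Read the downloaded file and tokenize it into words
with open('reviews.txt') as f:
    raw = f.read()

#Lowercase and keep only alphanumeric tokens
tokens = [t for t in word_tokenize(raw.lower()) if t.isalnum()]

#Top-10 unigrams by frequency
print(Counter(tokens).most_common(10))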