Transfer Learning - NLP
rpi.analyticsdojo.com
This is adapted from: Bag of Words Meets Bags of Popcorn (wendykan/DeepLearningMovies)
77. Transfer Learning - NLP
To be modeled meaningfully, words must first be turned into vectors. This notebook covers several approaches to text vectorization.
78. Bag of Words
import nltk
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import pyarrow as pa
import pyarrow.dataset as ds
import pyarrow.parquet as pq
from pathlib import Path
from bs4 import BeautifulSoup
from gensim import corpora, models, similarities
from gensim.models import Phrases
from gensim.models.coherencemodel import CoherenceModel
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
from gensim.models.phrases import Phraser
# import custom filters
from gensim.parsing.preprocessing import (
    preprocess_string, strip_tags, strip_punctuation, strip_numeric, stem_text,
    strip_multiple_whitespaces, strip_non_alphanum, remove_stopwords, strip_short,
)
from gensim.similarities import MatrixSimilarity
from gensim.test.utils import common_corpus, common_dictionary, common_texts, get_tmpfile
!wget https://github.com/rpi-techfundamentals/spring2019-materials/raw/master/input/labeledTrainData.tsv
!wget https://github.com/rpi-techfundamentals/spring2019-materials/raw/master/input/unlabeledTrainData.tsv
!wget https://github.com/rpi-techfundamentals/spring2019-materials/raw/master/input/testData.tsv
--2021-11-15 18:35:32-- https://github.com/rpi-techfundamentals/spring2019-materials/raw/master/input/labeledTrainData.tsv
Resolving github.com (github.com)... 140.82.114.3
Connecting to github.com (github.com)|140.82.114.3|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://raw.githubusercontent.com/rpi-techfundamentals/spring2019-materials/master/input/labeledTrainData.tsv [following]
--2021-11-15 18:35:32-- https://raw.githubusercontent.com/rpi-techfundamentals/spring2019-materials/master/input/labeledTrainData.tsv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.111.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 33556378 (32M) [text/plain]
Saving to: ‘labeledTrainData.tsv.4’
labeledTrainData.ts 100%[===================>] 32.00M 171MB/s in 0.2s
2021-11-15 18:35:32 (171 MB/s) - ‘labeledTrainData.tsv.4’ saved [33556378/33556378]
--2021-11-15 18:35:32-- https://github.com/rpi-techfundamentals/spring2019-materials/raw/master/input/unlabeledTrainData.tsv
Resolving github.com (github.com)... 140.82.114.3
Connecting to github.com (github.com)|140.82.114.3|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://raw.githubusercontent.com/rpi-techfundamentals/spring2019-materials/master/input/unlabeledTrainData.tsv [following]
--2021-11-15 18:35:33-- https://raw.githubusercontent.com/rpi-techfundamentals/spring2019-materials/master/input/unlabeledTrainData.tsv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.110.133, 185.199.108.133, 185.199.111.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.110.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 67281491 (64M) [text/plain]
Saving to: ‘unlabeledTrainData.tsv.4’
unlabeledTrainData. 100%[===================>] 64.16M 199MB/s in 0.3s
2021-11-15 18:35:33 (199 MB/s) - ‘unlabeledTrainData.tsv.4’ saved [67281491/67281491]
--2021-11-15 18:35:33-- https://github.com/rpi-techfundamentals/spring2019-materials/raw/master/input/testData.tsv
Resolving github.com (github.com)... 140.82.114.3
Connecting to github.com (github.com)|140.82.114.3|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://raw.githubusercontent.com/rpi-techfundamentals/spring2019-materials/master/input/testData.tsv [following]
--2021-11-15 18:35:33-- https://raw.githubusercontent.com/rpi-techfundamentals/spring2019-materials/master/input/testData.tsv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.109.133, 185.199.108.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.109.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 32724746 (31M) [text/plain]
Saving to: ‘testData.tsv.4’
testData.tsv.4 100%[===================>] 31.21M 161MB/s in 0.2s
2021-11-15 18:35:34 (161 MB/s) - ‘testData.tsv.4’ saved [32724746/32724746]
train = pd.read_csv('labeledTrainData.tsv', header=0,
                    delimiter="\t", quoting=3)
unlabeled_train = pd.read_csv('unlabeledTrainData.tsv', header=0,
                              delimiter="\t", quoting=3)
test = pd.read_csv('testData.tsv', header=0,
                   delimiter="\t", quoting=3)
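The `quoting=3` argument above is the numeric value of `csv.QUOTE_NONE`, which tells the parser to treat double quotes as ordinary characters. That is why the `id` and `review` values keep their surrounding quotes in the tables that follow. A minimal sketch with the standard-library `csv` module, using a made-up two-line file (the row content is hypothetical, not taken from the dataset):

```python
import csv
import io

# Hypothetical data in the same tab-separated layout as labeledTrainData.tsv.
raw = 'id\tsentiment\treview\n"1_8"\t1\t"He said \\"great\\" twice."\n'

# csv.QUOTE_NONE has the integer value 3, the same constant pandas accepts.
reader = csv.reader(io.StringIO(raw), delimiter="\t", quoting=csv.QUOTE_NONE)
header, first = list(reader)

print(header)
print(first)  # the quote characters are kept as part of each field
```

Stripping those literal quotes afterwards is left as a separate cleaning step.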
import os
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.ensemble import RandomForestClassifier
print(train.columns.values, test.columns.values)
['id' 'sentiment' 'review'] ['id' 'review']
train.head()
 | id | sentiment | review |
---|---|---|---|
0 | "5814_8" | 1 | "With all this stuff going down at the moment ... |
1 | "2381_9" | 1 | "\"The Classic War of the Worlds\" by Timothy ... |
2 | "7759_3" | 0 | "The film starts with a manager (Nicholas Bell... |
3 | "3630_4" | 0 | "It must be assumed that those who praised thi... |
4 | "9495_8" | 1 | "Superbly trashy and wondrously unpretentious ... |
print('The train shape is: ', train.shape)
print('The test shape is: ', test.shape)
The train shape is: (25000, 3)
The test shape is: (25000, 2)
print('The first review is:')
print(train["review"][0])
The first review is:
"With all this stuff going down at the moment with MJ i've started listening to his music, watching the odd documentary here and there, watched The Wiz and watched Moonwalker again. Maybe i just want to get a certain insight into this guy who i thought was really cool in the eighties just to maybe make up my mind whether he is guilty or innocent. Moonwalker is part biography, part feature film which i remember going to see at the cinema when it was originally released. Some of it has subtle messages about MJ's feeling towards the press and also the obvious message of drugs are bad m'kay.<br /><br />Visually impressive but of course this is all about Michael Jackson so unless you remotely like MJ in anyway then you are going to hate this and find it boring. Some may call MJ an egotist for consenting to the making of this movie BUT MJ and most of his fans would say that he made it for the fans which if true is really nice of him.<br /><br />The actual feature film bit when it finally starts is only on for 20 minutes or so excluding the Smooth Criminal sequence and Joe Pesci is convincing as a psychopathic all powerful drug lord. Why he wants MJ dead so bad is beyond me. Because MJ overheard his plans? Nah, Joe Pesci's character ranted that he wanted people to know it is he who is supplying drugs etc so i dunno, maybe he just hates MJ's music.<br /><br />Lots of cool things in this like MJ turning into a car and a robot and the whole Speed Demon sequence. Also, the director must have had the patience of a saint when it came to filming the kiddy Bad sequence as usually directors hate working with one kid let alone a whole bunch of them performing a complex dance scene.<br /><br />Bottom line, this movie is for people who like MJ on one level or another (which i think is most people). If not, then stay away. It does try and give off a wholesome message and ironically MJ's bestest buddy in this movie is a girl! 
Michael Jackson is truly one of the most talented people ever to grace this planet but is he guilty? Well, with all the attention i've gave this subject....hmmm well i don't know because people can be different behind closed doors, i know this for a fact. He is either an extremely nice but stupid guy or one of the most sickest liars. I hope he is not the latter."
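The review printed above still contains HTML markup such as `<br />`, which would otherwise end up as tokens. The notebook imports `BeautifulSoup` and gensim's `strip_tags` for this kind of cleaning; the sketch below swaps in a standard-library regular expression so that it runs without extra packages. The helper name `strip_markup` is hypothetical.

```python
import html
import re

def strip_markup(review: str) -> str:
    """Replace HTML tags with spaces, unescape entities, collapse whitespace."""
    no_tags = re.sub(r"<[^>]+>", " ", review)
    unescaped = html.unescape(no_tags)
    return re.sub(r"\s+", " ", unescaped).strip()

# A fragment of the first review, including its <br /> tags.
sample = "drugs are bad m'kay.<br /><br />Visually impressive but of course"
cleaned = strip_markup(sample)
print(cleaned)
```

A regular expression is only a rough approximation of HTML parsing; for messier markup a real parser such as BeautifulSoup is the safer choice.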
train
 | id | sentiment | review |
---|---|---|---|
0 | "5814_8" | 1 | "With all this stuff going down at the moment ... |
1 | "2381_9" | 1 | "\"The Classic War of the Worlds\" by Timothy ... |
2 | "7759_3" | 0 | "The film starts with a manager (Nicholas Bell... |
3 | "3630_4" | 0 | "It must be assumed that those who praised thi... |
4 | "9495_8" | 1 | "Superbly trashy and wondrously unpretentious ... |
... | ... | ... | ... |
24995 | "3453_3" | 0 | "It seems like more consideration has gone int... |
24996 | "5064_1" | 0 | "I don't believe they made this film. Complete... |
24997 | "10905_3" | 0 | "Guy is a loser. Can't get girls, needs to bui... |
24998 | "10194_3" | 0 | "This 30 minute documentary Buñuel made in the... |
24999 | "8478_8" | 1 | "I saw this movie as a child and it broke my h... |
25000 rows × 3 columns
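With the reviews loaded, the simplest vectorization named in this section's title is a bag of words: each document becomes a vector of token counts over a shared vocabulary. The sketch below shows the idea in plain Python on two made-up sentences (hypothetical, not from the dataset). The `CountVectorizer` imported earlier builds the same kind of matrix, although its default tokenization differs, for example by dropping single-character tokens.

```python
from collections import Counter

# Hypothetical toy reviews.
docs = ["a great great movie", "a boring movie"]

# Tokenize by lowercasing and splitting on whitespace.
tokenized = [doc.lower().split() for doc in docs]

# The vocabulary is the sorted set of all tokens seen in any document.
vocab = sorted({token for tokens in tokenized for token in tokens})

def bow_vector(tokens, vocab):
    """Count how often each vocabulary word occurs in one document."""
    counts = Counter(tokens)
    return [counts[word] for word in vocab]

matrix = [bow_vector(tokens, vocab) for tokens in tokenized]

print(vocab)
for row in matrix:
    print(row)
```

Word order is discarded entirely, which is the main limitation that the embedding approaches later in the notebook try to address.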
import tensorflow as tf
import tensorflow_hub as hub
import tensorflow_text as text
def embed_univ(df, column):
    # Universal Sentence Encoder v4 (the current encoder as of May 20th, 2021)
    encoder_lib_url = "https://tfhub.dev/google/universal-sentence-encoder/4"
    embed = hub.load(encoder_lib_url)
    message_embeddings = embed(df[column])
    # Work on a copy so that a sliced input does not raise SettingWithCopyWarning
    df = df.copy()
    # Reuse the DataFrame's index so each embedding lines up with its row
    df[column + '_universal'] = pd.Series(message_embeddings.numpy().tolist(), index=df.index)
    return df
train2 = embed_univ(train.iloc[0:10, :], 'review')
train2
 | id | sentiment | review | review_universal |
---|---|---|---|---|
0 | "5814_8" | 1 | "With all this stuff going down at the moment ... | [0.030300239101052284, 0.0033060263376682997, ... |
1 | "2381_9" | 1 | "\"The Classic War of the Worlds\" by Timothy ... | [-0.04255800321698189, -0.04781642183661461, -... |
2 | "7759_3" | 0 | "The film starts with a manager (Nicholas Bell... | [-0.05121680349111557, 0.030820466578006744, 0... |
3 | "3630_4" | 0 | "It must be assumed that those who praised thi... | [-0.025275127962231636, 0.051208171993494034, ... |
4 | "9495_8" | 1 | "Superbly trashy and wondrously unpretentious ... | [-0.01964237168431282, 0.052018746733665466, -... |
5 | "8196_8" | 1 | "I dont know why people think this is such a b... | [-0.009250563569366932, 0.0061204154044389725,... |
6 | "7166_2" | 0 | "This movie could have been very good, but com... | [-0.02197437360882759, -0.02234342321753502, 0... |
7 | "10633_1" | 0 | "I watched this video at a friend's house. I'm... | [-0.008400843478739262, 0.06209466978907585, 0... |
8 | "319_1" | 0 | "A friend of mine bought this film for £1, and... | [-0.025548789650201797, 0.01659647934138775, 0... |
9 | "8713_10" | 1 | "<br /><br />This movie is full of references.... | [0.038312334567308426, 0.019368555396795273, 0... |
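Each entry in `review_universal` is a fixed-length vector (512 dimensions for this encoder), so two reviews can be compared by the cosine of the angle between their vectors. A minimal sketch in plain Python; the short vectors are hypothetical stand-ins for real embeddings, and the helper name `cosine_similarity` is an assumption of this example:

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention for a zero vector
    return dot / (norm_u * norm_v)

# Hypothetical low-dimensional vectors standing in for sentence embeddings.
review_a = [0.03, 0.00, -0.04]
review_b = [0.06, 0.00, -0.08]   # same direction as review_a
review_c = [0.04, 0.00, 0.03]    # orthogonal to review_a

print(cosine_similarity(review_a, review_b))
print(cosine_similarity(review_a, review_c))
```

On the DataFrame above, a call such as `cosine_similarity(train2['review_universal'][0], train2['review_universal'][1])` would compare the first two reviews.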
# Configure the model
BERT_MODEL = "https://tfhub.dev/google/experts/bert/wiki_books/2"
# Preprocessing must match the BERT model chosen above.
PREPROCESS_MODEL = "https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3"
def embed_bert(df, column):
    preprocess = hub.load(PREPROCESS_MODEL)
    bert = hub.load(BERT_MODEL)
    inputs = preprocess(df[column])
    outputs = bert(inputs)
    # Work on a copy so that a sliced input does not raise SettingWithCopyWarning
    df = df.copy()
    # pooled_output holds one fixed-length vector per input text
    df[column + '_bert'] = pd.Series(outputs["pooled_output"].numpy().tolist(), index=df.index)
    return df
train2 = embed_bert(train2.iloc[0:10, :], 'review')
train2
 | id | sentiment | review | review_universal | review_bert |
---|---|---|---|---|---|
0 | "5814_8" | 1 | "With all this stuff going down at the moment ... | [0.030300239101052284, 0.0033060263376682997, ... | [0.9035062193870544, -0.21512015163898468, 0.6... |
1 | "2381_9" | 1 | "\"The Classic War of the Worlds\" by Timothy ... | [-0.04255800321698189, -0.04781642183661461, -... | [0.9060668349266052, 0.08904127031564713, 0.78... |
2 | "7759_3" | 0 | "The film starts with a manager (Nicholas Bell... | [-0.05121680349111557, 0.030820466578006744, 0... | [0.9220373630523682, -0.6700941324234009, 0.80... |
3 | "3630_4" | 0 | "It must be assumed that those who praised thi... | [-0.025275127962231636, 0.051208171993494034, ... | [0.8882725834846497, -0.07439149171113968, 0.6... |
4 | "9495_8" | 1 | "Superbly trashy and wondrously unpretentious ... | [-0.01964237168431282, 0.052018746733665466, -... | [0.8837865591049194, 0.17711809277534485, 0.55... |
5 | "8196_8" | 1 | "I dont know why people think this is such a b... | [-0.009250563569366932, 0.0061204154044389725,... | [0.903205394744873, 0.24590986967086792, 0.663... |
6 | "7166_2" | 0 | "This movie could have been very good, but com... | [-0.02197437360882759, -0.02234342321753502, 0... | [0.8890115022659302, 0.18718616664409637, 0.62... |
7 | "10633_1" | 0 | "I watched this video at a friend's house. I'm... | [-0.008400843478739262, 0.06209466978907585, 0... | [0.904928982257843, -0.4985009431838989, 0.647... |
8 | "319_1" | 0 | "A friend of mine bought this film for £1, and... | [-0.025548789650201797, 0.01659647934138775, 0... | [0.8822431564331055, -0.29646652936935425, 0.2... |
9 | "8713_10" | 1 | "<br /><br />This movie is full of references.... | [0.038312334567308426, 0.019368555396795273, 0... | [0.9010588526725769, -0.0026878498028963804, 0... |