Bag of Words
rpi.analyticsdojo.com
This is adapted from: Bag of Words Meets Bags of Popcorn (wendykan/DeepLearningMovies)
46. Bag of Words#
import nltk
import pandas as pd
import numpy as np
from bs4 import BeautifulSoup
from nltk.corpus import stopwords
!wget https://github.com/rpi-techfundamentals/spring2019-materials/raw/master/input/labeledTrainData.tsv
!wget https://github.com/rpi-techfundamentals/spring2019-materials/raw/master/input/unlabeledTrainData.tsv
!wget https://github.com/rpi-techfundamentals/spring2019-materials/raw/master/input/testData.tsv
--2019-03-14 00:57:13--  https://github.com/rpi-techfundamentals/spring2019-materials/raw/master/input/labeledTrainData.tsv
Resolving github.com (github.com)... 192.30.253.113, 192.30.253.112
Connecting to github.com (github.com)|192.30.253.113|:443... connected.
HTTP request sent, awaiting response... 302 Found
...
2019-03-14 00:57:14 (191 MB/s) - ‘labeledTrainData.tsv’ saved [33556378/33556378]
2019-03-14 00:57:15 (234 MB/s) - ‘unlabeledTrainData.tsv’ saved [67281491/67281491]
2019-03-14 00:57:17 (147 MB/s) - ‘testData.tsv’ saved [32724746/32724746]
train = pd.read_csv('labeledTrainData.tsv', header=0, delimiter="\t", quoting=3)
unlabeled_train = pd.read_csv('unlabeledTrainData.tsv', header=0, delimiter="\t", quoting=3)
test = pd.read_csv('testData.tsv', header=0, delimiter="\t", quoting=3)
import os
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.ensemble import RandomForestClassifier
import pandas as pd
import numpy as np
print(train.columns.values, test.columns.values)
['id' 'sentiment' 'review'] ['id' 'review']
train.head()
|   | id | sentiment | review |
|---|---|---|---|
| 0 | "5814_8" | 1 | "With all this stuff going down at the moment ... |
| 1 | "2381_9" | 1 | "\"The Classic War of the Worlds\" by Timothy ... |
| 2 | "7759_3" | 0 | "The film starts with a manager (Nicholas Bell... |
| 3 | "3630_4" | 0 | "It must be assumed that those who praised thi... |
| 4 | "9495_8" | 1 | "Superbly trashy and wondrously unpretentious ... |
print('The train shape is: ', train.shape)
print('The test shape is: ', test.shape)
The train shape is: (25000, 3)
The test shape is: (25000, 2)
print('The first review is:')
print(train["review"][0])
The first review is:
"With all this stuff going down at the moment with MJ i've started listening to his music, watching the odd documentary here and there, watched The Wiz and watched Moonwalker again. Maybe i just want to get a certain insight into this guy who i thought was really cool in the eighties just to maybe make up my mind whether he is guilty or innocent. Moonwalker is part biography, part feature film which i remember going to see at the cinema when it was originally released. Some of it has subtle messages about MJ's feeling towards the press and also the obvious message of drugs are bad m'kay.<br /><br />Visually impressive but of course this is all about Michael Jackson so unless you remotely like MJ in anyway then you are going to hate this and find it boring. Some may call MJ an egotist for consenting to the making of this movie BUT MJ and most of his fans would say that he made it for the fans which if true is really nice of him.<br /><br />The actual feature film bit when it finally starts is only on for 20 minutes or so excluding the Smooth Criminal sequence and Joe Pesci is convincing as a psychopathic all powerful drug lord. Why he wants MJ dead so bad is beyond me. Because MJ overheard his plans? Nah, Joe Pesci's character ranted that he wanted people to know it is he who is supplying drugs etc so i dunno, maybe he just hates MJ's music.<br /><br />Lots of cool things in this like MJ turning into a car and a robot and the whole Speed Demon sequence. Also, the director must have had the patience of a saint when it came to filming the kiddy Bad sequence as usually directors hate working with one kid let alone a whole bunch of them performing a complex dance scene.<br /><br />Bottom line, this movie is for people who like MJ on one level or another (which i think is most people). If not, then stay away. It does try and give off a wholesome message and ironically MJ's bestest buddy in this movie is a girl! Michael Jackson is truly one of the most talented people ever to grace this planet but is he guilty? Well, with all the attention i've gave this subject....hmmm well i don't know because people can be different behind closed doors, i know this for a fact. He is either an extremely nice but stupid guy or one of the most sickest liars. I hope he is not the latter."
# Import BeautifulSoup into your workspace
from bs4 import BeautifulSoup
# Initialize the BeautifulSoup object on a single movie review
example1 = BeautifulSoup(train["review"][0], "html.parser" )
print(example1.get_text())
"With all this stuff going down at the moment with MJ i've started listening to his music, watching the odd documentary here and there, watched The Wiz and watched Moonwalker again. Maybe i just want to get a certain insight into this guy who i thought was really cool in the eighties just to maybe make up my mind whether he is guilty or innocent. Moonwalker is part biography, part feature film which i remember going to see at the cinema when it was originally released. Some of it has subtle messages about MJ's feeling towards the press and also the obvious message of drugs are bad m'kay.Visually impressive but of course this is all about Michael Jackson so unless you remotely like MJ in anyway then you are going to hate this and find it boring. Some may call MJ an egotist for consenting to the making of this movie BUT MJ and most of his fans would say that he made it for the fans which if true is really nice of him.The actual feature film bit when it finally starts is only on for 20 minutes or so excluding the Smooth Criminal sequence and Joe Pesci is convincing as a psychopathic all powerful drug lord. Why he wants MJ dead so bad is beyond me. Because MJ overheard his plans? Nah, Joe Pesci's character ranted that he wanted people to know it is he who is supplying drugs etc so i dunno, maybe he just hates MJ's music.Lots of cool things in this like MJ turning into a car and a robot and the whole Speed Demon sequence. Also, the director must have had the patience of a saint when it came to filming the kiddy Bad sequence as usually directors hate working with one kid let alone a whole bunch of them performing a complex dance scene.Bottom line, this movie is for people who like MJ on one level or another (which i think is most people). If not, then stay away. It does try and give off a wholesome message and ironically MJ's bestest buddy in this movie is a girl! Michael Jackson is truly one of the most talented people ever to grace this planet but is he guilty? Well, with all the attention i've gave this subject....hmmm well i don't know because people can be different behind closed doors, i know this for a fact. He is either an extremely nice but stupid guy or one of the most sickest liars. I hope he is not the latter."
import re
# Use regular expressions to do a find-and-replace
letters_only = re.sub("[^a-zA-Z]",          # The pattern to search for
                      " ",                  # The pattern to replace it with
                      example1.get_text())  # The text to search
print (letters_only)
With all this stuff going down at the moment with MJ i ve started listening to his music watching the odd documentary here and there watched The Wiz and watched Moonwalker again Maybe i just want to get a certain insight into this guy who i thought was really cool in the eighties just to maybe make up my mind whether he is guilty or innocent Moonwalker is part biography part feature film which i remember going to see at the cinema when it was originally released Some of it has subtle messages about MJ s feeling towards the press and also the obvious message of drugs are bad m kay Visually impressive but of course this is all about Michael Jackson so unless you remotely like MJ in anyway then you are going to hate this and find it boring Some may call MJ an egotist for consenting to the making of this movie BUT MJ and most of his fans would say that he made it for the fans which if true is really nice of him The actual feature film bit when it finally starts is only on for minutes or so excluding the Smooth Criminal sequence and Joe Pesci is convincing as a psychopathic all powerful drug lord Why he wants MJ dead so bad is beyond me Because MJ overheard his plans Nah Joe Pesci s character ranted that he wanted people to know it is he who is supplying drugs etc so i dunno maybe he just hates MJ s music Lots of cool things in this like MJ turning into a car and a robot and the whole Speed Demon sequence Also the director must have had the patience of a saint when it came to filming the kiddy Bad sequence as usually directors hate working with one kid let alone a whole bunch of them performing a complex dance scene Bottom line this movie is for people who like MJ on one level or another which i think is most people If not then stay away It does try and give off a wholesome message and ironically MJ s bestest buddy in this movie is a girl Michael Jackson is truly one of the most talented people ever to grace this planet but is he guilty Well with all the attention i ve gave this subject hmmm well i don t know because people can be different behind closed doors i know this for a fact He is either an extremely nice but stupid guy or one of the most sickest liars I hope he is not the latter
lower_case = letters_only.lower() # Convert to lower case
words = lower_case.split() # Split into words
# Download the NLTK stopwords corpus.
nltk.download('stopwords')
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data] Unzipping corpora/stopwords.zip.
True
print (stopwords.words("english"))
['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', 'too', 'very', 's', 't', 'can', 'will', 'just', 'don', "don't", 'should', "should've", 'now', 'd', 'll', 'm', 'o', 're', 've', 'y', 'ain', 'aren', "aren't", 'couldn', "couldn't", 'didn', "didn't", 'doesn', "doesn't", 'hadn', "hadn't", 'hasn', "hasn't", 'haven', "haven't", 'isn', "isn't", 'ma', 'mightn', "mightn't", 'mustn', "mustn't", 'needn', "needn't", 'shan', "shan't", 'shouldn', "shouldn't", 'wasn', "wasn't", 'weren', "weren't", 'won', "won't", 'wouldn', "wouldn't"]
# Remove stop words from "words"
words = [w for w in words if not w in stopwords.words("english")]
print (words)
['stuff', 'going', 'moment', 'mj', 'started', 'listening', 'music', 'watching', 'odd', 'documentary', 'watched', 'wiz', 'watched', 'moonwalker', 'maybe', 'want', 'get', 'certain', 'insight', 'guy', 'thought', 'really', 'cool', 'eighties', 'maybe', 'make', 'mind', 'whether', 'guilty', 'innocent', 'moonwalker', 'part', 'biography', 'part', 'feature', 'film', 'remember', 'going', 'see', 'cinema', 'originally', 'released', 'subtle', 'messages', 'mj', 'feeling', 'towards', 'press', 'also', 'obvious', 'message', 'drugs', 'bad', 'kay', 'visually', 'impressive', 'course', 'michael', 'jackson', 'unless', 'remotely', 'like', 'mj', 'anyway', 'going', 'hate', 'find', 'boring', 'may', 'call', 'mj', 'egotist', 'consenting', 'making', 'movie', 'mj', 'fans', 'would', 'say', 'made', 'fans', 'true', 'really', 'nice', 'actual', 'feature', 'film', 'bit', 'finally', 'starts', 'minutes', 'excluding', 'smooth', 'criminal', 'sequence', 'joe', 'pesci', 'convincing', 'psychopathic', 'powerful', 'drug', 'lord', 'wants', 'mj', 'dead', 'bad', 'beyond', 'mj', 'overheard', 'plans', 'nah', 'joe', 'pesci', 'character', 'ranted', 'wanted', 'people', 'know', 'supplying', 'drugs', 'etc', 'dunno', 'maybe', 'hates', 'mj', 'music', 'lots', 'cool', 'things', 'like', 'mj', 'turning', 'car', 'robot', 'whole', 'speed', 'demon', 'sequence', 'also', 'director', 'must', 'patience', 'saint', 'came', 'filming', 'kiddy', 'bad', 'sequence', 'usually', 'directors', 'hate', 'working', 'one', 'kid', 'let', 'alone', 'whole', 'bunch', 'performing', 'complex', 'dance', 'scene', 'bottom', 'line', 'movie', 'people', 'like', 'mj', 'one', 'level', 'another', 'think', 'people', 'stay', 'away', 'try', 'give', 'wholesome', 'message', 'ironically', 'mj', 'bestest', 'buddy', 'movie', 'girl', 'michael', 'jackson', 'truly', 'one', 'talented', 'people', 'ever', 'grace', 'planet', 'guilty', 'well', 'attention', 'gave', 'subject', 'hmmm', 'well', 'know', 'people', 'different', 'behind', 'closed', 'doors', 'know', 'fact', 'either', 'extremely', 'nice', 'stupid', 'guy', 'one', 'sickest', 'liars', 'hope', 'latter']
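Note that the list comprehension above re-evaluates stopwords.words("english") on every iteration. A minimal speed-up sketch (the same idea the utility class below uses) is to build the stop word set once:

# Build the stop word set once; set membership tests are O(1),
# which matters when filtering thousands of reviews
stops = set(stopwords.words("english"))
words = [w for w in words if w not in stops]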
# Now we bundle these cleaning steps into a reusable utility class.
class KaggleWord2VecUtility(object):
    """KaggleWord2VecUtility is a utility class for processing raw HTML text into segments for further learning."""

    @staticmethod
    def review_to_wordlist(review, remove_stopwords=False):
        # Convert a document to a sequence of words,
        # optionally removing stop words. Returns a list of words.
        #
        # 1. Remove HTML
        review_text = BeautifulSoup(review, "html.parser").get_text()
        #
        # 2. Remove non-letters
        review_text = re.sub("[^a-zA-Z]", " ", review_text)
        #
        # 3. Convert words to lower case and split them
        words = review_text.lower().split()
        #
        # 4. Optionally remove stop words (False by default)
        if remove_stopwords:
            stops = set(stopwords.words("english"))
            words = [w for w in words if w not in stops]
        #
        # 5. Return a list of words
        return words

    # Define a function to split a review into parsed sentences
    @staticmethod
    def review_to_sentences(review, tokenizer, remove_stopwords=False):
        # Split a review into parsed sentences. Returns a
        # list of sentences, where each sentence is a list of words.
        #
        # 1. Use the NLTK tokenizer to split the paragraph into sentences
        raw_sentences = tokenizer.tokenize(review.strip())
        #
        # 2. Loop over each sentence
        sentences = []
        for raw_sentence in raw_sentences:
            # If a sentence is empty, skip it
            if len(raw_sentence) > 0:
                # Otherwise, call review_to_wordlist to get a list of words
                sentences.append(KaggleWord2VecUtility.review_to_wordlist(raw_sentence, remove_stopwords))
        #
        # Return the list of sentences (each sentence is a list of words,
        # so this returns a list of lists)
        return sentences
# Clean a single review, removing stop words
clean_review_word = KaggleWord2VecUtility.review_to_wordlist(train["review"][0], True)
# Get the number of reviews based on the dataframe column size
num_reviews = train["review"].size
print ("Cleaning and parsing the training set movie reviews...\n")
clean_train_reviews = []
for i in range(0, len(train["review"])):
    if (i + 1) % 1000 == 0:
        print("Review %d of %d\n" % (i + 1, num_reviews))
    clean_train_reviews.append(" ".join(KaggleWord2VecUtility.review_to_wordlist(train["review"][i], True)))
Cleaning and parsing the training set movie reviews...
Review 1000 of 25000
Review 2000 of 25000
...
Review 25000 of 25000
clean_train_reviews[0:5]
['stuff going moment mj started listening music watching odd documentary watched wiz watched moonwalker maybe want get certain insight guy thought really cool eighties maybe make mind whether guilty innocent moonwalker part biography part feature film remember going see cinema originally released subtle messages mj feeling towards press also obvious message drugs bad kay visually impressive course michael jackson unless remotely like mj anyway going hate find boring may call mj egotist consenting making movie mj fans would say made fans true really nice actual feature film bit finally starts minutes excluding smooth criminal sequence joe pesci convincing psychopathic powerful drug lord wants mj dead bad beyond mj overheard plans nah joe pesci character ranted wanted people know supplying drugs etc dunno maybe hates mj music lots cool things like mj turning car robot whole speed demon sequence also director must patience saint came filming kiddy bad sequence usually directors hate working one kid let alone whole bunch performing complex dance scene bottom line movie people like mj one level another think people stay away try give wholesome message ironically mj bestest buddy movie girl michael jackson truly one talented people ever grace planet guilty well attention gave subject hmmm well know people different behind closed doors know fact either extremely nice stupid guy one sickest liars hope latter',
'classic war worlds timothy hines entertaining film obviously goes great effort lengths faithfully recreate h g wells classic book mr hines succeeds watched film appreciated fact standard predictable hollywood fare comes every year e g spielberg version tom cruise slightest resemblance book obviously everyone looks different things movie envision amateur critics look criticize everything others rate movie important bases like entertained people never agree critics enjoyed effort mr hines put faithful h g wells classic novel found entertaining made easy overlook critics perceive shortcomings',
'film starts manager nicholas bell giving welcome investors robert carradine primal park secret project mutating primal animal using fossilized dna like jurassik park scientists resurrect one nature fearsome predators sabretooth tiger smilodon scientific ambition turns deadly however high voltage fence opened creature escape begins savagely stalking prey human visitors tourists scientific meanwhile youngsters enter restricted area security center attacked pack large pre historical animals deadlier bigger addition security agent stacy haiduk mate brian wimmer fight hardly carnivorous smilodons sabretooths course real star stars astounding terrifyingly though convincing giant animals savagely stalking prey group run afoul fight one nature fearsome predators furthermore third sabretooth dangerous slow stalks victims movie delivers goods lots blood gore beheading hair raising chills full scares sabretooths appear mediocre special effects story provides exciting stirring entertainment results quite boring giant animals majority made computer generator seem totally lousy middling performances though players reacting appropriately becoming food actors give vigorously physical performances dodging beasts running bound leaps dangling walls packs ridiculous final deadly scene small kids realistic gory violent attack scenes films sabretooths smilodon following sabretooth james r hickox vanessa angel david keith john rhys davies much better bc roland emmerich steven strait cliff curtis camilla belle motion picture filled bloody moments badly directed george miller originality takes many elements previous films miller australian director usually working television tidal wave journey center earth many others occasionally cinema man snowy river zeus roxanne robinson crusoe rating average bottom barrel',
'must assumed praised film greatest filmed opera ever read somewhere either care opera care wagner care anything except desire appear cultured either representation wagner swan song movie strikes unmitigated disaster leaden reading score matched tricksy lugubrious realisation text questionable people ideas opera matter play especially one shakespeare allowed anywhere near theatre film studio syberberg fashionably without smallest justification wagner text decided parsifal bisexual integration title character latter stages transmutes kind beatnik babe though one continues sing high tenor actors film singers get double dose armin jordan conductor seen face heard voice amfortas also appears monstrously double exposure kind batonzilla conductor ate monsalvat playing good friday music way transcendant loveliness nature represented scattering shopworn flaccid crocuses stuck ill laid turf expedient baffles theatre sometimes piece imperfections thoughts think syberberg splice parsifal gurnemanz mountain pasture lush provided julie andrews sound music sound hard endure high voices trumpets particular possessing aural glare adds another sort fatigue impatience uninspired conducting paralytic unfolding ritual someone another review mentioned bayreuth recording knappertsbusch though tempi often slow jordan altogether lacks sense pulse feeling ebb flow music half century orchestral sound set modern pressings still superior film',
'superbly trashy wondrously unpretentious exploitation hooray pre credits opening sequences somewhat give false impression dealing serious harrowing drama need fear barely ten minutes later necks nonsensical chainsaw battles rough fist fights lurid dialogs gratuitous nudity bo ingrid two orphaned siblings unusually close even slightly perverted relationship imagine playfully ripping towel covers sister naked body stare unshaven genitals several whole minutes well bo sister judging dubbed laughter mind sick dude anyway kids fled russia parents nasty soldiers brutally slaughtered mommy daddy friendly smuggler took custody however even raised trained bo ingrid expert smugglers actual plot lifts years later facing ultimate quest mythical incredibly valuable white fire diamond coincidentally found mine things life ever made little sense plot narrative structure white fire sure lot fun watch time clue beating cause bet actors understood even less whatever violence magnificently grotesque every single plot twist pleasingly retarded script goes totally bonkers beyond repair suddenly reveal reason bo needs replacement ingrid fred williamson enters scene big cigar mouth sleazy black fingers local prostitutes bo principal opponent italian chick big breasts hideous accent preposterous catchy theme song plays least dozen times throughout film obligatory falling love montage loads attractions god brilliant experience original french title translates life survive uniquely appropriate makes much sense rest movie none']
print ("Creating the bag of words...\n")
from sklearn.feature_extraction.text import CountVectorizer
# Initialize the "CountVectorizer" object, which is scikit-learn's
# bag of words tool.
vectorizer = CountVectorizer(analyzer="word",
                             tokenizer=None,
                             preprocessor=None,
                             stop_words=None,
                             max_features=5000)
Creating the bag of words...
train_data_features = vectorizer.fit_transform(clean_train_reviews)
train_data_features = train_data_features.toarray()
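It can be useful to peek at what the vectorizer learned before training a model. A short sketch using the fitted vectorizer above (note: get_feature_names() is the scikit-learn API of this era; newer releases rename it get_feature_names_out()):

# The vocabulary: the 5,000 most frequent tokens kept as features
vocab = vectorizer.get_feature_names()
print(vocab[:10])

# Sum each column to get the total count of each word in the training set
dist = np.sum(train_data_features, axis=0)
for count, tag in sorted(zip(dist, vocab), reverse=True)[:10]:
    print(count, tag)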
print ("Training the random forest (this may take a while)...")
# Initialize a Random Forest classifier with 100 trees
forest = RandomForestClassifier(n_estimators = 100)
Training the random forest (this may take a while)...
# Fit the forest to the training set, using the bag of words as
# features and the sentiment labels as the response variable
#
# This may take a few minutes to run
forest = forest.fit( train_data_features, train["sentiment"] )
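A quick, optional sanity check (not in the original notebook) is to map the forest's feature importances back to vocabulary words, to see which tokens drive the predictions:

# Top 10 bag-of-words features by Random Forest importance
vocab = vectorizer.get_feature_names()
for score, word in sorted(zip(forest.feature_importances_, vocab), reverse=True)[:10]:
    print("%.4f  %s" % (score, word))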
# Create an empty list and append the clean reviews one by one
clean_test_reviews = []
print ("Cleaning and parsing the test set movie reviews...\n")
for i in range(0, len(test["review"])):
    clean_test_reviews.append(" ".join(KaggleWord2VecUtility.review_to_wordlist(test["review"][i], True)))
Cleaning and parsing the test set movie reviews...
# Get a bag of words for the test set, and convert to a numpy array
test_data_features = vectorizer.transform(clean_test_reviews)
test_data_features = test_data_features.toarray()
# Use the random forest to make sentiment label predictions
print ("Predicting test labels...\n")
result = forest.predict(test_data_features)
# Copy the results to a pandas dataframe with an "id" column and
# a "sentiment" column
output = pd.DataFrame( data={"id":test["id"], "sentiment":result} )
# Use pandas to write the comma-separated output file
output.to_csv('Bag_of_Words_model.csv', index=False, quoting=3)
print ("Wrote results to Bag_of_Words_model.csv")
Predicting test labels...
Wrote results to Bag_of_Words_model.csv
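Because the Kaggle test set is unlabeled, the file above can only be scored by submitting it. To estimate accuracy locally, one option (a sketch, not part of the original pipeline) is to hold out part of the labeled training data:

from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Hold out 20% of the labeled reviews for validation
X_tr, X_val, y_tr, y_val = train_test_split(
    train_data_features, train["sentiment"], test_size=0.2, random_state=42)
rf = RandomForestClassifier(n_estimators=100)
rf.fit(X_tr, y_tr)
print("Validation accuracy:", accuracy_score(y_val, rf.predict(X_val)))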
46.1. Word2Vec#
#!pip install gensim
import pandas as pd
import os
from nltk.corpus import stopwords
import nltk.data
import logging
import numpy as np # Make sure that numpy is imported
from gensim.models import Word2Vec
from sklearn.ensemble import RandomForestClassifier
“In lexical analysis, tokenization is the process of breaking a stream of text up into words, phrases, symbols, or other meaningful elements called tokens. The list of tokens becomes input for further processing such as parsing or text mining.” — [Wikipedia](https://en.wikipedia.org/wiki/Tokenization_(lexical_analysis))

Punkt is NLTK's pretrained, unsupervised sentence tokenizer. http://www.nltk.org/_modules/nltk/tokenize/punkt.html
# download punkt
nltk.download('punkt')
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data] Unzipping tokenizers/punkt.zip.
True
# Load the pretrained Punkt sentence tokenizer for English
# http://www.nltk.org/_modules/nltk/tokenize/punkt.html
tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')
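A quick illustration of what the tokenizer does, on a made-up two-sentence string:

# Split a small piece of text into sentences
sample = "The movie was great. I would watch it again!"
print(tokenizer.tokenize(sample))
# ['The movie was great.', 'I would watch it again!']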
# ****** Split the labeled and unlabeled training sets into clean sentences
# Note this will take a while and produce some warnings.
sentences = [] # Initialize an empty list of sentences
print ("Parsing sentences from training set")
for review in train["review"]:
    sentences += KaggleWord2VecUtility.review_to_sentences(review, tokenizer)
Parsing sentences from training set
/usr/local/lib/python3.6/dist-packages/bs4/__init__.py:273: UserWarning: "b'.'" looks like a filename, not markup. You should probably open this file and pass the filehandle into Beautiful Soup.
/usr/local/lib/python3.6/dist-packages/bs4/__init__.py:336: UserWarning: "http://www.happierabroad.com"" looks like a URL. Beautiful Soup is not an HTTP client. You should probably use an HTTP client like requests to get the document behind the URL, and feed that document to Beautiful Soup.
print ("Parsing sentences from unlabeled set")
for review in unlabeled_train["review"]:
    sentences += KaggleWord2VecUtility.review_to_sentences(review, tokenizer)
Parsing sentences from unlabeled set
/usr/local/lib/python3.6/dist-packages/bs4/__init__.py:273: UserWarning: "b'.'" looks like a filename, not markup. You should probably open this file and pass the filehandle into Beautiful Soup.
/usr/local/lib/python3.6/dist-packages/bs4/__init__.py:336: UserWarning: "http://www.archive.org/details/LovefromaStranger"" looks like a URL. Beautiful Soup is not an HTTP client. You should probably use an HTTP client like requests to get the document behind the URL, and feed that document to Beautiful Soup.
...
# ****** Define functions to create average word vectors
#
def makeFeatureVec(words, model, num_features):
    # Average all of the word vectors in a given paragraph.
    #
    # Pre-initialize an empty numpy array (for speed)
    featureVec = np.zeros((num_features,), dtype="float32")
    nwords = 0.
    # index2word is a list containing the names of the words in
    # the model's vocabulary. Convert it to a set, for speed.
    index2word_set = set(model.wv.index2word)
    # Loop over each word in the review and, if it is in the model's
    # vocabulary, add its feature vector to the total
    for word in words:
        if word in index2word_set:
            nwords = nwords + 1.
            featureVec = np.add(featureVec, model.wv[word])
    # Divide the result by the number of words to get the average
    # (note: if no words were in the vocabulary, this divides by zero
    # and yields NaNs)
    featureVec = np.divide(featureVec, nwords)
    return featureVec
def getAvgFeatureVecs(reviews, model, num_features):
    # Given a set of reviews (each one a list of words), calculate
    # the average feature vector for each one and return a 2D numpy array.
    #
    # Initialize a counter (an int, since it is also used as an array index)
    counter = 0
    # Preallocate a 2D numpy array, for speed
    reviewFeatureVecs = np.zeros((len(reviews), num_features), dtype="float32")
    # Loop through the reviews
    for review in reviews:
        # Print a status message every 1000th review
        if counter % 1000 == 0:
            print("Review %d of %d" % (counter, len(reviews)))
        # Call the function (defined above) that makes average feature vectors
        reviewFeatureVecs[counter] = makeFeatureVec(review, model, num_features)
        # Increment the counter
        counter = counter + 1
    return reviewFeatureVecs
def getCleanReviews(reviews):
    clean_reviews = []
    for review in reviews["review"]:
        clean_reviews.append(KaggleWord2VecUtility.review_to_wordlist(review, remove_stopwords=True))
    return clean_reviews
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s',level=logging.INFO)
# Set values for various parameters
num_features = 300 # Word vector dimensionality
min_word_count = 40 # Minimum word count
num_workers = 4 # Number of threads to run in parallel
context = 10 # Context window size
downsampling = 1e-3 # Downsample setting for frequent words
# Initialize and train the model (this will take some time)
print ("Training Word2Vec model...")
# gensim 3.x API: in gensim 4.x the `size` parameter is named `vector_size`
model = Word2Vec(sentences, workers=num_workers,
                 size=num_features, min_count=min_word_count,
                 window=context, sample=downsampling, seed=1)
Training Word2Vec model...
2019-03-14 01:01:24,249 : INFO : collecting all words and their counts
2019-03-14 01:01:24,250 : INFO : PROGRESS: at sentence #0, processed 0 words, keeping 0 word types
...
2019-03-14 01:01:29,940 : INFO : PROGRESS: at sentence #790000, processed 17675168 words, keeping 123067 word types
2019-03-14 01:01:29,983 : INFO : collected 123505 word types from a corpus of 17798269 raw words and 795538 sentences
2019-03-14 01:01:29,984 : INFO : Loading a fresh vocabulary
2019-03-14 01:01:30,088 : INFO : effective_min_count=40 retains 16490 unique words (13% of original 123505, drops 107015)
2019-03-14 01:01:30,089 : INFO : effective_min_count=40 leaves 17239123 word corpus (96% of original 17798269, drops 559146)
2019-03-14 01:01:30,148 : INFO : deleting the raw counts dictionary of 123505 items
2019-03-14 01:01:30,153 : INFO : sample=0.001 downsamples 48 most-common words
2019-03-14 01:01:30,154 : INFO : downsampling leaves estimated 12749797 word corpus (74.0% of prior 17239123)
2019-03-14 01:01:30,228 : INFO : estimated required memory for 16490 words and 300 dimensions: 47821000 bytes
2019-03-14 01:01:30,229 : INFO : resetting layer weights
2019-03-14 01:01:30,442 : INFO : training model with 4 workers on 16490 vocabulary and 300 features, using sg=0 hs=0 sample=0.001 negative=5 window=10
2019-03-14 01:01:31,493 : INFO : EPOCH 1 - PROGRESS: at 2.39% examples, 295992 words/s, in_qsize 7, out_qsize 0
...
2019-03-14 01:02:10,617 : INFO : EPOCH - 1 : training on 17798269 raw words (12750264 effective words) took 40.2s, 317459 effective words/s
...
2019-03-14 01:02:50,634 : INFO : EPOCH - 2 : training on 17798269 raw words (12749055 effective words) took 40.0s, 318686 effective words/s
...
2019-03-14 01:03:30,135 : INFO : EPOCH - 3 : training on 17798269 raw words (12747238 effective words) took 39.5s, 322804 effective words/s
...
2019-03-14 01:03:42,398 : INFO : EPOCH 4 - PROGRESS: at 30.92% examples, 319713 words/s, in_qsize 7, out_qsize 0
2019-03-14 01:03:43,429 : INFO : EPOCH 4 - PROGRESS: at 33.55% examples, 319605 words/s, in_qsize 6, out_qsize 1
2019-03-14 01:03:44,455 : INFO : EPOCH 4 - PROGRESS: at 36.11% examples, 319683 words/s, in_qsize 7, out_qsize 0
2019-03-14 01:03:45,503 : INFO : EPOCH 4 - PROGRESS: at 38.73% examples, 319780 words/s, in_qsize 6, out_qsize 1
2019-03-14 01:03:46,507 : INFO : EPOCH 4 - PROGRESS: at 41.31% examples, 320235 words/s, in_qsize 7, out_qsize 0
2019-03-14 01:03:47,510 : INFO : EPOCH 4 - PROGRESS: at 43.81% examples, 320302 words/s, in_qsize 7, out_qsize 0
2019-03-14 01:03:48,512 : INFO : EPOCH 4 - PROGRESS: at 46.34% examples, 320329 words/s, in_qsize 7, out_qsize 0
2019-03-14 01:03:49,531 : INFO : EPOCH 4 - PROGRESS: at 48.93% examples, 320837 words/s, in_qsize 7, out_qsize 0
2019-03-14 01:03:50,540 : INFO : EPOCH 4 - PROGRESS: at 51.42% examples, 320399 words/s, in_qsize 7, out_qsize 0
2019-03-14 01:03:51,562 : INFO : EPOCH 4 - PROGRESS: at 53.99% examples, 320452 words/s, in_qsize 7, out_qsize 0
2019-03-14 01:03:52,582 : INFO : EPOCH 4 - PROGRESS: at 56.55% examples, 320537 words/s, in_qsize 7, out_qsize 0
2019-03-14 01:03:53,582 : INFO : EPOCH 4 - PROGRESS: at 58.95% examples, 320295 words/s, in_qsize 7, out_qsize 0
2019-03-14 01:03:54,594 : INFO : EPOCH 4 - PROGRESS: at 61.58% examples, 320778 words/s, in_qsize 7, out_qsize 0
2019-03-14 01:03:55,595 : INFO : EPOCH 4 - PROGRESS: at 64.17% examples, 321071 words/s, in_qsize 7, out_qsize 0
2019-03-14 01:03:56,620 : INFO : EPOCH 4 - PROGRESS: at 66.80% examples, 321346 words/s, in_qsize 7, out_qsize 0
2019-03-14 01:03:57,659 : INFO : EPOCH 4 - PROGRESS: at 69.31% examples, 321114 words/s, in_qsize 7, out_qsize 1
2019-03-14 01:03:58,652 : INFO : EPOCH 4 - PROGRESS: at 71.82% examples, 321039 words/s, in_qsize 7, out_qsize 0
2019-03-14 01:03:59,657 : INFO : EPOCH 4 - PROGRESS: at 74.46% examples, 321498 words/s, in_qsize 7, out_qsize 0
2019-03-14 01:04:00,678 : INFO : EPOCH 4 - PROGRESS: at 76.93% examples, 321047 words/s, in_qsize 7, out_qsize 0
2019-03-14 01:04:01,688 : INFO : EPOCH 4 - PROGRESS: at 79.46% examples, 320972 words/s, in_qsize 7, out_qsize 0
2019-03-14 01:04:02,718 : INFO : EPOCH 4 - PROGRESS: at 82.04% examples, 320923 words/s, in_qsize 6, out_qsize 1
2019-03-14 01:04:03,745 : INFO : EPOCH 4 - PROGRESS: at 84.62% examples, 320905 words/s, in_qsize 6, out_qsize 2
2019-03-14 01:04:04,767 : INFO : EPOCH 4 - PROGRESS: at 87.26% examples, 321151 words/s, in_qsize 7, out_qsize 0
2019-03-14 01:04:05,779 : INFO : EPOCH 4 - PROGRESS: at 89.85% examples, 321475 words/s, in_qsize 8, out_qsize 0
2019-03-14 01:04:06,798 : INFO : EPOCH 4 - PROGRESS: at 92.45% examples, 321522 words/s, in_qsize 7, out_qsize 0
2019-03-14 01:04:07,823 : INFO : EPOCH 4 - PROGRESS: at 95.11% examples, 321694 words/s, in_qsize 7, out_qsize 0
2019-03-14 01:04:08,832 : INFO : EPOCH 4 - PROGRESS: at 97.62% examples, 321613 words/s, in_qsize 7, out_qsize 0
2019-03-14 01:04:09,697 : INFO : worker thread finished; awaiting finish of 3 more threads
2019-03-14 01:04:09,713 : INFO : worker thread finished; awaiting finish of 2 more threads
2019-03-14 01:04:09,744 : INFO : worker thread finished; awaiting finish of 1 more threads
2019-03-14 01:04:09,746 : INFO : worker thread finished; awaiting finish of 0 more threads
2019-03-14 01:04:09,746 : INFO : EPOCH - 4 : training on 17798269 raw words (12750455 effective words) took 39.6s, 321950 effective words/s
2019-03-14 01:04:10,813 : INFO : EPOCH 5 - PROGRESS: at 2.50% examples, 306098 words/s, in_qsize 7, out_qsize 0
2019-03-14 01:04:11,831 : INFO : EPOCH 5 - PROGRESS: at 5.14% examples, 317685 words/s, in_qsize 7, out_qsize 0
2019-03-14 01:04:12,855 : INFO : EPOCH 5 - PROGRESS: at 7.77% examples, 318829 words/s, in_qsize 6, out_qsize 1
2019-03-14 01:04:13,858 : INFO : EPOCH 5 - PROGRESS: at 10.37% examples, 321125 words/s, in_qsize 7, out_qsize 0
2019-03-14 01:04:14,910 : INFO : EPOCH 5 - PROGRESS: at 13.09% examples, 322318 words/s, in_qsize 7, out_qsize 0
2019-03-14 01:04:15,934 : INFO : EPOCH 5 - PROGRESS: at 15.73% examples, 323321 words/s, in_qsize 8, out_qsize 1
2019-03-14 01:04:16,962 : INFO : EPOCH 5 - PROGRESS: at 18.41% examples, 323846 words/s, in_qsize 7, out_qsize 1
2019-03-14 01:04:17,995 : INFO : EPOCH 5 - PROGRESS: at 21.12% examples, 324791 words/s, in_qsize 7, out_qsize 0
2019-03-14 01:04:19,026 : INFO : EPOCH 5 - PROGRESS: at 23.82% examples, 325730 words/s, in_qsize 7, out_qsize 0
2019-03-14 01:04:20,044 : INFO : EPOCH 5 - PROGRESS: at 26.41% examples, 325462 words/s, in_qsize 6, out_qsize 1
2019-03-14 01:04:21,047 : INFO : EPOCH 5 - PROGRESS: at 28.98% examples, 325599 words/s, in_qsize 7, out_qsize 0
2019-03-14 01:04:22,071 : INFO : EPOCH 5 - PROGRESS: at 31.67% examples, 325798 words/s, in_qsize 8, out_qsize 0
2019-03-14 01:04:23,090 : INFO : EPOCH 5 - PROGRESS: at 34.27% examples, 325523 words/s, in_qsize 7, out_qsize 0
2019-03-14 01:04:24,096 : INFO : EPOCH 5 - PROGRESS: at 36.85% examples, 325645 words/s, in_qsize 7, out_qsize 0
2019-03-14 01:04:25,133 : INFO : EPOCH 5 - PROGRESS: at 39.36% examples, 324606 words/s, in_qsize 7, out_qsize 0
2019-03-14 01:04:26,177 : INFO : EPOCH 5 - PROGRESS: at 42.02% examples, 324824 words/s, in_qsize 7, out_qsize 0
2019-03-14 01:04:27,206 : INFO : EPOCH 5 - PROGRESS: at 44.56% examples, 324139 words/s, in_qsize 6, out_qsize 1
2019-03-14 01:04:28,233 : INFO : EPOCH 5 - PROGRESS: at 47.22% examples, 324684 words/s, in_qsize 7, out_qsize 0
2019-03-14 01:04:29,240 : INFO : EPOCH 5 - PROGRESS: at 49.71% examples, 324422 words/s, in_qsize 7, out_qsize 0
2019-03-14 01:04:30,261 : INFO : EPOCH 5 - PROGRESS: at 52.32% examples, 324300 words/s, in_qsize 7, out_qsize 0
2019-03-14 01:04:31,280 : INFO : EPOCH 5 - PROGRESS: at 54.81% examples, 323883 words/s, in_qsize 7, out_qsize 0
2019-03-14 01:04:32,298 : INFO : EPOCH 5 - PROGRESS: at 57.31% examples, 323536 words/s, in_qsize 6, out_qsize 1
2019-03-14 01:04:33,339 : INFO : EPOCH 5 - PROGRESS: at 59.78% examples, 322904 words/s, in_qsize 7, out_qsize 0
2019-03-14 01:04:34,355 : INFO : EPOCH 5 - PROGRESS: at 62.30% examples, 322650 words/s, in_qsize 7, out_qsize 0
2019-03-14 01:04:35,356 : INFO : EPOCH 5 - PROGRESS: at 64.89% examples, 322898 words/s, in_qsize 7, out_qsize 0
2019-03-14 01:04:36,394 : INFO : EPOCH 5 - PROGRESS: at 67.51% examples, 322932 words/s, in_qsize 8, out_qsize 0
2019-03-14 01:04:37,418 : INFO : EPOCH 5 - PROGRESS: at 70.10% examples, 322867 words/s, in_qsize 8, out_qsize 0
2019-03-14 01:04:38,469 : INFO : EPOCH 5 - PROGRESS: at 72.78% examples, 323021 words/s, in_qsize 7, out_qsize 0
2019-03-14 01:04:39,482 : INFO : EPOCH 5 - PROGRESS: at 75.42% examples, 323334 words/s, in_qsize 7, out_qsize 0
2019-03-14 01:04:40,522 : INFO : EPOCH 5 - PROGRESS: at 78.00% examples, 323077 words/s, in_qsize 7, out_qsize 0
2019-03-14 01:04:41,525 : INFO : EPOCH 5 - PROGRESS: at 80.64% examples, 323465 words/s, in_qsize 7, out_qsize 0
2019-03-14 01:04:42,536 : INFO : EPOCH 5 - PROGRESS: at 83.11% examples, 323107 words/s, in_qsize 7, out_qsize 1
2019-03-14 01:04:43,542 : INFO : EPOCH 5 - PROGRESS: at 85.68% examples, 323294 words/s, in_qsize 7, out_qsize 0
2019-03-14 01:04:44,553 : INFO : EPOCH 5 - PROGRESS: at 88.29% examples, 323480 words/s, in_qsize 6, out_qsize 1
2019-03-14 01:04:45,568 : INFO : EPOCH 5 - PROGRESS: at 90.87% examples, 323498 words/s, in_qsize 7, out_qsize 0
2019-03-14 01:04:46,576 : INFO : EPOCH 5 - PROGRESS: at 93.46% examples, 323572 words/s, in_qsize 7, out_qsize 0
2019-03-14 01:04:47,584 : INFO : EPOCH 5 - PROGRESS: at 96.03% examples, 323455 words/s, in_qsize 6, out_qsize 1
2019-03-14 01:04:48,607 : INFO : EPOCH 5 - PROGRESS: at 98.56% examples, 323409 words/s, in_qsize 7, out_qsize 0
2019-03-14 01:04:49,098 : INFO : worker thread finished; awaiting finish of 3 more threads
2019-03-14 01:04:49,120 : INFO : worker thread finished; awaiting finish of 2 more threads
2019-03-14 01:04:49,129 : INFO : worker thread finished; awaiting finish of 1 more threads
2019-03-14 01:04:49,139 : INFO : worker thread finished; awaiting finish of 0 more threads
2019-03-14 01:04:49,140 : INFO : EPOCH - 5 : training on 17798269 raw words (12748658 effective words) took 39.4s, 323747 effective words/s
2019-03-14 01:04:49,141 : INFO : training on a 88991345 raw words (63745670 effective words) took 198.7s, 320817 effective words/s
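Each epoch makes one pass over the same 17,798,269 raw words, but only about 12.7M of them are "effective," because words below the minimum-count threshold are dropped and very frequent words are downsampled. A quick, illustrative way to inspect what the model retained (a sketch using gensim 3.x attribute names, not part of the original notebook):
print(model.corpus_count)      # number of sentences seen per training pass
print(len(model.wv.vocab))     # vocabulary kept after min_count filtering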
# If you don't plan to train the model any further, calling
# init_sims will make the model much more memory-efficient.
model.init_sims(replace=True)
2019-03-14 01:04:49,148 : INFO : precomputing L2-norms of word weight vectors
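Conceptually, init_sims(replace=True) L2-normalizes every word vector in place and discards the un-normalized copies, which is why the model cannot be trained further afterwards. A rough hand-rolled equivalent, for illustration only (not gensim's internals):
import numpy as np
vecs = model.wv.vectors                                     # (vocab_size, 300) float32 matrix
unit = vecs / np.linalg.norm(vecs, axis=1, keepdims=True)   # each row now has unit L2 norm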
# It can be helpful to create a meaningful model name and
# save the model for later use. You can load it later using Word2Vec.load()
model_name = "300features_40minwords_10context"
model.save(model_name)
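As the comment above notes, the saved model can be restored in a later session; a minimal sketch:
from gensim.models import Word2Vec
model = Word2Vec.load("300features_40minwords_10context")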
model.wv.doesnt_match("man woman child kitchen".split())
2019-03-14 01:04:49,301 : INFO : saving Word2Vec object under 300features_40minwords_10context, separately None
2019-03-14 01:04:49,305 : INFO : not storing attribute vectors_norm
2019-03-14 01:04:49,308 : INFO : not storing attribute cum_table
2019-03-14 01:04:49,800 : INFO : saved 300features_40minwords_10context
'kitchen'
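doesnt_match correctly flags 'kitchen' as the odd one out. Under the hood it roughly averages the unit-normalized vectors of the input words and returns the word least cosine-similar to that mean. A minimal re-implementation for illustration (odd_one_out is a hypothetical helper, not a gensim function):
import numpy as np

def odd_one_out(words, wv):
    # unit-normalize each word vector (a no-op here, since init_sims already did it)
    vecs = np.array([wv[w] / np.linalg.norm(wv[w]) for w in words])
    mean = vecs.mean(axis=0)
    sims = vecs @ (mean / np.linalg.norm(mean))   # cosine similarity to the mean
    return words[int(np.argmin(sims))]

odd_one_out("man woman child kitchen".split(), model.wv)   # 'kitchen'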
model.wv.doesnt_match("france england germany soccer".split())
'soccer'
model.wv.most_similar("soccer")
[('football', 0.8039368391036987),
('basketball', 0.6595849990844727),
('poker', 0.6242333054542542),
('baseball', 0.6185872554779053),
('coach', 0.6010462641716003),
('sports', 0.5769123435020447),
('hockey', 0.574334442615509),
('shaolin', 0.5512605905532837),
('champions', 0.5357030034065247),
('wrestling', 0.5321739912033081)]
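most_similar ranks the whole vocabulary by cosine similarity to the query vector; because init_sims(replace=True) left unit-length vectors, this reduces to a single matrix-vector product. An illustrative re-implementation (top_similar is a hypothetical helper, assuming gensim 3.x attribute names):
import numpy as np

def top_similar(word, wv, topn=10):
    q = wv[word]                       # already unit length after init_sims
    sims = wv.vectors @ q              # cosine similarity to every vocab word
    order = np.argsort(-sims)          # indices sorted by descending similarity
    return [(wv.index2word[i], float(sims[i]))
            for i in order if wv.index2word[i] != word][:topn]

top_similar("soccer", model.wv)        # should closely match the list above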
model.wv.most_similar("man")
[('woman', 0.6294258832931519),
('lad', 0.6104426383972168),
('lady', 0.5893934965133667),
('monk', 0.5500040650367737),
('farmer', 0.5429803729057312),
('guy', 0.521837592124939),
('person', 0.5184544324874878),
('chap', 0.5139154195785522),
('millionaire', 0.5120488405227661),
('politician', 0.5048956871032715)]
model.wv["computer"]
array([-4.79339510e-02, -2.53810696e-02, -9.83078852e-02, 7.23835602e-02,
-2.23066527e-02, 5.72536923e-02, 3.17422301e-02, -6.21821284e-02,
-3.82531472e-02, -2.85035223e-02, 6.96288198e-02, -8.61958042e-03,
-5.25614843e-02, 8.91598314e-02, 1.33388013e-01, -4.59663719e-02,
5.79963811e-02, 4.03241627e-02, 1.37066737e-01, -1.39122888e-01,
2.03789044e-02, -4.67378348e-02, 1.71173755e-02, -6.14754260e-02,
-8.26872066e-02, 4.17799037e-03, -6.88192695e-02, -4.71851118e-02,
5.73772416e-02, -5.46796173e-02, -3.86515111e-02, 9.65327471e-02,
2.05815546e-02, 2.76276264e-02, -4.25591022e-02, 2.51196809e-02,
3.82702723e-02, 4.93922504e-03, 1.12461247e-01, -1.01153374e-01,
3.35556641e-02, 5.55540156e-03, -3.79679240e-02, -2.24771872e-02,
8.73093233e-02, -5.04509695e-02, -4.05530483e-02, 6.14347458e-02,
8.23021308e-02, 5.65582402e-02, 2.69791781e-05, 4.77958284e-03,
7.75242895e-02, 1.00004807e-01, 5.92902536e-03, -1.05087563e-01,
-3.61691602e-02, -3.15997313e-04, 2.78723650e-02, 1.86387897e-02,
-1.70327332e-02, 7.38388335e-04, -5.49275465e-02, -5.79706319e-02,
5.06388247e-02, 1.26366448e-02, -7.05252960e-02, 1.17502294e-01,
9.84748304e-02, -1.39781563e-02, -3.13345194e-02, -6.53982442e-03,
-5.12062805e-03, 1.38693685e-02, 3.74527611e-02, 2.43657101e-02,
3.43408324e-02, 8.02342147e-02, -1.71090271e-02, -7.04095000e-03,
1.04522845e-02, 3.43474708e-02, -1.40156727e-02, 6.84063807e-02,
3.38155553e-02, 1.56406701e-01, 1.79476216e-02, -4.58403230e-02,
-4.15017596e-03, 2.82820929e-02, 5.71620092e-02, -1.77074708e-02,
-3.98685411e-02, -1.47484854e-01, -7.26067871e-02, 6.60759658e-02,
-1.58032253e-02, -6.50163293e-02, 4.24350947e-02, 1.12647265e-01,
4.10237350e-02, 2.98280325e-02, 3.11948135e-02, -7.24988058e-02,
1.41753890e-02, -3.46501283e-02, 1.00943372e-02, -4.47276309e-02,
-2.15441603e-02, -1.69435084e-01, -2.18652375e-02, 6.05334900e-02,
9.29217041e-02, -4.98547107e-02, 1.08486228e-02, -1.16886824e-01,
-7.03361072e-03, -6.75052553e-02, 5.19460291e-02, -3.68494578e-02,
3.72104347e-02, 4.01992016e-02, -3.65689546e-02, 5.73715456e-02,
-8.58063698e-02, -1.69679038e-02, 1.78947430e-02, -5.47890067e-02,
-1.79170370e-02, 6.18545935e-02, 1.40716946e-02, -5.53099699e-02,
3.28098610e-02, -8.61988366e-02, -1.38060516e-02, -7.77973756e-02,
3.01322360e-02, -2.97213811e-02, -5.12061007e-02, -7.44357333e-02,
-1.18710482e-02, 5.09540029e-02, 5.79989143e-02, -7.21113309e-02,
1.41676925e-02, 1.35432437e-01, -6.59376010e-03, 4.38096523e-02,
-6.01608269e-02, 2.00556521e-03, 1.74984355e-02, 4.49043103e-02,
1.09420130e-02, 8.18869006e-03, 1.52458772e-02, -2.65226234e-02,
-2.01061089e-02, -8.11586305e-02, 9.00103152e-02, 5.94500527e-02,
2.89643300e-03, 2.56725634e-03, -1.78116038e-02, 1.90644264e-02,
-2.26962790e-02, 3.18668643e-03, -7.71036968e-02, 1.93123269e-04,
2.58924272e-02, 3.18960845e-02, -8.47435892e-02, 3.89174446e-02,
-9.14918333e-02, -3.47692855e-02, -1.05178788e-01, -7.81329945e-02,
-2.56739929e-02, -2.41342392e-02, -5.45198135e-02, -3.73307355e-02,
2.69567072e-02, 4.54399176e-03, -4.00420465e-02, 5.33767864e-02,
-6.76332489e-02, 3.17393392e-02, 5.90901449e-02, 9.45589691e-02,
3.21563855e-02, 2.33184472e-02, 6.14998117e-03, 4.27653221e-03,
-5.73230572e-02, -5.30507602e-03, -8.49941671e-02, -2.04346944e-02,
9.48881656e-02, 5.88757657e-02, -1.11293055e-01, 1.90836471e-02,
-4.70569022e-02, -4.22467291e-02, 3.86966486e-03, -9.70810875e-02,
-7.43101314e-02, 5.24649993e-02, -3.29410285e-02, -2.38991026e-02,
-3.36377881e-02, 5.44135012e-02, -9.94117185e-02, 2.49974765e-02,
-4.42564599e-02, 8.74478966e-02, -3.85684483e-02, 3.34127881e-02,
8.24790299e-02, 5.33424094e-02, 1.31256990e-02, 2.96878815e-02,
2.22183373e-02, -2.11739577e-02, 8.04685652e-02, -3.71479578e-02,
7.73914915e-04, -4.12550308e-02, 5.00466563e-02, 2.51460876e-02,
3.03647742e-02, -1.26310617e-01, -1.64576340e-02, 6.23791106e-02,
5.53251542e-02, -7.69763142e-02, -6.70548202e-03, -6.63869753e-02,
1.24717746e-02, -1.68166086e-01, 1.68424156e-02, -6.28778115e-02,
9.77337211e-02, 4.71105427e-02, 8.92058201e-03, -1.37098841e-02,
3.06127220e-02, 1.73426419e-02, -2.89770309e-02, -1.85693940e-03,
-1.15272313e-04, -3.52074020e-02, -3.21245864e-02, -8.34794790e-02,
2.75801215e-02, -5.00087887e-02, 4.92503829e-02, -9.92316827e-02,
-2.22996939e-02, 1.18917022e-02, 3.88455503e-02, 1.20371632e-01,
5.02688885e-02, -1.04167230e-01, 4.91058379e-02, -1.05557941e-01,
-1.98254809e-02, -4.02186513e-02, 7.50040542e-03, -1.05268778e-02,
-7.07997382e-03, 9.48763639e-02, 4.84618731e-02, -3.02425772e-02,
6.50335848e-02, -3.24018374e-02, -3.04284208e-02, -9.73176360e-02,
6.17954507e-03, 6.41165003e-02, -6.53638914e-02, 3.88511792e-02,
1.19865581e-01, 6.39352277e-02, 3.71861272e-02, -6.28246441e-02,
-9.44986590e-04, 4.55985926e-02, 4.20346893e-02, -5.09408861e-02,
5.30342124e-02, 2.55930256e-02, 1.57046262e-02, 1.00914791e-01,
-5.51233701e-02, -5.27993068e-02, -1.68002993e-02, -6.69082999e-02,
-1.73713006e-02, 7.25108683e-02, -7.00484961e-02, 3.51085626e-02],
dtype=float32)
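The returned array is the learned embedding for "computer": one float32 value per dimension, 300 in total, matching the "300features" in the model name. A quick sanity check:
vec = model.wv["computer"]
print(vec.shape, vec.dtype)   # (300,) float32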
model.wv.most_similar("car")
[('truck', 0.7517951726913452),
('jeep', 0.6925300359725952),
('bus', 0.6857529282569885),
('train', 0.6580698490142822),
('bike', 0.6402906179428101),
('plane', 0.6360883712768555),
('boat', 0.6104196310043335),
('helicopter', 0.6086692214012146),
('cars', 0.5979301929473877),
('chevy', 0.5919524431228638)]