
42. Introduction to Text Mining in Python

These exercises were adapted from Mining the Social Web, 2nd Edition. The original material is governed by the Simplified BSD License.

42.1. Key Terms for Text Mining

  • Corpus – a collection of documents

  • Document – a single piece of text

  • Term/token – a word within a document

  • Entity – a person, place, or organization mentioned in a document (see the sketch after this list)
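
Entities won't come up again in this section, but as a quick illustration, here is a minimal, hypothetical sketch that flags entity candidates with a naive capitalization heuristic; real entity recognition uses trained models (e.g., spaCy or NLTK).

import re

# Hypothetical heuristic: treat runs of capitalized words as entity candidates.
# Only a sketch; it will both miss and over-match real entities.
text = "Mr. Green killed Colonel Mustard in the study with the candlestick."
print(re.findall(r"(?:[A-Z][a-z]+\.?\s)+[A-Z][a-z]+", text))
# ['Mr. Green', 'Colonel Mustard']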

corpus = {
    'a': "Mr. Green killed Colonel Mustard in the study with the candlestick. \
Mr. Green is not a very nice fellow.",
    'b': "Professor Plum has a green plant in his study.",
    'c': "Miss Scarlett watered Professor Plum's green plant while he was away \
from his office last week."
}

# This separates the documents (sentences) into terms/tokens/words.
terms = {
    'a': [i.lower() for i in corpus['a'].split()],
    'b': [i.lower() for i in corpus['b'].split()],
    'c': [i.lower() for i in corpus['c'].split()]
}
terms
{'a': ['mr.',
  'green',
  'killed',
  'colonel',
  'mustard',
  'in',
  'the',
  'study',
  'with',
  'the',
  'candlestick.',
  'mr.',
  'green',
  'is',
  'not',
  'a',
  'very',
  'nice',
  'fellow.'],
 'b': ['professor',
  'plum',
  'has',
  'a',
  'green',
  'plant',
  'in',
  'his',
  'study.'],
 'c': ['miss',
  'scarlett',
  'watered',
  'professor',
  "plum's",
  'green',
  'plant',
  'while',
  'he',
  'was',
  'away',
  'from',
  'his',
  'office',
  'last',
  'week.']}
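
Notice that the naive whitespace split leaves punctuation attached to tokens ('candlestick.', 'study.', 'week.'), so 'study' and 'study.' would count as different terms. Below is a minimal sketch of a regex tokenizer that strips punctuation instead; the rest of this section keeps the naive split, which is why the query term below is 'mr.' with its period.

import re

# Match runs of letters, optionally with an internal apostrophe (e.g. "plum's").
def tokenize(doc):
    return re.findall(r"[a-z]+(?:'[a-z]+)?", doc.lower())

tokenize(corpus['c'])[:5]
# ['miss', 'scarlett', 'watered', 'professor', "plum's"]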

42.2. Term Frequency

  • A very common measure is how frequently a word or term occurs within a document (term frequency, or TF).

  • This is how early web search engines worked (not very well).

  • A common basic standardization is to divide the raw count by the total number of words in the document, so long documents don't dominate just by being long.

from math import log

# These are the query terms we would like to score.
QUERY_TERMS = ['mr.', 'green']

# This calculates the term frequency, optionally normalized by document length.
def tf(term, doc, normalize):
    doc = doc.lower().split()
    if normalize:
        return doc.count(term.lower()) / float(len(doc))
    else:
        return doc.count(term.lower()) / 1.0
 
# Print the documents. We can see that Mr. Green appears only in the first document.
for (k, v) in sorted(corpus.items()):
    print(k, ':', v)
print('\n')
    
a : Mr. Green killed Colonel Mustard in the study with the candlestick. Mr. Green is not a very nice fellow.
b : Professor Plum has a green plant in his study.
c : Miss Scarlett watered Professor Plum's green plant while he was away from his office last week.
# Compute term frequencies for each query term (normalized and unnormalized).

# Check each query term...
for term in [t.lower() for t in QUERY_TERMS]:
    # ...against each document in the corpus.
    for doc in sorted(corpus):
        print('TF(%s): %s' % (doc, term), tf(term, corpus[doc], True))

print('\n')  # Skip a line.
print("This does the same thing but unnormalized.")
for term in [t.lower() for t in QUERY_TERMS]:
    for doc in sorted(corpus):
        print('TF(%s): %s' % (doc, term), tf(term, corpus[doc], False))
TF(a): mr. 0.10526315789473684
TF(b): mr. 0.0
TF(c): mr. 0.0
TF(a): green 0.10526315789473684
TF(b): green 0.1111111111111111
TF(c): green 0.0625


This does the same thing but unnormalized.
TF(a): mr. 2.0
TF(b): mr. 0.0
TF(c): mr. 0.0
TF(a): green 2.0
TF(b): green 1.0
TF(c): green 1.0
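
These numbers are easy to verify by hand: document a has 19 tokens and contains 'mr.' twice, so its normalized TF is 2/19 ≈ 0.105; document c has 16 tokens and one 'green', giving 1/16 = 0.0625. A quick sanity check with the tf() function defined above:

# Sanity-check the normalized values printed above.
assert abs(tf('mr.', corpus['a'], True) - 2.0 / 19) < 1e-12   # 2 of 19 tokens
assert abs(tf('green', corpus['c'], True) - 1.0 / 16) < 1e-12  # 1 of 16 tokens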

42.3. TF-IDF

  • TF-IDF incorporates the inverse document frequency (IDF) into the analysis. This factor limits the impact of frequent words that show up in a large number of documents.

  • The TF-IDF calculation multiplies the IDF against a TF value that is less than 1, so it's necessary for the IDF to return a value of at least 1 for consistent scoring. (Multiplying two values that are each less than 1 yields a value smaller than either; see the worked example after this list.)
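
Concretely, the idf() function below computes idf(term) = 1 + ln(N / df(term)), where N is the number of documents and df(term) is the number of documents containing the term:

from math import log

# 'mr.' appears in 1 of the 3 documents; 'green' appears in all 3.
print(1.0 + log(3.0 / 1))  # 2.0986... (rare term, boosted)
print(1.0 + log(3.0 / 3))  # 1.0 (appears everywhere, no boost)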

def idf(term, corpus):
    # Count the documents that contain the term.
    num_texts_with_term = len([True for text in corpus if term.lower()
                              in text.lower().split()])
    try:
        # Adding 1 keeps the result at least 1, so tf-idf scores stay comparable.
        return 1.0 + log(float(len(corpus)) / num_texts_with_term)
    except ZeroDivisionError:
        # The term appears in no document at all.
        return 1.0

    
for term in [t.lower() for t in QUERY_TERMS]:
    print('IDF: %s' % (term, ), idf(term, corpus.values()))
        
IDF: mr. 2.09861228866811
IDF: green 1.0
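
For 'mr.' in document a, the product is 0.10526... × 2.09861... ≈ 0.22091, which matches the TF-IDF score printed further below. A quick check using the functions already defined:

print(tf('mr.', corpus['a'], True) * idf('mr.', corpus.values()))  # 0.2209...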
# TF-IDF just multiplies the two values together.
def tf_idf(term, doc, corpus):
    return tf(term, doc, True) * idf(term, corpus)

# Accumulate a TF-IDF score per document across all query terms.
query_scores = {'a': 0, 'b': 0, 'c': 0}
for term in [t.lower() for t in QUERY_TERMS]:
    for doc in sorted(corpus):
        print('TF(%s): %s' % (doc, term), tf(term, corpus[doc], True))
    print('IDF: %s' % (term, ), idf(term, corpus.values()))
    print('\n')

    for doc in sorted(corpus):
        score = tf_idf(term, corpus[doc], corpus.values())
        print('TF-IDF(%s): %s' % (doc, term), score)
        query_scores[doc] += score
        print('\n')

print("Overall TF-IDF scores for query '%s'" % (' '.join(QUERY_TERMS), ))
for (doc, score) in sorted(query_scores.items()):
    print(doc, score)
TF(a): mr. 0.10526315789473684
TF(b): mr. 0.0
TF(c): mr. 0.0
IDF: mr. 2.09861228866811


TF-IDF(a): mr. 0.22090655670190631


TF-IDF(b): mr. 0.0


TF-IDF(c): mr. 0.0


TF(a): green 0.10526315789473684
TF(b): green 0.1111111111111111
TF(c): green 0.0625
IDF: green 1.0


TF-IDF(a): green 0.10526315789473684


TF-IDF(b): green 0.1111111111111111


TF-IDF(c): green 0.0625


Overall TF-IDF scores for query 'mr. green'
a 0.3261697145966431
b 0.1111111111111111
c 0.0625
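
In practice you would rarely hand-roll TF-IDF. As a closing sketch (assuming a recent scikit-learn is installed), the same corpus can be vectorized with TfidfVectorizer. Note that its default formula differs from the one above (smoothed IDF and L2-normalized rows), so the numbers will not match the hand computation exactly.

from sklearn.feature_extraction.text import TfidfVectorizer

docs = [corpus['a'], corpus['b'], corpus['c']]
vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform(docs)  # shape: (3 documents, vocabulary size)
print(vectorizer.get_feature_names_out())      # the learned vocabulary
print(tfidf_matrix.toarray().round(3))         # one row of tf-idf weights per document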