Document Retrieval: A Comparison of Simple Word Count and Term Frequency-Inverse Document Frequency

This post demonstrates the difference between simple word count and term frequency-inverse document frequency(tf-idf) methods in document retrieval using an example. We use document retrieval method to retrieve /recommend/ similar articles with the article of interest.

Word count approach counts the number of words found in a document and retrieve similar documents based on frequent words commonly available in the documents whereas tf-idf magnifies the most important words that are common locally and rare globally. In other words, important words are informative words that describe the unique aspect of a document.

Mathematically, we can compute term frequency (simple word count) and tf-idf as:

$$tf = f_{w,d} $$$$idf= log(N /(1+D))$$$$tfidf= tf* idf $$

Where: tf is term frequency (word count); $f_{w,d} $ is the frequency of a word (w) in a document(d); idf is inverse document frequency (one is added in the denominator to avoid division by zero if the word is not in the corpus); N is the total number of documents in the corpus; D is the total number of documents containing the word; tfidf is term frequency-inverse document frequency.

Loading the data

The data was obtained from Machine Learning Foundation: A Case Study Approach course offered by University of Washington on Coursera. The analysis was performed using graphlab library, a python library that performs a range of machine learning tasks. For more information about graphlab, refer here.

The data contains three columns; url, name and text to refer to the link to wikipedia article, the name of the person and text of article, respectively. There are 59,071 observations.

In [1]:
import graphlab
In [3]:
people= graphlab.SFrame('people_wiki.gl') # load the data 
In [4]:
people.shape # the dimension of the data frame
Out[4]:
(59071, 3)
In [5]:
people.head(2) # the first two observations 
Out[5]:
URI name text
<http://dbpedia.org/resou
rce/Digby_Morrell> ...
Digby Morrell digby morrell born 10
october 1979 is a former ...
<http://dbpedia.org/resou
rce/Alfred_J._Lewy> ...
Alfred J. Lewy alfred j lewy aka sandy
lewy graduated from ...
[2 rows x 3 columns]

Counting the words in each article and computing their tf-idf

The following two graphlab functions perform two tasks. The first counts the words in the text column of the data and the second computes tf-idf. The results are then added to the original data as new columns.

In [6]:
people["word_count"]=graphlab.text_analytics.count_words(people["text"])
people["tf-idf"]= graphlab.text_analytics.tf_idf(people["word_count"])
In [7]:
people.head(2) # the new data frame with word count and tfidf included 
Out[7]:
URI name text word_count
<http://dbpedia.org/resou
rce/Digby_Morrell> ...
Digby Morrell digby morrell born 10
october 1979 is a former ...
{'selection': 1,
'carltons': 1, 'being': ...
<http://dbpedia.org/resou
rce/Alfred_J._Lewy> ...
Alfred J. Lewy alfred j lewy aka sandy
lewy graduated from ...
{'precise': 1, 'thomas':
1, 'closely': 1, ...
tf-idf
{'selection':
3.836578553093086, ...
{'precise':
6.44320060695519, ...
[2 rows x 5 columns]

Case Example: Retrieving articles similar to Hillary Clinton

In order to explore the difference between document retrieval with simple word count and tf-idf, I randomly picked an article written about the 67th United States Secretary of State, Hillary Clinton. Please note that the information on the articles may not reflect the current status of the people.

In [8]:
HClinton = people[people["name"]=="Hillary Rodham Clinton"] # Querying information about Hillary Clinton
In [9]:
HClinton
Out[9]:
URI name text word_count
<http://dbpedia.org/resou
rce/Hillary_Rodham_Cl ...
Hillary Rodham Clinton hillary diane rodham
clinton hlri dan rdm ...
{'serving': 1, 'office':
1, 'diplomacy': 1, ...
tf-idf
{'serving':
2.8470548673505855, ...
[? rows x 5 columns]
Note: Only the head of the SFrame is printed. This SFrame is lazily evaluated.
You can use sf.materialize() to force materialization.
In [12]:
# The first few lines of Hillary Clinton's article
HClinton["text"][0][0:800]  
Out[12]:
'hillary diane rodham clinton hlri dan rdm klntn born october 26 1947 is a former united states secretary of state us senator and first lady of the united states from 2009 to 2013 she was the 67th secretary of state serving under president barack obama she previously represented new york in the us senate 2001 to 2009 before that as the wife of the 42nd president of the united states bill clinton she was first lady from 1993 to 2001 in the 2008 election clinton was a leading candidate for the democratic presidential nominationa native of illinois hillary rodham was the first student commencement speaker at wellesley college in 1969 and earned a jd from yale law school in 1973 after a brief stint as a congressional legal counsel she moved to arkansas and married bill clinton in 1975 rodham co'

Looking into the most frequent words

As you can see from the table below, "the", "in" and "of" are the most frequent words in Hillary Clinton's article and they are not informative to retrieve similar documents.

In [13]:
# convert the word count dictionary into a table having two columns, word and count.
hclinton_wc= HClinton[['word_count']].stack('word_count', new_column_name=['word', 'count']) 

# Word count of Hillary Clintons's article in descending order 
hclinton_wc.sort('count', ascending=False)
Out[13]:
word count
the 47
in 24
of 21
to 18
and 17
she 15
clinton 12
first 11
was 10
as 10
[285 rows x 2 columns]
Note: Only the head of the SFrame is printed.
You can use print_rows(num_rows=m, num_columns=n) to print more rows and columns.

Looking into the most informative words

Informative words describe the uniqueness of a document for which similar documents will be retrieved. The tradeoff between rare globally and common locally can be balanced by weighing words by their tf-idf score. Words with large tf-idf score are more informative than words with low tf-idf score. In the article of Hillary Clinton, "clinton", "lady" and "she" are the most informative words.

In [15]:
# convert the "tf-idf" dictionary into a table having two columns, word and tf-idf.
hclinton_tfidf=HClinton[["tf-idf"]].stack("tf-idf",new_column_name=["word", "tf-idf"])

# Sort by tf-idf in descending order.
hclinton_tfidf.sort('tf-idf', ascending=False)
Out[15]:
word tf-idf
clinton 54.5083695903
lady 29.9147743198
she 23.7298085726
rodham 23.0719755697
arkansas 15.4462948262
secretary 14.9762072969
female 14.0107548062
us 13.5239331416
primaries 13.3380145514
hillary 12.005777535
[285 rows x 2 columns]
Note: Only the head of the SFrame is printed.
You can use print_rows(num_rows=m, num_columns=n) to print more rows and columns.

Building nearest neighbors models with word count and tf-idf

In this post, cosine similarity was used to compute the similarity of documents in the corpus. It is the most commonly used similarity measure in information retrieval. Graphlab computes the cosine distance which is one minus the cosine similarity. The smaller the cosine distance, the similar the documents are.

In [16]:
# knn model using word_count as feature 
knn_model_wc = graphlab.nearest_neighbors.create(people, features=['word_count'], label='name',distance='cosine')  
Starting brute force nearest neighbors model training.
In [17]:
# knn model using tfidf as feature
knn_model_tfidf = graphlab.nearest_neighbors.create(people, features=['tf-idf'], label='name',distance='cosine')  
Starting brute force nearest neighbors model training.

Retrieving similar articles to Hillary Clinton using word count

When I use word frequency as a feature to make neighbor search, I found the articles of Adrienne Corri, Maria Damanaki, Diana B. Henriques and Raylawni Branch close to Hillary Clinton’s article.

In [18]:
# the top four similar articles with Hillary Clinton using word_count as feature
query_wc=knn_model_wc.query(HClinton, verbose=False)
query_wc
Out[18]:
query_label reference_label distance rank
0 Hillary Rodham Clinton 0.0 1
0 Adrienne Corri 0.139590411438 2
0 Maria Damanaki 0.144401350006 3
0 Diana B. Henriques 0.14716344102 4
0 Raylawni Branch 0.147796309614 5
[5 rows x 4 columns]

The following line of code prints the name of the people and the first few line of their article that are suggested as similar with Hillary Clinton's article including hers.

As we can see from the text, the articles retrieved using word count are not very similar with Hillary Clinton. For example, the closest article is that of Adrienne Corri while she is an actress, Maria Damanaki is the global managing director for oceans and Diana B. Henriques is an American financial journalist and so on.

In [19]:
similar_wc= query_wc["reference_label"]
for i in range(len(similar_wc)):
    print("{name}: {article}\n".format(name=similar_wc[i], article=people[people["name"]==similar_wc[i]]["text"][0][0:200]))
Hillary Rodham Clinton: hillary diane rodham clinton hlri dan rdm klntn born october 26 1947 is a former united states secretary of state us senator and first lady of the united states from 2009 to 2013 she was the 67th secr

Adrienne Corri: adrienne corri born 13 november 1931 glasgow scotland is a british actress she was born adrienne riccoboni the daughter of olive smethurst and italian father luigi riccoboni sometimes spelt reccobini 

Maria Damanaki: maria damanaki greek is the global managing director for oceans at the nature conservancydamanaki leads a global team focused on how the world manages its oceans including sustainable fisheries manage

Diana B. Henriques: diana blackmon henriques born december 1948 is an american financial journalist and author working in new york city since 1989 she has been a reporter on the staff of the new york times working on sta

Raylawni Branch: raylawni branch born 1941 hattiesburg forrest county mississippi united states is a black mississippi pioneer of the africanamerican civil rights movement 19551968 a professional nursing educator and 

Retrieving similar articles to Hillary Clinton using tfidf

The tf-idf model suggested that the articles of Bill Clinton, Melanne Verveer, Barack Obama and Sheffield Nelson are the top four closest articles to Hillary Clinton.

In [20]:
#similar articles to Hillary Clinton using tfidf as feature
query_tfidf=knn_model_tfidf.query(HClinton,verbose=False)
query_tfidf
Out[20]:
query_label reference_label distance rank
0 Hillary Rodham Clinton 0.0 1
0 Bill Clinton 0.56949689935 2
0 Melanne Verveer 0.740283986008 3
0 Barack Obama 0.758358397887 4
0 Sheffield Nelson 0.777915202357 5
[5 rows x 4 columns]

The articles retrieved using tfidf make more sense. The closest article with hers is that of Bill Clinton as he is her spouse and politician. Other people listed above such as Barack Obama are politicians and may have more similarity with her.

In [21]:
# Few lines of the articles of people similar to that of Hillary Clinton including hers.
similar_tfidf= query_tfidf["reference_label"]
for i in range(len(similar_tfidf)):
    print("{name}: {article}\n".format(name=similar_tfidf[i], article=people[people["name"]==similar_tfidf[i]]["text"][0][0:170]))
Hillary Rodham Clinton: hillary diane rodham clinton hlri dan rdm klntn born october 26 1947 is a former united states secretary of state us senator and first lady of the united states from 2009

Bill Clinton: william jefferson bill clinton born william jefferson blythe iii august 19 1946 is an american politician who served from 1993 to 2001 as the 42nd president of the united

Melanne Verveer: melanne verveer is the executive director of the georgetown institute for women peace and security at georgetown university and a founding partner of seneca point global 

Barack Obama: barack hussein obama ii brk husen bm born august 4 1961 is the 44th and current president of the united states and the first african american to hold the office born in h

Sheffield Nelson: e sheffield nelson born april 23 1941 is an american attorney businessman and politician from little rock arkansas originally a democrat nelson in 1990 ran for governor o

Conclusion

Analyzing document similarity just by word count can be misleading as common words such as "the”, “of" and "in" may dominate rare informative words. Term frequency-inverse document frequency(tf-idf) addresses the problem by magnifying the most important words in the document and helps better retrieve similar documents. As we can see in the example of Hillary Clinton's article, simple word count retrieves the articles of people who are actress, director of ocean, financial journalist and so on. On the other hand, the articles retrieved using tfidf are that of politicians and Bill Clinton's article was the first to be suggested. From our common understanding, we can validate the outcome that someone reading about Hillary Clinton may be interested to read about Bill Clinton (her spouse) and other politicians including Barack Obama as retrieved using tf-idf.