Commit 867000bd by Steven Bird

added gensim example, cf #971

parent c9e836a5
.. Copyright (C) 2001-2015 NLTK Project
.. For license information, see LICENSE.TXT
==================================================
Testing word embeddings with the Gensim package
==================================================
>>> import gensim
Overview
~~~~~~~~
Using the Gensim package, we demonstrate three functions:

- Train word embeddings on the Brown corpus.
- Load a pre-trained model and perform simple tasks with it.
- Prune the pre-trained binary model.
Train the model
~~~~~~~~~~~~~~~~~~
The word embeddings are trained on the Brown corpus:
>>> from nltk.corpus import brown
>>> model = gensim.models.Word2Vec(brown.sents())
It might take some time to train the model. Once the model is trained, you may want to save it so that you can load it again later:
>>> model.save('brown.embedding')
>>> new_model = gensim.models.Word2Vec.load('brown.embedding')
The model maps each word in its vocabulary to an embedding vector. We can easily get the vector representation of a word:
>>> len(new_model['university'])
100
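
The returned embedding is a plain NumPy array, so it can be used directly in any numerical code. A minimal check (under the usual assumption that Gensim stores its vectors as 32-bit floats)::

    vec = new_model['university']
    print(type(vec))     # <class 'numpy.ndarray'>
    print(vec.dtype)     # typically float32
    print(vec.shape)     # (100,) -- the default Word2Vec dimensionality
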
Gensim already provides a number of supporting functions for manipulating word embeddings. For example, to compute the cosine similarity between two words:
>>> new_model.similarity('university','school') > 0.3
True
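
Under the hood, `similarity` is the cosine of the angle between the two word vectors. A minimal sketch that recomputes it directly with NumPy; the result should agree with `new_model.similarity('university', 'school')` up to floating-point error::

    import numpy as np

    v1 = new_model['university']
    v2 = new_model['school']
    # Cosine similarity: dot product divided by the product of the vector norms.
    cosine = np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2))
    print(cosine)
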
Using the pre-trained model
~~~~~~~~~~~~~~~~~~~~~~~~~~~
NLTK also includes a pre-trained model which is part of a model trained on 100 billion words from the Google News dataset.
The full model is available from https://code.google.com/p/word2vec/ and is about 3 GB.
>>> from nltk.corpus import word2vec_sample
>>> model = gensim.models.Word2Vec.load(word2vec_sample)
We pruned the model to include only the most common words (~44k words).
>>> len(model.vocab)
43981
Each word is represented as a vector in a 300-dimensional space:
>>> len(model['university'])
300
Finding the top n words that are most similar to a target word is simple. The result is a list of the n words together with their similarity scores:
>>> model.most_similar(positive=['university'], topn = 3)
[(u'universities', 0.7003918886184692), (u'faculty', 0.6780908703804016), (u'undergraduate', 0.6587098240852356)]
Finding the word that does not belong in a list is also supported, although implementing this yourself is simple, as sketched below:
>>> model.doesnt_match('breakfast cereal dinner lunch'.split())
'cereal'
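
As a rough sketch of how one might implement this by hand (not Gensim's exact algorithm): normalise the vectors, average them, and return the word whose vector is least similar to that mean::

    import numpy as np

    def my_doesnt_match(model, words):
        # Normalise each word vector so that dot products are cosine similarities.
        vectors = np.array([model[w] / np.linalg.norm(model[w]) for w in words])
        mean = vectors.mean(axis=0)
        # The odd one out is the word least similar to the mean of the group.
        similarities = vectors.dot(mean)
        return words[int(np.argmin(similarities))]

    my_doesnt_match(model, 'breakfast cereal dinner lunch'.split())
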
Mikolov et al. (2013) showed that word embeddings capture many syntactic and semantic regularities. For example,
the vector 'King - Man + Woman' is close to 'Queen', and 'Germany - Berlin + Paris' is close to 'France'.
>>> model.most_similar(positive=['woman','king'], negative=['man'], topn = 1)
[(u'queen', 0.7118192911148071)]
>>> model.most_similar(positive=['Paris','Germany'], negative=['Berlin'], topn = 1)
[(u'France', 0.7884092926979065)]
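
The same analogy can be spelled out with explicit vector arithmetic. The sketch below scans the whole vocabulary, so it is much slower than `most_similar`, and because Gensim normalises the query vectors before combining them, the scores will not match exactly::

    import numpy as np

    def analogy(model, a, b, c, topn=1):
        # Form the query vector b - a + c, e.g. 'king' - 'man' + 'woman'.
        target = model[b] - model[a] + model[c]
        target = target / np.linalg.norm(target)
        scores = []
        for word in model.vocab:
            if word in (a, b, c):
                continue  # skip the query words themselves
            vec = model[word] / np.linalg.norm(model[word])
            scores.append((word, float(np.dot(vec, target))))
        return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]

    analogy(model, 'man', 'king', 'woman')   # expected to rank 'queen' near the top
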
We can visualize the word embeddings using t-SNE (http://lvdmaaten.github.io/tsne/). For this demo, we visualize only the first 1000 words; you can increase this limit if you wish::

    import numpy as np

    labels = []
    count = 0
    max_count = 1000
    X = np.zeros(shape=(max_count, len(model['university'])))

    for term in model.vocab:
        X[count] = model[term]
        labels.append(term)
        count += 1
        if count >= max_count:
            break

    # It is recommended to use PCA first to reduce to ~50 dimensions
    from sklearn.decomposition import PCA
    pca = PCA(n_components=50)
    X_50 = pca.fit_transform(X)

    # Using TSNE to further reduce to 2 dimensions
    from sklearn.manifold import TSNE
    model_tsne = TSNE(n_components=2, random_state=0)
    Y = model_tsne.fit_transform(X_50)

    # Show the scatter plot
    import matplotlib.pyplot as plt
    plt.scatter(Y[:, 0], Y[:, 1], 20)

    # Add labels
    for label, x, y in zip(labels, Y[:, 0], Y[:, 1]):
        plt.annotate(label, xy=(x, y), xytext=(0, 0), textcoords='offset points', size=10)

    plt.show()
Prune the trained binary model
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Here is the supporting code to extract part of the binary model (GoogleNews-vectors-negative300.bin.gz) from https://code.google.com/p/word2vec/.
We use this code to generate the `word2vec_sample` model::

    import gensim
    from gensim.models.word2vec import Word2Vec

    # Load the binary model
    model = Word2Vec.load_word2vec_format('GoogleNews-vectors-negative300.bin.gz', binary=True)

    # Only output words that also appear in the Brown corpus
    from nltk.corpus import brown
    words = set(brown.words())
    print(len(words))

    # Write the retained words and their vectors to a temporary file
    # in word2vec text format: a header line, then one word per line.
    out_file = 'pruned.word2vec.txt'
    f = open(out_file, 'w')
    word_presented = words.intersection(model.vocab.keys())
    f.write('{} {}\n'.format(len(word_presented), len(model['word'])))
    for word in word_presented:
        f.write('{} {}\n'.format(word, ' '.join(str(value) for value in model[word])))
    f.close()

    # Reload the model from the text file
    new_model = Word2Vec.load_word2vec_format(out_file, binary=False)

    # Save it as a Gensim model
    gensim_model = "pruned.word2vec.bin"
    new_model.save(gensim_model)

    # Load the model
    very_new_model = gensim.models.Word2Vec.load(gensim_model)

    # Test it
    very_new_model.most_similar(positive=['king', 'woman'], negative=['man'], topn=1)