Commit 867000bd by Steven Bird

added gensim example, cf #971

parent c9e836a5
.. Copyright (C) 2001-2015 NLTK Project
.. For license information, see LICENSE.TXT
==================================================
Testing word embeddings with the Gensim package
==================================================
>>> import gensim
Overview
~~~~~~~~
Using the Gensim package, we demonstrate three functions:

- Train word embeddings on the Brown corpus.
- Load a pre-trained model and perform simple tasks with it.
- Prune the pre-trained binary model.
Train the model
~~~~~~~~~~~~~~~~~~
The word embeddings are trained on the Brown corpus:
>>> from nltk.corpus import brown
>>> model = gensim.models.Word2Vec(brown.sents())
It might take some time to train the model. Once the model is trained, you may want to save it so that you can load it again later:
>>> model.save('brown.embedding')
>>> new_model = gensim.models.Word2Vec.load('brown.embedding')
The model maps each word in its vocabulary to an embedding vector. We can easily get the vector representation of a word:
>>> len(new_model['university'])
100
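
The returned embedding is a plain NumPy array, so it can be used directly in any numerical code. A minimal check (under the usual assumption that Gensim stores its vectors as 32-bit floats)::

    vec = new_model['university']
    print(type(vec))     # <class 'numpy.ndarray'>
    print(vec.dtype)     # typically float32
    print(vec.shape)     # (100,) -- the default Word2Vec dimensionality
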
Gensim already provides a number of supporting functions for manipulating word embeddings. For example, to compute the cosine similarity between two words:
>>> new_model.similarity('university','school') > 0.3
True
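
Under the hood, `similarity` is the cosine of the angle between the two word vectors. A minimal sketch that recomputes it directly with NumPy; the result should agree with `new_model.similarity('university', 'school')` up to floating-point error::

    import numpy as np

    v1 = new_model['university']
    v2 = new_model['school']
    # Cosine similarity: dot product divided by the product of the vector norms.
    cosine = np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2))
    print(cosine)
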
Using the pre-trained model
~~~~~~~~~~~~~~~~~~~~~~~~~~~
NLTK also includes a pre-trained model which is part of a model trained on 100 billion words from the Google News dataset.
The full model is available from https://code.google.com/p/word2vec/ and is about 3 GB.
>>> from nltk.corpus import word2vec_sample
>>> model = gensim.models.Word2Vec.load(word2vec_sample)
We pruned the model to include only the most common words (~44k words).
>>> len(model.vocab)
43981
Each word is represented as a vector in a 300-dimensional space:
>>> len(model['university'])
300
Finding the top n words that are most similar to a target word is simple. The result is a list of the n words together with their similarity scores:
>>> model.most_similar(positive=['university'], topn = 3)
[(u'universities', 0.7003918886184692), (u'faculty', 0.6780908703804016), (u'undergraduate', 0.6587098240852356)]
Finding the word that does not belong in a list is also supported, although implementing this yourself is simple, as sketched below:
>>> model.doesnt_match('breakfast cereal dinner lunch'.split())
'cereal'
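
As a rough sketch of how one might implement this by hand (not Gensim's exact algorithm): normalise the vectors, average them, and return the word whose vector is least similar to that mean::

    import numpy as np

    def my_doesnt_match(model, words):
        # Normalise each word vector so that dot products are cosine similarities.
        vectors = np.array([model[w] / np.linalg.norm(model[w]) for w in words])
        mean = vectors.mean(axis=0)
        # The odd one out is the word least similar to the mean of the group.
        similarities = vectors.dot(mean)
        return words[int(np.argmin(similarities))]

    my_doesnt_match(model, 'breakfast cereal dinner lunch'.split())
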
Mikolov et al. (2013) showed that word embeddings capture many syntactic and semantic regularities. For example,
the vector 'King - Man + Woman' is close to 'Queen', and 'Germany - Berlin + Paris' is close to 'France'.
>>> model.most_similar(positive=['woman','king'], negative=['man'], topn = 1)
[(u'queen', 0.7118192911148071)]
>>> model.most_similar(positive=['Paris','Germany'], negative=['Berlin'], topn = 1)
[(u'France', 0.7884092926979065)]
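
The same analogy can be spelled out with explicit vector arithmetic. The sketch below scans the whole vocabulary, so it is much slower than `most_similar`, and because Gensim normalises the query vectors before combining them, the scores will not match exactly::

    import numpy as np

    def analogy(model, a, b, c, topn=1):
        # Form the query vector b - a + c, e.g. 'king' - 'man' + 'woman'.
        target = model[b] - model[a] + model[c]
        target = target / np.linalg.norm(target)
        scores = []
        for word in model.vocab:
            if word in (a, b, c):
                continue  # skip the query words themselves
            vec = model[word] / np.linalg.norm(model[word])
            scores.append((word, float(np.dot(vec, target))))
        return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]

    analogy(model, 'man', 'king', 'woman')   # expected to rank 'queen' near the top
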
We can visualize the word embeddings using t-SNE (http://lvdmaaten.github.io/tsne/). For this demo, we visualize only the first 1000 words; you can increase this limit if you wish::

    import numpy as np

    labels = []
    count = 0
    max_count = 1000
    X = np.zeros(shape=(max_count, len(model['university'])))

    for term in model.vocab:
        X[count] = model[term]
        labels.append(term)
        count += 1
        if count >= max_count:
            break

    # It is recommended to use PCA first to reduce to ~50 dimensions
    from sklearn.decomposition import PCA
    pca = PCA(n_components=50)
    X_50 = pca.fit_transform(X)

    # Using TSNE to further reduce to 2 dimensions
    from sklearn.manifold import TSNE
    model_tsne = TSNE(n_components=2, random_state=0)
    Y = model_tsne.fit_transform(X_50)

    # Show the scatter plot
    import matplotlib.pyplot as plt
    plt.scatter(Y[:, 0], Y[:, 1], 20)

    # Add labels
    for label, x, y in zip(labels, Y[:, 0], Y[:, 1]):
        plt.annotate(label, xy=(x, y), xytext=(0, 0), textcoords='offset points', size=10)

    plt.show()
Prune the trained binary model
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Here is the supporting code to extract part of the binary model (GoogleNews-vectors-negative300.bin.gz) from https://code.google.com/p/word2vec/.
We use this code to generate the `word2vec_sample` model::

    import gensim
    from gensim.models.word2vec import Word2Vec

    # Load the binary model
    model = Word2Vec.load_word2vec_format('GoogleNews-vectors-negative300.bin.gz', binary=True)

    # Only output words that also appear in the Brown corpus
    from nltk.corpus import brown
    words = set(brown.words())
    print(len(words))

    # Write the retained words and their vectors to a temporary file
    # in word2vec text format: a header line, then one word per line.
    out_file = 'pruned.word2vec.txt'
    f = open(out_file, 'w')
    word_presented = words.intersection(model.vocab.keys())
    f.write('{} {}\n'.format(len(word_presented), len(model['word'])))
    for word in word_presented:
        f.write('{} {}\n'.format(word, ' '.join(str(value) for value in model[word])))
    f.close()

    # Reload the model from the text file
    new_model = Word2Vec.load_word2vec_format(out_file, binary=False)

    # Save it as a Gensim model
    gensim_model = "pruned.word2vec.bin"
    new_model.save(gensim_model)

    # Load the model
    very_new_model = gensim.models.Word2Vec.load(gensim_model)

    # Test it
    very_new_model.most_similar(positive=['king', 'woman'], negative=['man'], topn=1)