Commit 867000bd authored May 18, 2015 by Steven Bird

added gensim example, cf #971

parent c9e836a5
Showing 1 changed file with 145 additions and 0 deletions

nltk/test/gensim.doctest.txt (new file, 0 → 100644)  +145  -0
.. Copyright (C) 2001-2015 NLTK Project
.. For license information, see LICENSE.TXT
=========================================
Demonstrate word embeddings using Gensim
=========================================
>>> import gensim
Overview
~~~~~~~~
Using the Gensim package, we demonstrate three functions:
- Train word embeddings on the Brown corpus.
- Load a pre-trained model and perform simple tasks with it.
- Prune the pre-trained binary model.
Train the model
~~~~~~~~~~~~~~~~~~
The word embeddings are trained on the Brown corpus:
>>> from nltk.corpus import brown
>>> model = gensim.models.Word2Vec(brown.sents())
Training the model might take some time; once it is trained, you will probably want to save it so you can use it later:
>>> model.save('brown.embedding')
>>> new_model = gensim.models.Word2Vec.load('brown.embedding')
The model maps each word in its vocabulary to an embedding vector, and we can easily get the vector representation of a word:
>>> len(new_model['university'])
100
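Words not seen during training are not in the model's vocabulary, and indexing them raises a `KeyError`, so it can be useful to check membership first. A minimal sketch, not run as part of this doctest::

    # The trained model exposes its vocabulary as a dict, so membership
    # can be checked before indexing (out-of-vocabulary words raise KeyError).
    if 'university' in new_model.vocab:
        vec = new_model['university']  # a 100-dimensional numpy array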
Gensim already provides some supporting functions for manipulating word embeddings.
For example, to compute the cosine similarity between two words:
>>> new_model.similarity('university','school') > 0.3
True
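For reference, the similarity score above is just the cosine of the angle between the two word vectors. A minimal numpy sketch of the same computation, not run as part of this doctest::

    import numpy as np

    # Cosine similarity: dot product of the two vectors divided by the
    # product of their norms; this should agree with new_model.similarity().
    v1, v2 = new_model['university'], new_model['school']
    cosine = np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2))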
Using the pre-trained model
~~~~~~~~~~~~~~~~~~~~~~~~~~~
NLTK also includes a pre-trained model which is part of a model trained on 100 billion words from the Google News dataset.
The full model, available from https://code.google.com/p/word2vec/, is about 3 GB.
>>> from nltk.corpus import word2vec_sample
>>> model = gensim.models.Word2Vec.load(word2vec_sample)
We pruned the model to only include the most common words (~44k words).
>>> len(model.vocab)
43981
Each word is represented as a vector with 300 dimensions:
>>> len(model['university'])
300
Finding the top-n words most similar to a target word is simple. The result is a list of the n most similar words together with their similarity scores:
>>> model.most_similar(positive=['university'], topn = 3)
[(u'universities', 0.7003918886184692), (u'faculty', 0.6780908703804016), (u'undergraduate', 0.6587098240852356)]
Finding the word that does not belong in a list is also supported, although implementing this yourself would be simple:
>>> model.doesnt_match('breakfast cereal dinner lunch'.split())
'cereal'
Mikolov et al. (2013) showed that word embeddings capture many syntactic and semantic regularities. For example,
the vector 'King - Man + Woman' is close to 'Queen', and 'Germany - Berlin + Paris' is close to 'France'.
>>> model.most_similar(positive=['woman','king'], negative=['man'], topn = 1)
[(u'queen', 0.7118192911148071)]
>>> model.most_similar(positive=['Paris','Germany'], negative=['Berlin'], topn = 1)
[(u'France', 0.7884092926979065)]
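Under the hood, `most_similar` with `positive` and `negative` word lists works with exactly this kind of vector offset. A rough illustrative sketch, not run as part of this doctest (the library call above is the supported interface)::

    import numpy as np

    def unit(v):
        # Normalise a vector to unit length.
        return v / np.linalg.norm(v)

    # Offset vector for the analogy king - man + woman.
    offset = unit(model['king']) - unit(model['man']) + unit(model['woman'])

    # Rank every vocabulary word by cosine similarity to the offset vector,
    # skipping the query words themselves.
    scores = [(w, np.dot(unit(offset), unit(model[w])))
              for w in model.vocab if w not in ('king', 'man', 'woman')]
    print(sorted(scores, key=lambda pair: pair[1], reverse=True)[:1])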
We can visualize the word embeddings using t-SNE (http://lvdmaaten.github.io/tsne/). For this demo, we visualize only the first 1000 words;
you can change this to a bigger value::
    import numpy as np

    labels = []
    count = 0
    max_count = 1000
    X = np.zeros(shape=(max_count, len(model['university'])))

    for term in model.vocab:
        X[count] = model[term]
        labels.append(term)
        count += 1
        if count >= max_count:
            break

    # It is recommended to use PCA first to reduce to ~50 dimensions
    from sklearn.decomposition import PCA
    pca = PCA(n_components=50)
    X_50 = pca.fit_transform(X)

    # Use t-SNE to further reduce to 2 dimensions
    from sklearn.manifold import TSNE
    model_tsne = TSNE(n_components=2, random_state=0)
    Y = model_tsne.fit_transform(X_50)

    # Show the scatter plot
    import matplotlib.pyplot as plt
    plt.scatter(Y[:, 0], Y[:, 1], 20)

    # Add labels to the points
    for label, x, y in zip(labels, Y[:, 0], Y[:, 1]):
        plt.annotate(label, xy=(x, y), xytext=(0, 0), textcoords='offset points', size=10)

    plt.show()
Prune the trained binary model
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Here is the supporting code for extracting part of the binary model (GoogleNews-vectors-negative300.bin.gz) from https://code.google.com/p/word2vec/.
We used this code to produce the `word2vec_sample` model::
    import gensim
    from gensim.models.word2vec import Word2Vec

    # Load the full binary model
    model = Word2Vec.load_word2vec_format('GoogleNews-vectors-negative300.bin.gz', binary=True)

    # Only keep words that appear in the Brown corpus
    from nltk.corpus import brown
    words = set(brown.words())
    print(len(words))

    # Write the retained words and their vectors to a temporary text file
    out_file = 'pruned.word2vec.txt'
    f = open(out_file, 'w')
    word_presented = words.intersection(model.vocab.keys())
    # First line of the word2vec text format: vocabulary size and vector dimensionality
    f.write('{} {}\n'.format(len(word_presented), len(model['word'])))
    for word in word_presented:
        f.write('{} {}\n'.format(word, ' '.join(str(value) for value in model[word])))
    f.close()

    # Reload the pruned model from the text file
    new_model = Word2Vec.load_word2vec_format(out_file, binary=False)

    # Save it in Gensim's native format
    gensim_model = 'pruned.word2vec.bin'
    new_model.save(gensim_model)

    # Load the model
    very_new_model = gensim.models.Word2Vec.load(gensim_model)

    # Test it
    very_new_model.most_similar(positive=['king', 'woman'], negative=['man'], topn=1)