Commit e514dac8 authored Nov 03, 2014 by Dmitrijs Milajevs
Refactoring of wsd.lesk().
parent b6767295
Showing 2 changed files with 25 additions and 49 deletions
nltk/test/wsd.doctest  +6 -3
nltk/wsd.py  +19 -46
nltk/test/wsd.doctest
...
@@ -20,10 +20,13 @@ a Synset with the highest number of overlapping words between the context
sentence and different definitions from each Synset.
     >>> from nltk.wsd import lesk
-    >>> from nltk import word_tokenize
-    >>> sent = word_tokenize('I went to the bank to deposit money.')
+    >>> sent = ['I', 'went', 'to', 'the', 'bank', 'to', 'deposit', 'money', '.']
     >>> print(lesk(sent, 'bank', 'n'))
-    Synset('bank.n.07')
+    Synset('savings_bank.n.02')
+    >>> print(lesk(sent, 'bank'))
+    Synset('savings_bank.n.02')
 The definitions for "bank" are:
...
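Review note: the expected sense in this doctest moves from `bank.n.07` to `savings_bank.n.02`. Judging only from the code in this commit, one plausible reading is that the two implementations break ties differently: the removed `_compare_overlaps_greedy` scans candidates in sorted order and keeps the first one with a strictly greater overlap, while the new `max()` over `(overlap, synset)` tuples falls back to comparing the synsets themselves when overlaps are equal. The sketch below reproduces that difference without WordNet. `FakeSynset`, `first_best`, `max_best`, the synset names and the glosses are all hypothetical, and ordering synsets by name is an assumption made for illustration.

```python
from functools import total_ordering


@total_ordering
class FakeSynset:
    """Hypothetical stand-in for a synset; ordered by name (an assumption)."""

    def __init__(self, name, definition):
        self.name = name
        self._definition = definition

    def definition(self):
        return self._definition

    def __eq__(self, other):
        return self.name == other.name

    def __lt__(self, other):
        return self.name < other.name

    def __hash__(self):
        return hash(self.name)


def first_best(context, synsets):
    """Removed logic: scan in sorted order, keep the first strict maximum."""
    best, best_overlap = None, 0
    for ss in sorted(synsets):
        overlap = len(context.intersection(ss.definition().split()))
        if overlap > best_overlap:
            best, best_overlap = ss, overlap
    return best


def max_best(context, synsets):
    """New logic: max() over (overlap, synset); ties fall through to the synset."""
    _, sense = max(
        (len(context.intersection(ss.definition().split())), ss)
        for ss in synsets
    )
    return sense


context = set(['I', 'went', 'to', 'the', 'bank', 'to', 'deposit', 'money', '.'])

# Invented glosses; each shares exactly two tokens with the context.
candidates = [
    FakeSynset('a.n.01', 'a slope next to the road'),
    FakeSynset('b.n.01', 'a container for the money'),
]

old_choice = first_best(context, candidates)
new_choice = max_best(context, candidates)
```

With both candidates tied on overlap, `first_best` keeps the candidate that sorts first and `max_best` returns the one that sorts last, which matches the direction of the change in the expected output above.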
nltk/wsd.py
...
@@ -6,68 +6,41 @@
# URL: <http://nltk.org/>
# For license information, see LICENSE.TXT
-from nltk.corpus import wordnet as wn
+from nltk.corpus import wordnet
############################################################
# Lesk Algorithm
############################################################
-def _compare_overlaps_greedy(context, synsets_signatures, pos=None):
-    """
-    Calculate overlaps between the context sentence and the synset_signature
-    and returns the synset with the highest overlap.
-    :param context: ``context_sentence`` The context sentence where the ambiguous word occurs.
-    :param iter synsets_signatures: An iterable of pairs of synsets and their definitions.
-    :param pos: ``pos`` A specified Part-of-Speech (POS).
-    :return: ``lesk_sense`` The Synset() object with the highest signature overlaps.
-    """
-    max_overlaps = 0
-    lesk_sense = None
-    context = set(context)
-    for ss, definition in synsets_signatures:
-        if pos and str(ss.pos()) == pos: # Skips different POS.
-            overlaps = len(context.intersection(definition))
-            if overlaps > max_overlaps:
-                lesk_sense = ss
-                max_overlaps = overlaps
-    return lesk_sense
-def lesk(context_sentence, ambiguous_word, pos=None, dictionary=None):
-    """
-    This function is the implementation of the original Lesk algorithm (1986) [1].
-    It requires a dictionary which contains the definition of the different
-    sense of each word.
-        >>> from nltk import word_tokenize
-        >>> sent = word_tokenize("I went to the bank to deposit money.")
-        >>> lesk(sent, 'bank', 'n')
-        Synset('bank.n.07')
+def lesk(context_sentence, ambiguous_word, pos=None, synsets=None):
+    """Return a synset for an ambiguous word.
     :param context_sentence: The context sentence where the ambiguous word occurs.
     :param ambiguous_word: The ambiguous word that requires WSD.
     :param pos: A specified Part-of-Speech (POS).
-    :param dict dictionary: A mapping of synsets to their definitions.
+    :param iter synsets: Possible synsets of the ambiguous word.
     :return: ``lesk_sense`` The Synset() object with the highest signature overlaps.
+    This function is an implementation of the original Lesk algorithm (1986) [1].
+    Usage example::
+        >>> lesk(['I', 'went', 'to', 'the', 'bank', 'to', 'deposit', 'money', '.'], 'bank', 'n')
+        Synset('savings_bank.n.02')
     [1] Lesk, Michael. "Automatic sense disambiguation using machine readable
     dictionaries: how to tell a pine cone from an ice cream cone." Proceedings
     of the 5th annual international conference on Systems documentation. ACM,
     1986. http://dl.acm.org/citation.cfm?id=318728
     """
-    if not dictionary:
-        dictionary = dict((ss, ss.definition().split()) for ss in wn.synsets(ambiguous_word))
+    context = set(context_sentence)
+    if not synsets:
+        synsets = wordnet.synsets(ambiguous_word)
-    dictionary_items = sorted(dictionary.items())
+    _, sense = max(
+        (len(context.intersection(ss.definition().split())), ss) for ss in synsets
+        if pos is None or str(ss.pos()) == pos
+    )
-    return _compare_overlaps_greedy(context_sentence, dictionary_items, pos)
+    return sense
 if __name__ == "__main__":
...
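Review note: the refactored body passes a generator straight to `max()`. Python's built-in `max()` raises `ValueError` when its iterable is empty and no `default` is given, so a call where no candidate synset matches `pos` would raise rather than return `None` as the removed helper did. The sketch below shows one possible guard. It is not part of this commit: `StubSynset` and `lesk_or_none` are hypothetical names, the glosses are invented, and the function takes the candidate synsets as an argument instead of querying WordNet.

```python
class StubSynset:
    """Hypothetical stand-in exposing only pos() and definition()."""

    def __init__(self, name, pos, definition):
        self.name = name
        self._pos = pos
        self._definition = definition

    def pos(self):
        return self._pos

    def definition(self):
        return self._definition


def lesk_or_none(context_sentence, synsets, pos=None):
    """Same overlap scoring as the new lesk(), but None when nothing matches."""
    context = set(context_sentence)
    scored = [
        (len(context.intersection(ss.definition().split())), ss.name, ss)
        for ss in synsets
        if pos is None or str(ss.pos()) == pos
    ]
    if not scored:
        # max() on an empty sequence would raise ValueError here.
        return None
    # Compare on (overlap, name) only, so the stub needs no ordering of its own.
    return max(scored, key=lambda item: item[:2])[-1]


sent = ['I', 'went', 'to', 'the', 'bank', 'to', 'deposit', 'money', '.']
stubs = [
    StubSynset('slope.n.01', 'n', 'sloping land beside water'),
    StubSynset('money_box.n.01', 'n', 'a container for keeping money at home'),
]

noun_sense = lesk_or_none(sent, stubs, pos='n')
verb_sense = lesk_or_none(sent, stubs, pos='v')

# The unguarded form fails on empty input:
try:
    max(item for item in [])
    empty_max_raised = False
except ValueError:
    empty_max_raised = True
```

Whether returning `None` or raising is preferable depends on how callers use `lesk()`; the sketch only makes the difference from the removed helper visible.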