Commit 4b802fc7 (edx/nltk)
Authored Apr 11, 2011 by Steven Bird
* updated changelog
* fixed lots of doctest-related issues svn/trunk@8784
Parent: 392f099c
Showing 9 changed files with 58 additions and 52 deletions:
ChangeLog                               +10   -4
nltk/corpus/reader/bracket_parse.py      +4   -3
nltk/sem/chat80.py                       +3   -3
nltk/test/ccg.doctest                    +4   -4
nltk/test/chat80.doctest                 +1   -2
nltk/test/probability.doctest            +1   -1
nltk/test/tree.doctest                  +11  -11
nltk/test/treetransforms.doctest         +1   -1
nltk/test/wordnet.doctest               +23  -23
ChangeLog  (View file @ 4b802fc7)

-Version 2.0.1 2011-04-??
+Version 2.0.1 (rc1) 2011-04-11
 NLTK:
 * added interface to the Stanford POS Tagger
...
@@ -12,7 +12,7 @@ NLTK:
 * fixed issue with NLTK's tokenize module colliding with the Python tokenize module
 * fixed issue with stemming Unicode strings
 * changed ViterbiParser.nbest_parse to parse
-* KNBC Japanese corpus reader
+* ChaSen and KNBC Japanese corpus readers
 * preserve case in concordance display
 * fixed bug in simplification of Brown tags
 * a version of IBM Model 1 as described in Koehn 2010
...
@@ -28,9 +28,15 @@ NLTK:
 * simplifications and corrections of Earley Chart Parser rules
 * several changes to the feature chart parsers for correct unification
 * bugfixes: FreqDist.plot, FreqDist.max, NgramModel.entropy, CategorizedCorpusReader, DecisionTreeClassifier
 * removal of Python >2.4 language features for 2.4 compatibility
 * removal of deprecated functions and associated warnings
+* added semantic domains to wordnet corpus reader
+* changed wordnet similarity functions to include instance hyponyms
+* updated to use latest version of Boxer
 Data:
+* Japanese corpora...
+* JEITA Public Morphologically Tagged Corpus (in ChaSen format)
+* KNB Annotated corpus of Japanese blog posts
 * Fixed some minor bugs in alvey.fcfg, and added number of parse trees in alvey_sentences.txt
 * added more comtrans data
...
@@ -39,7 +45,7 @@ Documentation:
 * NLTK Japanese book (chapter 12) by Masato Hagiwara
 NLTK-Contrib:
-* Contribute a version of the Viethen and Dale referring expression algorithms
+* Viethen and Dale referring expression algorithms
 Thanks to the following contributors to 2.0.1 (since 2.0b9, July 2010)
 Yonatan Becker, Steven Bethard, David Coles, Dan Garrette,
...
nltk/corpus/reader/bracket_parse.py  (View file @ 4b802fc7)
...
@@ -8,7 +8,7 @@
 import sys
-from nltk.tree import bracket_parse, Tree
+from nltk.tree import Tree
 from util import *
 from api import *
...
@@ -75,14 +75,15 @@ class BracketParseCorpusReader(SyntaxCorpusReader):
     def _parse(self, t):
         try:
-            return bracket_parse(self._normalize(t))
+            return Tree.parse(self._normalize(t))
         except ValueError, e:
             sys.stderr.write("Bad tree detected; trying to recover...\n")
             # Try to recover, if we can:
             if e.args == ('mismatched parens',):
                 for n in range(1, 5):
                     try:
-                        v = bracket_parse(self._normalize(t + ')' * n))
+                        v = Tree.parse(self._normalize(t + ')' * n))
                         sys.stderr.write("  Recovered by adding %d close "
                                          "paren(s)\n" % n)
                         return v
...
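For context, this change swaps the removed module-level bracket_parse() helper for the Tree.parse() classmethod. A minimal usage sketch, assuming the NLTK 2.x API current at the time of this commit (where a tree's root label is the .node attribute):

# Sketch only: Tree.parse() replaces the removed bracket_parse() helper.
from nltk.tree import Tree

tree = Tree.parse("(S (NP (DT the) (NN dog)) (VP (VBD barked)))")
print tree        # round-trips the bracketed string
print tree.node   # 'S' -- root label attribute in NLTK 2.x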
nltk/sem/chat80.py  (View file @ 4b802fc7)
...
@@ -403,7 +403,7 @@ def cities2table(filename, rel_name, dbname, verbose=False, setup=False):
         cur.close()
     except ImportError:
         import warnings
-        warnings.warn("To run this function, first install pysqlite.")
+        warnings.warn("To run this function, first install pysqlite, or else use Python 2.5 or later.")

 def sql_query(dbname, query):
     """
...
@@ -423,7 +423,7 @@ def sql_query(dbname, query):
         return cur.execute(query)
     except ImportError:
         import warnings
-        warnings.warn("To run this function, first install pysqlite.")
+        warnings.warn("To run this function, first install pysqlite, or else use Python 2.5 or later.")
         raise

 def _str2records(filename, rel):
...
@@ -780,7 +780,7 @@ def sql_demo():
             print row
     except ImportError:
         import warnings
-        warnings.warn("To run the SQL demo, first install pysqlite.")
+        warnings.warn("To run the SQL demo, first install pysqlite, or else use Python 2.5 or later.")

 if __name__ == '__main__':
...
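The reworded warnings reflect that Python 2.5+ ships the sqlite3 module in the standard library, so pysqlite is only needed on older interpreters. A hedged sketch of the guarded-import pattern this implies; the module and package names (sqlite3, pysqlite2.dbapi2) are real, but the exact fallback structure here is illustrative, not the commit's code:

# Illustrative sketch, not the code in this commit.
try:
    import sqlite3                               # bundled with Python 2.5+
except ImportError:
    try:
        from pysqlite2 import dbapi2 as sqlite3  # pysqlite on older Pythons
    except ImportError:
        import warnings
        warnings.warn("To run this function, first install pysqlite, "
                      "or else use Python 2.5 or later.")
        raise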
nltk/test/ccg.doctest  (View file @ 4b802fc7)
...
@@ -196,12 +196,12 @@ Note that while the two derivations are different, they are semantically equivalent.
 (((S\NP)/NP)\.,((S\NP)/NP))
 -----------------------------------------------------------------------<
 ((S\NP)/NP)
 ------------------------------------------------------------------------------->B
 ((S\NP)/N)
 ------------------------------------->
 (N\.,N)
 ------------------------------------------------<
 N
 -------------------------------------------------------->
 NP
 ------------------------------------------------------------------------------------------------------------------------------->
 (S\NP)
 -----------------------------------------------------------------------------------------------------------------------------------<
...
@@ -216,12 +216,12 @@ Note that while the two derivations are different, they are semantically equivalent.
 (((S\NP)/NP)\.,((S\NP)/NP))
 -----------------------------------------------------------------------<
 ((S\NP)/NP)
 ------------------------------------------------------------------------------->B
 ((S\NP)/N)
 ------------------------------------->
 (N\.,N)
 ------------------------------------------------<
 N
 -------------------------------------------------------->
 NP
 ------------------------------------------------------------------------------------------------------------------------------->
 (S\NP)
 -----------------------------------------------------------------------------------------------------------------------------------<
...
nltk/test/chat80.doctest  (View file @ 4b802fc7)
...
@@ -199,9 +199,8 @@ to SQL:
 Given this grammar, we can express, and then execute, queries in English.
->>> from nltk.parse import load_earley
 >>> from string import join
->>> cp = load_earley('grammars/book_grammars/sql0.fcfg')
+>>> cp = nltk.data.load('grammars/book_grammars/sql0.fcfg')
 >>> query = 'What cities are in China'
 >>> trees = cp.nbest_parse(query.split())
 >>> answer = trees[0].node['SEM']
...
nltk/test/probability.doctest  (View file @ 4b802fc7)
...
@@ -65,7 +65,7 @@ from the whole corpus, not just the training corpus
 >>> symbols = list(set([word for sent in corpus for (word,tag) in sent]))
 >>> print len(symbols)
 1464
->>> trainer = nltk.HiddenMarkovModelTrainer(tag_set, symbols)
+>>> trainer = nltk.tag.HiddenMarkovModelTrainer(tag_set, symbols)

 We divide the corpus into 90% training and 10% testing
...
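The fix here is just the fully qualified name, but for readers following along, the surrounding workflow is: construct the trainer with the full tag set and vocabulary, then fit it on the training split. A minimal sketch assuming the NLTK 2.x nltk.tag API; train_corpus and test_corpus stand in for the 90%/10% splits the doctest describes:

# Sketch of the HMM training workflow the doctest exercises (NLTK 2.x).
import nltk

trainer = nltk.tag.HiddenMarkovModelTrainer(tag_set, symbols)
hmm = trainer.train_supervised(train_corpus)   # fit on the 90% training split
print hmm.evaluate(test_corpus)                # accuracy on the 10% test split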
nltk/test/tree.doctest  (View file @ 4b802fc7)
...
@@ -158,26 +158,26 @@ then it simply delegates to `Tree.parse()`.
 Trees can be compared for equality:
->>> tree == bracket_parse(str(tree))
+>>> tree == Tree.parse(str(tree))
 True
->>> tree2 == bracket_parse(str(tree2))
+>>> tree2 == Tree.parse(str(tree2))
 True
 >>> tree == tree2
 False
->>> tree == bracket_parse(str(tree2))
+>>> tree == Tree.parse(str(tree2))
 False
->>> tree2 == bracket_parse(str(tree))
+>>> tree2 == Tree.parse(str(tree))
 False
->>> tree != bracket_parse(str(tree))
+>>> tree != Tree.parse(str(tree))
 False
->>> tree2 != bracket_parse(str(tree2))
+>>> tree2 != Tree.parse(str(tree2))
 False
 >>> tree != tree2
 True
->>> tree != bracket_parse(str(tree2))
+>>> tree != Tree.parse(str(tree2))
 True
->>> tree2 != bracket_parse(str(tree))
+>>> tree2 != Tree.parse(str(tree))
 True
 >>> tree < tree2 or tree > tree2
...
@@ -567,7 +567,7 @@ variable:
 Define a helper funciton to create new parented trees:
 >>> def make_ptree(s):
-...     ptree = ParentedTree.convert(bracket_parse(s))
+...     ptree = ParentedTree.convert(Tree.parse(s))
 ...     all_ptrees.extend(t for t in ptree.subtrees()
 ...                       if isinstance(t, Tree))
 ...     return ptree
...
@@ -838,7 +838,7 @@ variable:
 Define a helper funciton to create new parented trees:
 >>> def make_mptree(s):
-...     mptree = MultiParentedTree.convert(bracket_parse(s))
+...     mptree = MultiParentedTree.convert(Tree.parse(s))
 ...     all_mptrees.extend(t for t in mptree.subtrees()
 ...                        if isinstance(t, Tree))
 ...     return mptree
...
@@ -1126,6 +1126,6 @@ This used to cause an infinite loop (fixed in svn 6269):
 This used to discard the ``(B b)`` subtree (fixed in svn 6270):
->>> print bracket_parse('((A a) (B b))')
+>>> print Tree.parse('((A a) (B b))')
 ( (A a) (B b))
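The equality block above checks a round-trip property: serialising a tree with str() and re-parsing it yields an equal tree, while distinct trees stay unequal. A compact sketch of the same invariant, assuming the NLTK 2.x Tree.parse() API:

# Round-trip invariant exercised by the doctest above (NLTK 2.x).
from nltk.tree import Tree

tree = Tree.parse("(S (NP I) (VP (V saw) (NP him)))")
assert tree == Tree.parse(str(tree))      # str() then parse() is lossless
assert not (tree != Tree.parse(str(tree)))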
nltk/test/treetransforms.doctest  (View file @ 4b802fc7)
...
@@ -11,7 +11,7 @@ Unit tests for the TreeTransformation class
 >>> sentence = "(TOP (S (S (VP (VBN Turned) (ADVP (RB loose)) (PP (IN in) (NP (NP (NNP Shane) (NNP Longman) (POS 's)) (NN trading) (NN room))))) (, ,) (NP (DT the) (NN yuppie) (NNS dealers)) (VP (AUX do) (NP (NP (RB little)) (ADJP (RB right)))) (. .)))"
->>> tree = bracket_parse(sentence)
+>>> tree = Tree.parse(sentence)
 >>> print tree
 (TOP
   (S
...
nltk/test/wordnet.doctest  (View file @ 4b802fc7)
...
@@ -171,13 +171,13 @@ The old behavior can be achieved by setting simulate_root to be False.
 A score of 1 represents identity i.e. comparing a sense with itself
 will return 1.
->>> dog.path_similarity(cat)
+>>> dog.path_similarity(cat) # doctest: +ELLIPSIS
 0.2...
->>> hit.path_similarity(slap)
+>>> hit.path_similarity(slap) # doctest: +ELLIPSIS
 0.142...
->>> wn.path_similarity(hit, slap)
+>>> wn.path_similarity(hit, slap) # doctest: +ELLIPSIS
 0.142...
 >>> print hit.path_similarity(slap, simulate_root=False)
...
@@ -194,13 +194,13 @@ of the taxonomy in which the senses occur. The relationship is given
 as -log(p/2d) where p is the shortest path length and d the taxonomy
 depth.
->>> dog.lch_similarity(cat)
+>>> dog.lch_similarity(cat) # doctest: +ELLIPSIS
 2.028...
->>> hit.lch_similarity(slap)
+>>> hit.lch_similarity(slap) # doctest: +ELLIPSIS
 1.312...
->>> wn.lch_similarity(hit, slap)
+>>> wn.lch_similarity(hit, slap) # doctest: +ELLIPSIS
 1.312...
 >>> print hit.lch_similarity(slap, simulate_root=False)
...
@@ -225,7 +225,7 @@ shortest path to the root node is the longest will be selected. Where
 the LCS has multiple paths to the root, the longer path is used for
 the purposes of the calculation.
->>> dog.wup_similarity(cat)
+>>> dog.wup_similarity(cat) # doctest: +ELLIPSIS
 0.857...
 >>> hit.wup_similarity(slap)
...
@@ -263,9 +263,9 @@ information content, the result is dependent on the corpus used to
 generate the information content and the specifics of how the
 information content was created.
->>> dog.res_similarity(cat, brown_ic)
+>>> dog.res_similarity(cat, brown_ic) # doctest: +ELLIPSIS
 7.911...
->>> dog.res_similarity(cat, genesis_ic)
+>>> dog.res_similarity(cat, genesis_ic) # doctest: +ELLIPSIS
 7.204...
 ``synset1.jcn_similarity(synset2, ic):``
...
@@ -275,9 +275,9 @@ Information Content (IC) of the Least Common Subsumer (most specific
 ancestor node) and that of the two input Synsets. The relationship is
 given by the equation 1 / (IC(s1) + IC(s2) - 2 * IC(lcs)).
->>> dog.jcn_similarity(cat, brown_ic)
+>>> dog.jcn_similarity(cat, brown_ic) # doctest: +ELLIPSIS
 0.449...
->>> dog.jcn_similarity(cat, genesis_ic)
+>>> dog.jcn_similarity(cat, genesis_ic) # doctest: +ELLIPSIS
 0.285...
 ``synset1.lin_similarity(synset2, ic):``
...
@@ -287,7 +287,7 @@ Information Content (IC) of the Least Common Subsumer (most specific
 ancestor node) and that of the two input Synsets. The relationship is
 given by the equation 2 * IC(lcs) / (IC(s1) + IC(s2)).
->>> dog.lin_similarity(cat, semcor_ic)
+>>> dog.lin_similarity(cat, semcor_ic) # doctest: +ELLIPSIS
 0.886...
...
@@ -405,7 +405,7 @@ Bug 160: wup_similarity breaks when the two synsets have no common hypernym
 >>> t = wn.synsets('picasso')[0]
 >>> m = wn.synsets('male')[1]
->>> t.wup_similarity(m)
+>>> t.wup_similarity(m) # doctest: +ELLIPSIS
 0.631...
 >>> t = wn.synsets('titan')[1]
...
@@ -418,14 +418,14 @@ Bug 21: "instance of" not included in LCS (very similar to bug 160)
 >>> a = wn.synsets("writings")[0]
 >>> b = wn.synsets("scripture")[0]
 >>> brown_ic = wordnet_ic.ic('ic-brown.dat')
->>> a.jcn_similarity(b, brown_ic)
+>>> a.jcn_similarity(b, brown_ic) # doctest: +ELLIPSIS
 0.175...

 Bug 221: Verb root IC is zero
 >>> from nltk.corpus.reader.wordnet import information_content
 >>> s = wn.synsets('say', wn.VERB)[0]
->>> information_content(s, brown_ic)
+>>> information_content(s, brown_ic) # doctest: +ELLIPSIS
 4.623...

 Bug 161: Comparison between WN keys/lemmas should not be case sensitive
...
@@ -451,7 +451,7 @@ Bug 382: JCN Division by zero error
 >>> shlep = wn.synset('shlep.v.02')
 >>> from nltk.corpus import wordnet_ic
 >>> brown_ic = wordnet_ic.ic('ic-brown.dat')
->>> tow.jcn_similarity(shlep, brown_ic)
+>>> tow.jcn_similarity(shlep, brown_ic) # doctest: +ELLIPSIS
 1...e+300

 Bug 428: Depth is zero for instance nouns
...
@@ -473,7 +473,7 @@ Bug 470: shortest_path_distance ignored instance hypernyms
 >>> google = wordnet.synsets("google")[0]
 >>> earth = wordnet.synsets("earth")[0]
->>> google.wup_similarity(earth)
+>>> google.wup_similarity(earth) # doctest: +ELLIPSIS
 0.1...

 Bug 484: similarity metrics returned -1 instead of None for no LCS
...
@@ -505,17 +505,17 @@ Bug 482: Some nouns not being lemmatised by WordNetLemmatizer().lemmatize
 Bug 284: instance hypernyms not used in similarity calculations
->>> wn.synset('john.n.02').lch_similarity(wn.synset('dog.n.01'))
+>>> wn.synset('john.n.02').lch_similarity(wn.synset('dog.n.01')) # doctest: +ELLIPSIS
 1.335...
->>> wn.synset('john.n.02').wup_similarity(wn.synset('dog.n.01'))
+>>> wn.synset('john.n.02').wup_similarity(wn.synset('dog.n.01')) # doctest: +ELLIPSIS
 0.571...
->>> wn.synset('john.n.02').res_similarity(wn.synset('dog.n.01'), brown_ic)
+>>> wn.synset('john.n.02').res_similarity(wn.synset('dog.n.01'), brown_ic) # doctest: +ELLIPSIS
 2.224...
->>> wn.synset('john.n.02').jcn_similarity(wn.synset('dog.n.01'), brown_ic)
+>>> wn.synset('john.n.02').jcn_similarity(wn.synset('dog.n.01'), brown_ic) # doctest: +ELLIPSIS
 0.075...
->>> wn.synset('john.n.02').lin_similarity(wn.synset('dog.n.01'), brown_ic)
+>>> wn.synset('john.n.02').lin_similarity(wn.synset('dog.n.01'), brown_ic) # doctest: +ELLIPSIS
 0.252...
->>> wn.synset('john.n.02').hypernym_paths()
+>>> wn.synset('john.n.02').hypernym_paths() # doctest: +ELLIPSIS
 [[Synset('entity.n.01'), ..., Synset('john.n.02')]]

 Issue 541: add domains to wordnet
...
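A note on why every similarity call in this file gained a # doctest: +ELLIPSIS directive: the expected outputs are written with trailing "..." (e.g. 0.142...), which only matches the platform-dependent tail of a float when the ELLIPSIS option is enabled for that example. A self-contained sketch of the mechanism; the path_sim_demo function is hypothetical, not from this commit:

# Hypothetical demo of the doctest ELLIPSIS option.
def path_sim_demo():
    """
    >>> path_sim_demo()  # doctest: +ELLIPSIS
    0.142...
    """
    print 1.0 / 7   # prints 0.142857142857; '...' absorbs the tail

if __name__ == '__main__':
    import doctest
    doctest.testmod()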