Commit f943efed by Steven Bird

deleted stale nltk_data files

parent 0bc0f799
# Natural Language Toolkit (NLTK) Package for building models
#
# Copyright (C) 2001-2011 NLTK Project
# Authors: Steven Bird <sb@csse.unimelb.edu.au>
# Edward Loper <edloper@gradient.cis.upenn.edu>
# URL: <http://www.nltk.org/>
# For license information, see LICENSE.TXT
"""
A package for building the models distributed with NLTK.
"""
Pruska Biblioteka Państwowa. Jej dawne zbiory znane pod nazwą
"Berlinka" to skarb kultury i sztuki niemieckiej. Przewiezione przez
Niemców pod koniec II wojny światowej na Dolny Śląsk, zostały
odnalezione po 1945 r. na terytorium Polski. Trafiły do Biblioteki
Jagiellońskiej w Krakowie, obejmują ponad 500 tys. zabytkowych
archiwaliów, m.in. manuskrypty Goethego, Mozarta, Beethovena, Bacha.
<?xml version="1.0" encoding="gb2312" ?>
<sent>
甚至猫以人贵
In some cases, cats were valued above humans.
</sent>
<?xml version="1.0" encoding="utf-8" ?>
<doc>
<sent>
甚至猫以人贵
In some cases, cats were valued above humans.
</sent>
</doc>
# Natural Language Toolkit (NLTK) Package for Tagset Tables
#
# Copyright (C) 2001-2011 NLTK Project
# Authors: Steven Bird <sb@csse.unimelb.edu.au>
# URL: <http://www.nltk.org/>
# For license information, see LICENSE.TXT
"""
A package for building the Tagset Tables distributed with NLTK
"""
import re
import pickle
from tagset_data import *
RECORD_SEP = " *"
FIELD_SEP = " - "
TAGSETS = {'upenn_tagset': upenn_tagset,
           'brown_tagset': brown_tagset,
           'claws5_tagset': claws5_tagset}

def load_tagset(s):
    tagset = {}
    entries = s.split(RECORD_SEP)
    for entry in entries:
        if FIELD_SEP in entry:
            entry = re.sub(r'(?m)\s+', ' ', entry)
            _, tag, defn, examples = entry.split(FIELD_SEP, 3)
            if tag not in tagset:
                tagset[tag] = (defn, examples)
            else:
                raise ValueError, "Duplicate tag: %s" % tag
    return tagset

def build_tagsets():
    for tagset in TAGSETS:
        print "Building", tagset
        output = open(tagset + ".pickle", "w")
        tagset_dict = load_tagset(TAGSETS[tagset])
        pickle.dump(tagset_dict, output)
        output.close()

if __name__ == '__main__':
    build_tagsets()