SPMRL 2014 software and data

These are the settings, programs and scripts we used for our paper "Parsing Morphologically Rich Languages with (Mostly) Off-The-Shelf Software and Word Vectors". If you have questions regarding the setup, send me an email.

The Word Vectors

The unlabeled data was converted into "one sentence per line" format with cleanup.py.txt (remove the .txt extension; it's an artifact of this wiki).

The word vectors were created as follows:

./word2vec -train $inputfile -output ${vector}-200-cbow.vectors -cbow 1 -size 200 -window 5 -negative 0 -hs 1 -sample 1e-3 -threads 12 -binary 0
./word2vec -train $inputfile -output ${vector}-200-skipgram-5neg.vectors -cbow 0 -size 200 -window 5 -negative 1 -hs 1 -sample 1e-3 -threads 12 -binary 0

The 400-dimensional word vector combinations were created with this script: make_composite_vectors.sh
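
For readers without access to make_composite_vectors.sh, here is a minimal sketch of one way to build such composites. It assumes (this is not confirmed by the script itself) that a 400-dimensional vector is simply the 200-dimensional CBOW vector followed by the 200-dimensional skip-gram vector of the same word. The function name concat_vectors and the output file name in the usage comment are hypothetical. The inputs are expected in word2vec's text format (-binary 0): a header line with vocabulary size and dimensionality, then one word and its values per line.

```shell
# Hypothetical helper, not the authors' make_composite_vectors.sh.
# Concatenates the vectors of two word2vec text files word by word.
# $1: first vector file, $2: second vector file; the result goes to stdout.
concat_vectors() {
  awk '
    FNR == 1 { nfile++; dims[nfile] = $2; next }   # header line of each file
    nfile == 1 {                                   # first file: remember vectors
      word = $1
      sub(/^[^ ]+ +/, "")                          # drop the word itself
      sub(/ +$/, "")                               # drop trailing blanks
      first[word] = $0
      next
    }
    ($1 in first) {                                # second file: shared words only
      word = $1
      sub(/^[^ ]+ +/, "")
      sub(/ +$/, "")
      out[++n] = word " " first[word] " " $0
    }
    END {
      print n + 0, dims[1] + dims[2]               # new header: size and dimensions
      for (i = 1; i <= n; i++) print out[i]
    }
  ' "$1" "$2"
}

# Usage (the output name is an assumption):
# concat_vectors ${vector}-200-cbow.vectors ${vector}-200-skipgram-5neg.vectors > ${vector}-400.vectors
```

The sketch keeps all vectors in memory so that it can write the header first, which may be impractical for very large vocabularies; words missing from either file are dropped.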

If you are interested in the word vectors we used, please send me an e-mail. Please note that they are about 4 GB in total.

The Parsers

TurboParser was trained as follows:
/path/to/TurboParser/TurboParser -train -file_train $inputfile -file_model $modelfile

The Mate parser was trained as follows:
java -Xmx25000M -cp anna-3.61.jar is2.parser.Parser -i 20 -train train5k.$language.gold.conll9 -model mate-$language.model

RBGParser was trained as follows:
java -classpath "bin:lib/trove.jar" -Xmx35000m parser.DependencyParser train train-file:../${language^^}_SPMRL/gold/conll/train5k/train5k.$language.gold.conll model-file:$language-$vector.model thread:8 label:true model:full word-vector:../Unlabeled.$language.pred.-$vector.vectors
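
The RBGParser command relies on the ${language^^} expansion, which upper-cases the value of $language and requires Bash 4 or later. The following sketch shows how the training-file path expands; the helper name spmrl_train_file and the example value German are assumptions for illustration.

```shell
# Hypothetical helper: prints the training-file path for one language.
# Requires Bash 4 or later for the ${var^^} (upper-case) expansion.
spmrl_train_file() {
  local language=$1
  printf '%s\n' "../${language^^}_SPMRL/gold/conll/train5k/train5k.$language.gold.conll"
}

spmrl_train_file German
# prints ../GERMAN_SPMRL/gold/conll/train5k/train5k.German.gold.conll
```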

The Relabeler

The relabeler uses megam.

The Lattice Chooser

(remove the .txt extension; it's an artifact of this wiki)
