SPMRL 2014 software and data
These are the settings, programs and scripts we used for our paper "Parsing Morphologically Rich Languages with (Mostly) Off-The-Shelf Software and Word Vectors".
If you have questions regarding the setup, send me an email.
The Word Vectors
The unlabeled data was converted into "one sentence per line" format:
cleanup.py.txt (remove the .txt, it's an artifact of this wiki)
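For readers without access to the script, the conversion can be sketched roughly as follows. This is not cleanup.py itself. It is a minimal illustration that assumes the unlabeled data is in a tab-separated, CoNLL-style layout (one token per line, word form in the second column, blank line between sentences). The function name `conll_to_sentences` and the `form_column` parameter are hypothetical.

```python
def conll_to_sentences(lines, form_column=1):
    """Yield one space-joined sentence per blank-line-delimited block.

    Assumption: tab-separated input with the word form in column
    `form_column` (0-based). The real cleanup.py may do more or
    different cleaning.
    """
    tokens = []
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            # A blank line closes the current sentence.
            if tokens:
                yield " ".join(tokens)
                tokens = []
            continue
        columns = line.split("\t")
        # Skip malformed lines that do not have a form column.
        if len(columns) > form_column:
            tokens.append(columns[form_column])
    # Flush the last sentence if the input has no trailing blank line.
    if tokens:
        yield " ".join(tokens)


if __name__ == "__main__":
    import sys
    for sentence in conll_to_sentences(sys.stdin):
        print(sentence)
```

Under these assumptions it would be run as a filter, reading the CoNLL-style file on standard input and writing the one-sentence-per-line text that word2vec expects to standard output.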
The word vectors were created as follows:
./word2vec -train $inputfile -output ${vector}-200-cbow.vectors -cbow 1 -size 200 -window 5 -negative 0 -hs 1 -sample 1e-3 -threads 12 -binary 0
./word2vec -train $inputfile -output ${vector}-200-skipgram-5neg.vectors -cbow 0 -size 200 -window 5 -negative 1 -hs 1 -sample 1e-3 -threads 12 -binary 0
The 400 dimension word vector combinations were created with this script:
make_composite_vectors.sh
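The script is not reproduced here. As a rough sketch of what such a combination could look like, the following assumes that each 400-dimension vector is the concatenation of a word's 200-dimension CBOW vector and its 200-dimension skip-gram vector (an assumption based on 200 + 200 = 400, not confirmed by the script). It reads the word2vec text format produced by `-binary 0` above: a `vocab_size dim` header line followed by one `word v1 v2 ...` line per word. The function names `read_vectors` and `concatenate` are hypothetical.

```python
def read_vectors(lines):
    """Parse word2vec text format into (dim, {word: [value strings]}).

    Values are kept as strings so the original number formatting
    is preserved when they are written out again.
    """
    it = iter(lines)
    header = next(it).split()
    dim = int(header[1])
    vectors = {}
    for line in it:
        parts = line.rstrip().split(" ")
        # Skip lines that do not have exactly one word plus `dim` values.
        if len(parts) != dim + 1:
            continue
        vectors[parts[0]] = parts[1:]
    return dim, vectors


def concatenate(lines_a, lines_b):
    """Concatenate two vector sets word by word (assumed combination).

    Only words present in both sets are kept; the order of the first
    set is preserved. Returns output lines including a new header.
    """
    dim_a, vecs_a = read_vectors(lines_a)
    dim_b, vecs_b = read_vectors(lines_b)
    shared = [word for word in vecs_a if word in vecs_b]
    out = ["%d %d" % (len(shared), dim_a + dim_b)]
    for word in shared:
        out.append(" ".join([word] + vecs_a[word] + vecs_b[word]))
    return out
```

Since both vector files above are trained on the same input, their vocabularies should match; restricting to shared words is only a safeguard.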
If you are interested in the word vectors we used, please send me an email. Please note that they are about 4 GB in total.
The Parsers
TurboParser was trained as follows:
/path/to/TurboParser/TurboParser -train -file_train $inputfile -file_model $modelfile
The Mate parser was trained as follows:
java -Xmx25000M -cp anna-3.61.jar is2.parser.Parser -i 20 -train train5k.$language.gold.conll9 -model mate-$language.model
RBGParser was trained as follows:
java -classpath "bin:lib/trove.jar" -Xmx35000m parser.DependencyParser train train-file:../${language^^}_SPMRL/gold/conll/train5k/train5k.$language.gold.conll model-file:$language-$vector.model thread:8 label:true model:full word-vector:../Unlabeled.$language.pred.-$vector.vectors
The Relabeler
The relabeler uses megam.
The Lattice Chooser
(remove the .txt, it's an artifact of this wiki)