The EBMT System: LIN-EBMTREC+
NOTE: If you use any information or files from this page, please cite:
Monica Gavrila, Improving Recombination in a Linear EBMT System by Use of Constraints, Doctoral Thesis, 2012, University of Hamburg
Preprocessing the data
For an overview of the data format, please read the thesis. In general, the preprocessing steps follow those of a Moses-based SMT system: lowercasing, tokenization, etc.
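As a rough illustration of the kind of normalization meant here, the following sketch lowercases a line and separates punctuation from words. It is not the Moses tokenizer and not part of LIN-EBMTREC+; the class name PreprocessSketch and the method normalize are hypothetical, and the splitting rule is a simplifying assumption.

```java
import java.util.Locale;

// Minimal sketch only: lowercasing plus a naive punctuation split.
// Assumption: the real pipeline uses the Moses scripts, which handle many more cases.
class PreprocessSketch {

    static String normalize(String line) {
        // Locale.ROOT avoids locale-specific surprises when lowercasing.
        String lower = line.toLowerCase(Locale.ROOT);
        // Surround every character that is not a letter, digit or whitespace with spaces.
        String spaced = lower.replaceAll("([^\\p{L}\\p{N}\\s])", " $1 ");
        // Collapse runs of whitespace and trim both ends.
        return spaced.replaceAll("\\s+", " ").trim();
    }

    public static void main(String[] args) {
        // prints: the house , is small .
        System.out.println(normalize("The House, is SMALL."));
    }
}
```

Note that this naive rule also splits apostrophes and hyphens, which a real tokenizer would treat more carefully.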
To start the system, some further preprocessed data is needed:
- The corpus file
Example
To get the corpus file in the needed format, run the following Java file:
CorpusCreation1.java
- The language-model (LM) file
Example
To get the LM file in the needed format, run the following Java file:
ExtractLM1.java
Also needed:
MyIndex1.java
- An index file for the source language (SL)
Example
To get the index file in the needed format, run the following Java file:
Index1.java
Also needed:
IndexCreation1.java and
MyIndex1.java
Libraries referenced for parsing XML:
jdom.jar (version 1.0 or later)
The system
To run the system, pass the path and name of the parameter file when starting:
java -jar linebmtrecplus.jar PATH/NAME_PARAMETER_FILE
The JAR file is available here:
linebmtrecplus.jar
Before running the program, list the following parameters in a file, one per line and in this order:
- A line with some text, for example Parameters
- The corpus = PATH/NAME.xml
- The LM file = PATH/NAME.lm
- The input data file (same format as the data for a Moses-based SMT system) = PATH/NAME.input
- The output file (same format as in Moses) = PATH/NAME.output
- An SL index file = PATH/NAME.xml
- The word-alignment information (the same as in a Moses-based SMT system) = PATH/NAME.A3.final (of the two GIZA++ alignments, only the one named TL-SL.A3.final is needed)
- A log file = PATH/NAME.txt
- A file where templates will be saved = PATH/NAME.txt
- The source language = NAME (two-letter code, for example en=English)
- The target language = NAME (two-letter code, for example en=English; further examples are de and ro). The names of the source and target languages must be the ones used throughout the data, for example as the tags in the corpus.
Place the JDOM JAR file, which must be named jdom.jar, in the same folder as the linebmtrecplus.jar file.
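For readers unfamiliar with the word-alignment file mentioned in the parameter list, a GIZA++ A3.final file stores each sentence pair as three lines: a comment line, a sentence, and an alignment line of the form "NULL ({ }) word ({ 1 }) ...", where the numbers are 1-based positions in the sentence on the preceding line. The sketch below parses one such alignment line. It is an illustration only and says nothing about how LIN-EBMTREC+ reads the file internally; the class name A3LineSketch and the method alignedPositions are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Minimal sketch: extract the position lists from one GIZA++ alignment line.
class A3LineSketch {

    // A token followed by its "({ ... })" list of positions (possibly empty).
    private static final Pattern ENTRY =
            Pattern.compile("(\\S+) \\(\\{([\\d ]*)\\}\\)");

    // Returns one list of positions per token, in order; index 0 is the NULL entry.
    static List<List<Integer>> alignedPositions(String alignmentLine) {
        List<List<Integer>> result = new ArrayList<>();
        Matcher m = ENTRY.matcher(alignmentLine);
        while (m.find()) {
            List<Integer> positions = new ArrayList<>();
            for (String p : m.group(2).trim().split("\\s+")) {
                if (!p.isEmpty()) {          // an empty "({ })" yields one empty string
                    positions.add(Integer.parseInt(p));
                }
            }
            result.add(positions);
        }
        return result;
    }

    public static void main(String[] args) {
        String line = "NULL ({ }) the ({ 1 }) house ({ 2 3 })";
        // prints: [[], [1], [2, 3]]
        System.out.println(alignedPositions(line));
    }
}
```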
Examples of parameters:
TEST DATA
corpus/corpus-en-ro.xml
extras.languageModel.lm
evaluation/roger.input
evaluation/Roger2.output
extras/index-en.xml
extras/ro-en.A3.final
evaluation/logtest.txt
evaluation/templtest.txt
en
ro
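A parameter file like the one above can be sanity-checked before starting the system. The sketch below only verifies the shape described on this page: eleven lines, no empty value, and two-letter language codes. It is a hypothetical helper, not part of linebmtrecplus.jar; the class name ParameterFileSketch, the labels, and the strictness of the checks are assumptions, and the real program may accept or reject files differently.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Minimal sketch: check that a parameter file has the expected eleven lines.
class ParameterFileSketch {

    // Hypothetical labels for the lines, in the order given on this page.
    static final List<String> LABELS = Arrays.asList(
            "header", "corpus", "languageModel", "input", "output",
            "slIndex", "alignment", "log", "templates",
            "sourceLanguage", "targetLanguage");

    // Returns the problems found; an empty list means the basic shape is fine.
    static List<String> check(List<String> lines) {
        List<String> problems = new ArrayList<>();
        if (lines.size() != LABELS.size()) {
            problems.add("expected " + LABELS.size() + " lines, found " + lines.size());
            return problems;
        }
        for (int i = 0; i < lines.size(); i++) {
            if (lines.get(i).trim().isEmpty()) {
                problems.add("line " + (i + 1) + " (" + LABELS.get(i) + ") is empty");
            }
        }
        // The last two lines are the language codes, described as two-letter codes.
        for (int i = 9; i <= 10; i++) {
            if (!lines.get(i).trim().matches("[a-z]{2}")) {
                problems.add(LABELS.get(i) + " is not a two-letter code: " + lines.get(i));
            }
        }
        return problems;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: java ParameterFileSketch PATH/NAME_PARAMETER_FILE");
            return;
        }
        List<String> problems = check(Files.readAllLines(Paths.get(args[0])));
        System.out.println(problems.isEmpty() ? "parameter file looks fine" : problems);
    }
}
```

The check does not test whether the listed files exist, since paths in the parameter file may be relative to the folder from which the JAR is started.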
Using LIN-EBMTREC+
* The system is used in the ATLAS Project