The EBMT System: LIN-EBMTREC+

NOTE: If you use any information or files from this page, please cite:
Monica Gavrila, Improving Recombination in a Linear EBMT System by Use of Constraints, Doctoral Thesis, 2012, University of Hamburg

Preprocessing the data

For an overview of the data format, please read the thesis. In general, the preprocessing steps follow those of a Moses-based SMT system: lowercasing, tokenization, etc.
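
As a rough illustration of the two steps named above, here is a minimal Java sketch that lowercases a line and separates punctuation from words. It is a simplified stand-in, not the Moses scripts and not part of LIN-EBMTREC+: the class name SimplePreprocessor and its rules are assumptions, and it handles abbreviations, apostrophes and numbers far more crudely than the Moses tokenizer does.

```java
import java.util.Locale;

/** Simplified stand-in for Moses-style lowercasing and tokenization (hypothetical helper). */
final class SimplePreprocessor {

    /** Lowercases a line and separates punctuation from words with single spaces. */
    static String preprocess(String line) {
        // Locale.ROOT keeps the lowercasing independent of the machine's default locale.
        String lower = line.toLowerCase(Locale.ROOT);
        // Put spaces around every character that is not a letter, digit or whitespace.
        String spaced = lower.replaceAll("([^\\p{L}\\p{N}\\s])", " $1 ");
        // Trim the ends and collapse runs of whitespace into one space.
        return spaced.trim().replaceAll("\\s+", " ");
    }

    public static void main(String[] args) {
        // Prints: the commission , however , disagrees .
        System.out.println(preprocess("The Commission, however, disagrees."));
    }
}
```

For real experiments, the same tools should be applied to the corpus, the input file and the data used for word alignment, so that the tokens match everywhere.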

Before the system can be started, some additional preprocessed data is needed:

  • The corpus
Example
Index.jpg

To get the corpus file in the needed format, please run the following Java file: CorpusCreation1.java

  • The language-model(LM) file
Example
Index.jpg

To get the LM file in the needed format, please run the following Java file: ExtractLM1.java. MyIndex1.java is also needed.

  • An index file for the source language (SL)
Example
Index.jpg

To get the index file in the needed format, please run the following Java file: Index1.java. IndexCreation1.java and MyIndex1.java are also needed.

Referenced library for parsing XML: jdom.jar (minimum required version: 1.0)

The system

To run the system, provide the parameter file as an argument when starting:

java -jar linebmtrecplus.jar PATH/NAME_PARAMETER_FILE

The JAR file is available here: linebmtrecplus.jar

Before running the program, please list the following parameters in a file, one parameter per line:
  • A line with some text, for example: Parameters
  • The corpus = PATH/NAME.xml
  • The LM file = PATH/NAME.lm
  • The input data file (same format as the data for a Moses-based SMT system) = PATH/NAME.input
  • The output file (same format as in Moses) = PATH/NAME.output
  • An SL index file = PATH/NAME.xml
  • The word-alignment information (the same as in a Moses-based SMT system) = PATH/NAME.A3.final (of the two GIZA++ alignments, only the one named TL-SL.A3.final is needed)
  • A log file = PATH/NAME.txt
  • A file where templates will be saved = PATH/NAME.txt
  • The source language = NAME (two-letter code, for example en=English)
  • The target language = NAME (two-letter code, for example ro=Romanian; further examples are de, en, etc.). The names for the source and target language should be the ones used throughout the data, for example as the tags in the corpus.
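
For readers unfamiliar with the word-alignment file: GIZA++ writes each sentence pair to an A3.final file as three lines. The first is a comment line starting with #, the second is one sentence as plain tokens, and the third is the other sentence with each token followed by ({ ... }), listing the 1-based positions of the linked tokens in the second line. The following Java sketch parses such a third line. How LIN-EBMTREC+ reads the file internally is not described on this page; the class A3LineParser, its nested class and the sample line are hypothetical and only illustrate the format.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Parses the alignment line of a GIZA++ A3.final entry (hypothetical helper). */
final class A3LineParser {

    /** One token of the alignment line and the 1-based positions it is linked to. */
    static final class AlignedToken {
        final String token;
        final List<Integer> positions;

        AlignedToken(String token, List<Integer> positions) {
            this.token = token;
            this.positions = positions;
        }
    }

    // Matches "token ({ 1 2 })": a token followed by a braced list of positions.
    private static final Pattern ENTRY = Pattern.compile("(\\S+) \\(\\{([^}]*)\\}\\)");

    static List<AlignedToken> parse(String alignmentLine) {
        List<AlignedToken> result = new ArrayList<>();
        Matcher m = ENTRY.matcher(alignmentLine);
        while (m.find()) {
            List<Integer> positions = new ArrayList<>();
            // An empty list "({ })" yields one empty string, which is skipped.
            for (String p : m.group(2).trim().split("\\s+")) {
                if (!p.isEmpty()) {
                    positions.add(Integer.parseInt(p));
                }
            }
            result.add(new AlignedToken(m.group(1), positions));
        }
        return result;
    }

    public static void main(String[] args) {
        // Made-up sample line, for illustration only.
        String line = "NULL ({ }) the ({ 1 }) house ({ 2 }) is ({ 3 }) small ({ 4 })";
        for (AlignedToken t : parse(line)) {
            System.out.println(t.token + " -> " + t.positions);
        }
    }
}
```

The NULL entry at the start of the line collects the positions that are not linked to any token.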

Also put the JDOM .jar file in the same folder as the linebmtrecplus .jar file; it should be named jdom.jar.

Examples of parameters:

TEST DATA
corpus/corpus-en-ro.xml
extras.languageModel.lm
evaluation/roger.input
evaluation/Roger2.output
extras/index-en.xml
extras/ro-en.A3.final
evaluation/logtest.txt
evaluation/templtest.txt
en
ro
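
Because the parameters are identified only by their position in the file, a missing or swapped line is easy to overlook. The following Java sketch reads a parameter file and labels its eleven lines so that they can be checked by eye before the system is started. It is a hypothetical helper, not part of linebmtrecplus.jar: the class name ParameterFileCheck and the key names are assumptions, and the file is assumed to be UTF-8 encoded.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Labels the lines of a parameter file by position (hypothetical helper). */
final class ParameterFileCheck {

    // Hypothetical labels for the eleven lines, in the order given on this page.
    static final String[] KEYS = {
        "title", "corpus", "languageModel", "input", "output", "slIndex",
        "alignment", "log", "templates", "sourceLanguage", "targetLanguage"
    };

    /** Maps each expected line to its label; fails if lines are missing. */
    static Map<String, String> parse(List<String> lines) {
        if (lines.size() < KEYS.length) {
            throw new IllegalArgumentException(
                "Expected " + KEYS.length + " lines but found " + lines.size());
        }
        Map<String, String> params = new LinkedHashMap<>();
        for (int i = 0; i < KEYS.length; i++) {
            params.put(KEYS[i], lines.get(i).trim());
        }
        return params;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: java ParameterFileCheck PATH/NAME_PARAMETER_FILE");
            return;
        }
        // Assumption: the parameter file is UTF-8 encoded.
        List<String> lines = Files.readAllLines(Paths.get(args[0]), StandardCharsets.UTF_8);
        parse(lines).forEach((key, value) -> System.out.println(key + " = " + value));
    }
}
```

Run on the example above, it would pair corpus with corpus/corpus-en-ro.xml, sourceLanguage with en and targetLanguage with ro.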

Using LIN-EBMTREC+

* The System is used in the ATLAS Project
 