cjvt-srl-tagging

We'll be using mate-tools to perform semantic role labeling (SRL) on the Kres corpus.

workspace

The tools require Java. See ./dockerfiles/mate-tool-env/README.md for environment preparation.

mate-tools

We use the full SRL pipeline (including anna-3.3) from the Downloads section of [1]. The tool has been benchmarked for Slovenian and Croatian in [2] (a submodule of this repo).

The mate-tools SRL tagger can be found in ./tools/srl-20131216/.

train

Create the model-file:

--help output:

$ java -cp srl.jar se.lth.cs.srl.Learn --help
Not enough arguments, aborting.
Usage:
 java -cp <classpath> se.lth.cs.srl.Learn <lang> <input-corpus> <model-file> [options]

Example:
 java -cp srl.jar:lib/liblinear-1.51-with-deps.jar se.lth.cs.srl.Learn eng ~/corpora/eng/CoNLL2009-ST-English-train.txt eng-srl.mdl -reranker -fdir ~/features/eng -llbinary ~/liblinear-1.6/train

 trains a complete pipeline and reranker based on the corpus and saves it to eng-srl.mdl

<lang> corresponds to the language and is one of
 chi, eng, ger

Options:
 -aibeam <int>     the size of the ai-beam for the reranker
 -acbeam <int>     the size of the ac-beam for the reranker
 -help             prints this message

Learning-specific options:
 -fdir <dir>             the directory with feature files (see below)
 -reranker               trains a reranker also (not done by default)
 -llbinary <file>        a reference to a precompiled version of liblinear,
                         makes training much faster than the java version.
 -partitions <int>       number of partitions used for the reranker
 -dontInsertGold         don't insert the gold standard proposition during
                         training of the reranker.
 -skipUnknownPredicates  skips predicates not matching any POS-tags from
                         the feature files.
 -dontDeleteTrainData    doesn't delete the temporary files from training
                         on exit. (For debug purposes)
 -ndPipeline             Causes the training data and feature mappings to be
                         derived in a non-deterministic way. I.e. training the pipeline
                         on the same corpus twice does not yield the exact same models.
                         This is however slightly faster.

The feature file dir needs to contain four files with feature sets. See
the website for further documentation. The files are called
pi.feats, pd.feats, ai.feats, and ac.feats
All need to be in the feature file dir, otherwise you will get an error.
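
Since the help text says training fails unless all four feature files are present, a quick pre-flight check can save a wasted run. This is a minimal sketch, not part of mate-tools: the function name check_feature_dir is made up for illustration, and only the four file names come from the help output above.

```shell
# check_feature_dir is a hypothetical helper, not part of mate-tools.
# It returns 0 only if the directory holds all four feature files
# named in the help text: pi.feats, pd.feats, ai.feats, ac.feats.
check_feature_dir() {
    fdir="$1"
    missing=0
    for name in pi.feats pd.feats ai.feats ac.feats; do
        if [ ! -f "$fdir/$name" ]; then
            # Report each missing file on stderr and remember the failure.
            echo "missing: $fdir/$name" >&2
            missing=1
        fi
    done
    return "$missing"
}
```

Call it with the directory you intend to pass to -fdir, for example check_feature_dir ./features/sl (a hypothetical path), and start training only if it succeeds.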

Input: lang and input-corpus. The model-file argument names the file the trained model is saved to.
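
The usage line above could translate into a training call along these lines. This is a sketch under several assumptions: the jar locations inside ./tools/srl-20131216/ mirror the help example but are not confirmed here, the language code ger is only a placeholder for one of the three supported values, and the model and feature directory paths are hypothetical. The function only prints the command so it can be inspected before running.

```shell
# Assumed layout: srl.jar and lib/ sit directly in the tool directory.
SRL_DIR=./tools/srl-20131216
SRL_CP="$SRL_DIR/srl.jar:$SRL_DIR/lib/liblinear-1.51-with-deps.jar"
SRL_LANG=ger                  # placeholder: must be one of chi, eng, ger
TRAIN_CORPUS=./bilateral-srl/tools/mate-tools/sl.train.mate
MODEL_FILE=./sl-srl.mdl       # hypothetical output name
FEATURE_DIR=./features/sl     # hypothetical feature file dir

# Print the Learn command following the documented usage:
#   se.lth.cs.srl.Learn <lang> <input-corpus> <model-file> [options]
learn_cmd() {
    echo "java -cp $SRL_CP se.lth.cs.srl.Learn $SRL_LANG $TRAIN_CORPUS $MODEL_FILE -fdir $FEATURE_DIR"
}

learn_cmd
```

Once the printed command looks right, it can be executed by piping it to a shell (learn_cmd | sh).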

parse

--help output:

$ java -cp srl.jar se.lth.cs.srl.Parse --help
Not enough arguments, aborting.
Usage:
 java -cp <classpath> se.lth.cs.srl.Parse <lang> <input-corpus> <model-file> [options] <output>

Example:
 java -cp srl.jar:lib/liblinear-1.51-with-deps.jar se.lth.cs.srl.Parse eng ~/corpora/eng/CoNLL2009-ST-English-evaluation-SRLonly.txt eng-srl.mdl -reranker -nopi -alfa 1.0 eng-eval.out

 parses in the input corpus using the model eng-srl.mdl and saves it to eng-eval.out, using a reranker and skipping the predicate identification step

<lang> corresponds to the language and is one of
 chi, eng, ger

Options:
 -aibeam <int>     the size of the ai-beam for the reranker
 -acbeam <int>     the size of the ac-beam for the reranker
 -help             prints this message

Parsing-specific options:
 -nopi           skips the predicate identification. This is equivalent to the
                 setting in the CoNLL 2009 ST.
 -reranker       uses a reranker (assumed to be included in the model)
 -alfa <double>  the alfa used by the reranker. (default 1.0)

We need to provide lang (ger, for the German feature functions?), input-corpus, the model-file created in the train step, and an output path.
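
A parse call for the Slovenian test data might then be assembled as follows. As with any sketch here, the jar locations, the ger language code, the model name and the output name are assumptions or placeholders, not values confirmed by this repo. Extra options from the help text (such as -nopi) are passed as arguments and placed before the output path, as the usage line requires. The function only prints the command.

```shell
# Assumed layout: srl.jar and lib/ sit directly in the tool directory.
SRL_DIR=./tools/srl-20131216
SRL_CP="$SRL_DIR/srl.jar:$SRL_DIR/lib/liblinear-1.51-with-deps.jar"
SRL_LANG=ger                  # placeholder: must be one of chi, eng, ger
TEST_CORPUS=./bilateral-srl/tools/mate-tools/sl.test.mate
MODEL_FILE=./sl-srl.mdl       # hypothetical: the model written by Learn
OUTPUT_FILE=./sl.test.out     # hypothetical output name

# Print the Parse command following the documented usage:
#   se.lth.cs.srl.Parse <lang> <input-corpus> <model-file> [options] <output>
parse_cmd() {
    opts="$*"                 # optional extra flags, joined by spaces
    echo "java -cp $SRL_CP se.lth.cs.srl.Parse $SRL_LANG $TEST_CORPUS $MODEL_FILE${opts:+ $opts} $OUTPUT_FILE"
}

parse_cmd -nopi
```

Calling parse_cmd without arguments yields the same command with no options between the model file and the output path.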

input data:

The ssj500k data is found in ./bilateral-srl/data/sl/sl.{test,train}; the same data, formatted for mate-tools usage, is in ./bilateral-srl/tools/mate-tools/sl.{test,train}.mate (line counts match).
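
The line-count claim can be re-checked with a small helper like the one below. It is a sketch: same_line_count is a made-up name, and matching counts only show that the two files have equally many lines, not that the conversion is correct.

```shell
# same_line_count is a hypothetical helper.
# Exit status: 0 if both files have the same number of lines,
# 1 if the counts differ, 2 if either file cannot be read.
same_line_count() {
    a=$(wc -l < "$1") || return 2
    b=$(wc -l < "$2") || return 2
    [ "$a" -eq "$b" ]
}
```

For example, same_line_count ./bilateral-srl/data/sl/sl.train ./bilateral-srl/tools/mate-tools/sl.train.mate should succeed for the train split, and likewise for the test split.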

Sources

[1] (mate-tools) https://code.google.com/archive/p/mate-tools/
[2] (benchmarking) https://github.com/clarinsi/bilateral-srl
