# mate-tools
Using the **Full srl pipeline (including anna-3.3)** package from the Downloads section of the mate-tools project page [1].
Benchmarking of the tool for Slovenian (slo) and Croatian (hr): [2] (submodule of this repo).
Mate-tool for srl tagging can be found in `./tools/srl-20131216/`.
## train
Create the `model-file`:
`--help` output:
```bash
$ java -cp srl.jar se.lth.cs.srl.Learn --help
Not enough arguments, aborting.
Usage:
java -cp <classpath> se.lth.cs.srl.Learn <lang> <input-corpus> <model-file> [options]
Example:
java -cp srl.jar:lib/liblinear-1.51-with-deps.jar se.lth.cs.srl.Learn eng ~/corpora/eng/CoNLL2009-ST-English-train.txt eng-srl.mdl -reranker -fdir ~/features/eng -llbinary ~/liblinear-1.6/train
trains a complete pipeline and reranker based on the corpus and saves it to eng-srl.mdl
<lang> corresponds to the language and is one of
chi, eng, ger
Options:
-aibeam <int> the size of the ai-beam for the reranker
-acbeam <int> the size of the ac-beam for the reranker
-help prints this message
Learning-specific options:
-fdir <dir> the directory with feature files (see below)
-reranker trains a reranker also (not done by default)
-llbinary <file> a reference to a precompiled version of liblinear,
makes training much faster than the java version.
-partitions <int> number of partitions used for the reranker
-dontInsertGold don't insert the gold standard proposition during
training of the reranker.
-skipUnknownPredicates skips predicates not matching any POS-tags from
the feature files.
-dontDeleteTrainData doesn't delete the temporary files from training
on exit. (For debug purposes)
-ndPipeline Causes the training data and feature mappings to be
derived in a non-deterministic way. I.e. training the pipeline
on the same corpus twice does not yield the exact same models.
This is however slightly faster.
The feature file dir needs to contain four files with feature sets. See
the website for further documentation. The files are called
pi.feats, pd.feats, ai.feats, and ac.feats
All need to be in the feature file dir, otherwise you will get an error.
```
Required input: `lang` and `input-corpus`; the trained model is written to `model-file`.
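
A minimal sketch of a training call, assuming the jar layout from the help example (`srl.jar` plus `lib/liblinear-1.51-with-deps.jar`) sits inside `./tools/srl-20131216/`. The `ger` language code, the model name `sl-srl.mdl` and the feature directory `./features/sl` are placeholders, not values confirmed by this repo. The helper names `missing_feats` and `train_cmd` are hypothetical. The script only prints the command so it can be inspected before running it.

```bash
#!/usr/bin/env bash
# Sketch: assemble a se.lth.cs.srl.Learn call and print it (dry run).
# All paths below are assumptions; adjust them to the local checkout.
SRL_DIR="./tools/srl-20131216"
CLASSPATH_SRL="${SRL_DIR}/srl.jar:${SRL_DIR}/lib/liblinear-1.51-with-deps.jar"

# Feature-set files that the --help output says must all be present.
FEATURE_FILES="pi.feats pd.feats ai.feats ac.feats"

# missing_feats <fdir>: print each required feature file absent from <fdir>.
missing_feats() {
  local fdir="$1" f
  for f in $FEATURE_FILES; do
    [ -f "${fdir}/${f}" ] || echo "$f"
  done
}

# train_cmd <lang> <input-corpus> <model-file> <fdir>: print the command line.
train_cmd() {
  echo "java -cp ${CLASSPATH_SRL} se.lth.cs.srl.Learn $1 $2 $3 -fdir $4"
}

FDIR="./features/sl"   # hypothetical feature directory
for f in $(missing_feats "$FDIR"); do
  echo "warning: ${FDIR}/${f} not found" >&2
done
train_cmd ger ./bilateral-srl/tools/mate-tools/sl.train.mate sl-srl.mdl "$FDIR"
```

Once the printed command looks right, it can be pasted into the shell or run with `eval "$(train_cmd ...)"`.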
## parse
`--help` output:
```bash
$ java -cp srl.jar se.lth.cs.srl.Parse --help
Not enough arguments, aborting.
Usage:
java -cp <classpath> se.lth.cs.srl.Parse <lang> <input-corpus> <model-file> [options] <output>
Example:
java -cp srl.jar:lib/liblinear-1.51-with-deps.jarse.lth.cs.srl.Parse eng ~/corpora/eng/CoNLL2009-ST-English-evaluation-SRLonly.txt eng-srl.mdl -reranker -nopi -alfa 1.0 eng-eval.out
parses in the input corpus using the model eng-srl.mdl and saves it to eng-eval.out, using a reranker and skipping the predicate identification step
<lang> corresponds to the language and is one of
chi, eng, ger
Options:
-aibeam <int> the size of the ai-beam for the reranker
-acbeam <int> the size of the ac-beam for the reranker
-help prints this message
Parsing-specific options:
-nopi skips the predicate identification. This is equivalent to the
setting in the CoNLL 2009 ST.
-reranker uses a reranker (assumed to be included in the model)
-alfa <double> the alfa used by the reranker. (default 1.0)
```
We need to provide `lang` (possibly `ger`, for the German feature functions; this is still an open question), `input-corpus`, `model-file` (see train) and an `output` file.
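
A minimal sketch of a parse call that follows the argument order of the usage line (options go between the model file and the output file). The jar locations, the `ger` language code, the model name `sl-srl.mdl` and the output name `sl.test.out` are assumptions. `parse_cmd` is a hypothetical helper that only prints the command, and `-nopi` is included just to show where options are placed.

```bash
#!/usr/bin/env bash
# Sketch: assemble a se.lth.cs.srl.Parse call and print it (dry run).
# All paths below are assumptions; adjust them to the local checkout.
SRL_DIR="./tools/srl-20131216"
CLASSPATH_SRL="${SRL_DIR}/srl.jar:${SRL_DIR}/lib/liblinear-1.51-with-deps.jar"

# parse_cmd <lang> <input-corpus> <model-file> <output> [options...]
parse_cmd() {
  local lang="$1" corpus="$2" model="$3" output="$4"
  shift 4
  # Remaining arguments are options; they are placed before the output file.
  local args=(java -cp "$CLASSPATH_SRL" se.lth.cs.srl.Parse
              "$lang" "$corpus" "$model" "$@" "$output")
  echo "${args[*]}"
}

parse_cmd ger ./bilateral-srl/tools/mate-tools/sl.test.mate sl-srl.mdl sl.test.out -nopi
```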
## input data
* `ssj500k` data found in `./bilateral-srl/data/sl/sl.{test,train}`;
  formatted for mate-tools usage in `./bilateral-srl/tools/mate-tools/sl.{test,train}.mate` (line counts match).
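
The line-count check mentioned above can be repeated with a short script. This is a sketch that assumes the `bilateral-srl` submodule is checked out at the paths listed; `same_line_count` is a hypothetical helper name.

```bash
#!/usr/bin/env bash
# same_line_count <file-a> <file-b>: succeed only if both files are readable
# and have the same number of lines.
same_line_count() {
  local a b
  a=$(wc -l < "$1") || return 2
  b=$(wc -l < "$2") || return 2
  [ "$a" -eq "$b" ]
}

for split in test train; do
  src="./bilateral-srl/data/sl/sl.${split}"
  mate="./bilateral-srl/tools/mate-tools/sl.${split}.mate"
  if same_line_count "$src" "$mate"; then
    echo "${split}: line counts match"
  else
    echo "${split}: line counts differ or a file is missing" >&2
  fi
done
```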
## Sources
* [1] (mate-tools) https://code.google.com/archive/p/mate-tools/
* [2] (benchmarking) https://github.com/clarinsi/bilateral-srl
* [3] (CoNLL 2008 paper) http://www.aclweb.org/anthology/W08-2121.pdf
* [4] (format CoNLL 2009) https://wiki.ufal.ms.mff.cuni.cz/format-conll