added some data and updated README.md

2019-01-29 07:55:13 +01:00 · 0a27168ddf
commit 0a27168ddf
parent 2755f4c7ed
8 changed files with 214437 additions and 2 deletions
--- a/.gitmodules
+++ b/.gitmodules
@@ -0,0 +1,3 @@
+[submodule "bilateral-srl"]
+	path = bilateral-srl
+	url = https://github.com/clarinsi/bilateral-srl.git
--- a/README.md
+++ b/README.md
@@ -1,9 +1,95 @@
 # cjvt-srl-tagging
-We'll be using mate-tools to perform SRL on Kres.
+We'll be using mate-tools to perform SRL on Kres. 
+
+## workspace
+The tools require Java. 
+See `./dockerfiles/mate-tool-env/README.md` for environment preparation. 

 ## mate-tools
 Using **Full srl pipeline (including anna-3.3)** from the Downloads section. 
-Benchmarking the tool for slo and hr: [2]. 
+Benchmarking the tool for slo and hr: [2] (submodule of this repo). 
+
+Mate-tool for srl tagging can be found in `./tools/srl-20131216/`. 
+
+### train
+Create the `model-file`:
+
+`--help` output:
+```bash
+java -cp srl.jar se.lth.cs.srl.Learn --help
+Not enough arguments, aborting.
+Usage:
+ java -cp <classpath> se.lth.cs.srl.Learn <lang> <input-corpus> <model-file> [options]
+
+Example:
+ java -cp srl.jar:lib/liblinear-1.51-with-deps.jar se.lth.cs.srl.Learn eng ~/corpora/eng/CoNLL2009-ST-English-train.txt eng-srl.mdl -reranker -fdir ~/features/eng -llbinary ~/liblinear-1.6/train
+
+ trains a complete pipeline and reranker based on the corpus and saves it to eng-srl.mdl
+
+<lang> corresponds to the language and is one of
+ chi, eng, ger
+
+Options:
+ -aibeam <int>     the size of the ai-beam for the reranker
+ -acbeam <int>     the size of the ac-beam for the reranker
+ -help             prints this message
+
+Learning-specific options:
+ -fdir <dir>             the directory with feature files (see below)
+ -reranker               trains a reranker also (not done by default)
+ -llbinary <file>        a reference to a precompiled version of liblinear,
+                         makes training much faster than the java version.
+ -partitions <int>       number of partitions used for the reranker
+ -dontInsertGold         don't insert the gold standard proposition during
+                         training of the reranker.
+ -skipUnknownPredicates  skips predicates not matching any POS-tags from
+                         the feature files.
+ -dontDeleteTrainData    doesn't delete the temporary files from training
+                         on exit. (For debug purposes)
+ -ndPipeline             Causes the training data and feature mappings to be
+                         derived in a non-deterministic way. I.e. training the pipeline
+                         on the same corpus twice does not yield the exact same models.
+                         This is however slightly faster.
+
+The feature file dir needs to contain four files with feature sets. See
+the website for further documentation. The files are called
+pi.feats, pd.feats, ai.feats, and ac.feats
+All need to be in the feature file dir, otherwise you will get an error.
+```
+Input: `lang`, `input-corpus`. 
+
+### parse
+`--help` output:
+```bash
+$ java -cp srl.jar se.lth.cs.srl.Parse --help
+Not enough arguments, aborting.
+Usage:
+ java -cp <classpath> se.lth.cs.srl.Parse <lang> <input-corpus> <model-file> [options] <output>
+
+Example:
+ java -cp srl.jar:lib/liblinear-1.51-with-deps.jarse.lth.cs.srl.Parse eng ~/corpora/eng/CoNLL2009-ST-English-evaluation-SRLonly.txt eng-srl.mdl -reranker -nopi -alfa 1.0 eng-eval.out
+
+ parses in the input corpus using the model eng-srl.mdl and saves it to eng-eval.out, using a reranker and skipping the predicate identification step
+
+<lang> corresponds to the language and is one of
+ chi, eng, ger
+
+Options:
+ -aibeam <int>     the size of the ai-beam for the reranker
+ -acbeam <int>     the size of the ac-beam for the reranker
+ -help             prints this message
+
+Parsing-specific options:
+ -nopi           skips the predicate identification. This is equivalent to the
+                 setting in the CoNLL 2009 ST.
+ -reranker       uses a reranker (assumed to be included in the model)
+ -alfa <double>  the alfa used by the reranker. (default 1.0)
+```
+We need to provide `lang` (`ger` for German feature functions?), `input-corpus` and `model` (see train). 
+
+### input data:
+* `ssj500k` data found in `./bilateral-srl/data/sl/sl.{test,train}`;
+formatted for mate-tools usage in `./bilateral-srl/tools/mate-tools/sl.{test,train}.mate` (line counts match);

 ## Sources
 [1] (mate-tools) https://code.google.com/archive/p/mate-tools/
--- a/1
+++ b/1
@@ -0,0 +1 @@
+Subproject commit 86642e18665ce5304f57b3e38ef18057807eaae8
--- a/data/sl.all.mate
+++ b/data/sl.all.mate
--- a/data/sl.test.mate
+++ b/data/sl.test.mate
--- a/data/sl.train.mate
+++ b/data/sl.train.mate
--- a/dockerfiles/mate-tool-env/Dockerfile
+++ b/dockerfiles/mate-tool-env/Dockerfile
@@ -1 +1,5 @@
 FROM java
+
+RUN apt-get update
+RUN apt-get install -y \
+vim
--- a/tools/srl-20131216/scripts/parse_srl_only_mod.sh
+++ b/tools/srl-20131216/scripts/parse_srl_only_mod.sh
@@ -0,0 +1,37 @@
+#!/bin/sh
+
+## There are three sets of options that need, may need to, and could be changed.
+## (1) deals with input and output. You have to set these (in particular, you need to provide a training corpus)
+## (2) deals with the jvm parameters and may need to be changed
+## (3) deals with the behaviour of the system
+
+## For further information on switches, see the source code, or run
+## java -cp srl.jar se.lth.cs.srl.Parse --help
+
+##################################################
+## (1) The following needs to be set appropriately
+##################################################
+INPUT=~/corpora/conll09/spa/CoNLL2009-ST-evaluation-Spanish-SRLonly.txt
+Lang="spa"
+MODEL="./srl-spa.model"
+OUTPUT="${Lang}-eval.out"
+
+##################################################
+## (2) These ones may need to be changed
+##################################################
+JAVA="java" #Edit this if you want to use a specific java binary.
+MEM="2g" #Memory for the JVM, might need to be increased for large corpora.
+CP="srl.jar:lib/liblinear-1.51-with-deps.jar:lib/anna.jar"
+JVM_ARGS="-cp $CP -Xmx$MEM"
+
+##################################################
+## (3) The following changes the behaviour of the system
+##################################################
+#RERANKER="-reranker" #Uncomment this if you want to use a reranker too. The model is assumed to contain a reranker. While training, this has to be set appropriately.
+NOPI="-nopi" #Uncomment this if you want to skip the predicate identification step. This setting is equivalent to the CoNLL 2009 ST.
+
+
+CMD="$JAVA $JVM_ARGS se.lth.cs.srl.Parse $Lang $INPUT $MODEL $RERANKER $NOPI $OUTPUT"
+echo "Executing: $CMD"
+
+$CMD
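
Note on `parse_srl_only_mod.sh`: the script still carries the upstream Spanish defaults (`INPUT`, `Lang="spa"`, `MODEL="./srl-spa.model"`), while this commit adds `data/sl.{train,test}.mate`. A minimal dry-run sketch of how the `Learn` and `Parse` usage lines from the README could be pointed at that data is given below. It only prints the commands. The values `ger` (the README itself marks this as a guess), `srl-sl.model`, `sl-eval.out`, and the relative `../../data` paths (assuming the commands are run from `tools/srl-20131216/`) are assumptions, not part of the commit. Training may additionally need `-fdir <dir>` with the four feature files described in the `Learn` help text.

```shell
#!/bin/sh
# Dry-run sketch: builds and prints the Learn/Parse commands, does not run java.

CP="srl.jar:lib/liblinear-1.51-with-deps.jar:lib/anna.jar"  # classpath as in the script above
MEM="2g"                          # JVM memory, same default as the script above
LANG_CODE="ger"                   # assumption: README only guesses at `ger`
TRAIN="../../data/sl.train.mate"  # assumption: run from tools/srl-20131216/
TEST="../../data/sl.test.mate"    # assumption: run from tools/srl-20131216/
MODEL="srl-sl.model"              # hypothetical model file name
OUTPUT="sl-eval.out"              # hypothetical output file name

# Usage: Learn <lang> <input-corpus> <model-file> [options]
train_cmd() {
    echo "java -cp $CP -Xmx$MEM se.lth.cs.srl.Learn $LANG_CODE $TRAIN $MODEL"
}

# Usage: Parse <lang> <input-corpus> <model-file> [options] <output>
parse_cmd() {
    echo "java -cp $CP -Xmx$MEM se.lth.cs.srl.Parse $LANG_CODE $TEST $MODEL -nopi $OUTPUT"
}

train_cmd
parse_cmd
```

Printing the commands first makes it easy to check the argument order against the two `--help` usage lines before starting a long training run.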