added some data and updated README.md

per-file
voje 5 years ago
parent 2755f4c7ed
commit 0a27168ddf

3
.gitmodules vendored

@@ -0,0 +1,3 @@
[submodule "bilateral-srl"]
path = bilateral-srl
url = https://github.com/clarinsi/bilateral-srl.git

@@ -1,9 +1,95 @@
# cjvt-srl-tagging
We'll be using mate-tools to perform SRL on Kres.
## workspace
The tools require Java.
See `./dockerfiles/mate-tool-env/README.md` for environment preparation.
## mate-tools
Using **Full srl pipeline (including anna-3.3)** from the Downloads section.
Benchmarking the tool for slo and hr: [2] (submodule of this repo).
Mate-tool for srl tagging can be found in `./tools/srl-20131216/`.
### train
Create the `model-file`:
`--help` output:
```bash
java -cp srl.jar se.lth.cs.srl.Learn --help
Not enough arguments, aborting.
Usage:
java -cp <classpath> se.lth.cs.srl.Learn <lang> <input-corpus> <model-file> [options]
Example:
java -cp srl.jar:lib/liblinear-1.51-with-deps.jar se.lth.cs.srl.Learn eng ~/corpora/eng/CoNLL2009-ST-English-train.txt eng-srl.mdl -reranker -fdir ~/features/eng -llbinary ~/liblinear-1.6/train
trains a complete pipeline and reranker based on the corpus and saves it to eng-srl.mdl
<lang> corresponds to the language and is one of
chi, eng, ger
Options:
-aibeam <int> the size of the ai-beam for the reranker
-acbeam <int> the size of the ac-beam for the reranker
-help prints this message
Learning-specific options:
-fdir <dir> the directory with feature files (see below)
-reranker trains a reranker also (not done by default)
-llbinary <file> a reference to a precompiled version of liblinear,
makes training much faster than the java version.
-partitions <int> number of partitions used for the reranker
-dontInsertGold don't insert the gold standard proposition during
training of the reranker.
-skipUnknownPredicates skips predicates not matching any POS-tags from
the feature files.
-dontDeleteTrainData doesn't delete the temporary files from training
on exit. (For debug purposes)
-ndPipeline Causes the training data and feature mappings to be
derived in a non-deterministic way. I.e. training the pipeline
on the same corpus twice does not yield the exact same models.
This is however slightly faster.
The feature file dir needs to contain four files with feature sets. See
the website for further documentation. The files are called
pi.feats, pd.feats, ai.feats, and ac.feats
All need to be in the feature file dir, otherwise you will get an error.
```
Input: `lang`, `input-corpus`.
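By analogy with the English example in the help text, a training run on the Slovene data in this repo could be assembled as below. This is a sketch only: the `ger` language code is a guess (Slovene is not among `chi, eng, ger`; see the note under parse), and the paths and model name are assumptions.

```shell
# Sketch: build the Learn command for the Slovene training split.
# ASSUMPTIONS: "ger" language code, repo-relative paths, model name "sl-srl.mdl".
CP="srl.jar:lib/liblinear-1.51-with-deps.jar"
TRAIN="./bilateral-srl/tools/mate-tools/sl.train.mate"
CMD="java -cp $CP se.lth.cs.srl.Learn ger $TRAIN sl-srl.mdl -reranker"
echo "$CMD"
```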
### parse
`--help` output:
```bash
$ java -cp srl.jar se.lth.cs.srl.Parse --help
Not enough arguments, aborting.
Usage:
java -cp <classpath> se.lth.cs.srl.Parse <lang> <input-corpus> <model-file> [options] <output>
Example:
java -cp srl.jar:lib/liblinear-1.51-with-deps.jar se.lth.cs.srl.Parse eng ~/corpora/eng/CoNLL2009-ST-English-evaluation-SRLonly.txt eng-srl.mdl -reranker -nopi -alfa 1.0 eng-eval.out
parses in the input corpus using the model eng-srl.mdl and saves it to eng-eval.out, using a reranker and skipping the predicate identification step
<lang> corresponds to the language and is one of
chi, eng, ger
Options:
-aibeam <int> the size of the ai-beam for the reranker
-acbeam <int> the size of the ac-beam for the reranker
-help prints this message
Parsing-specific options:
-nopi skips the predicate identification. This is equivalent to the
setting in the CoNLL 2009 ST.
-reranker uses a reranker (assumed to be included in the model)
-alfa <double> the alfa used by the reranker. (default 1.0)
```
We need to provide `lang` (`ger` for German feature functions?), `input-corpus`, and `model-file` (see train).
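Mirroring the example from the help text, a parse run on the Slovene test split might be assembled like this. Again a sketch: the language code, model name, and paths are assumptions, and the model must come from a prior train run.

```shell
# Sketch: build the Parse command for the Slovene test split.
# ASSUMPTIONS: "ger" language code, model "sl-srl.mdl" trained with -reranker.
CP="srl.jar:lib/liblinear-1.51-with-deps.jar:lib/anna.jar"
TEST="./bilateral-srl/tools/mate-tools/sl.test.mate"
CMD="java -cp $CP se.lth.cs.srl.Parse ger $TEST sl-srl.mdl -reranker sl-eval.out"
echo "$CMD"
```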
### input data
* `ssj500k` data found in `./bilateral-srl/data/sl/sl.{test,train}`;
formatted for mate-tools usage in `./bilateral-srl/tools/mate-tools/sl.{test,train}.mate` (line counts match);
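The "line counts match" claim can be re-checked with `wc -l`. The snippet below is a hypothetical sketch that synthesizes two demo files; on the real data you would point `src` and `mate` at `./bilateral-srl/data/sl/sl.test` and `./bilateral-srl/tools/mate-tools/sl.test.mate`.

```shell
# Hypothetical sanity check that the .mate conversion kept every line.
# Demo files are created here so the snippet is self-contained.
src="$(mktemp)"; mate="$(mktemp)"
printf 'w1\nw2\nw3\n' > "$src"
printf 'w1\tX\nw2\tX\nw3\tX\n' > "$mate"
if [ "$(wc -l < "$src")" -eq "$(wc -l < "$mate")" ]; then
    echo "line counts match"
fi
```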
## Sources
[1] (mate-tools) https://code.google.com/archive/p/mate-tools/
[2] (bilateral-srl) https://github.com/clarinsi/bilateral-srl

@@ -0,0 +1 @@
Subproject commit 86642e18665ce5304f57b3e38ef18057807eaae8

File diff suppressed because it is too large

File diff suppressed because it is too large

File diff suppressed because it is too large

@@ -1 +1,5 @@
FROM java
RUN apt-get update
RUN apt-get install -y \
vim

@@ -0,0 +1,37 @@
#!/bin/sh
## There are three sets of options that need, may need to, and could be changed.
## (1) deals with input and output. You have to set these (in particular, you need to provide a training corpus)
## (2) deals with the jvm parameters and may need to be changed
## (3) deals with the behaviour of the system
## For further information on switches, see the source code, or run
## java -cp srl.jar se.lth.cs.srl.Parse --help
##################################################
## (1) The following needs to be set appropriately
##################################################
INPUT=~/corpora/conll09/spa/CoNLL2009-ST-evaluation-Spanish-SRLonly.txt
Lang="spa"
MODEL="./srl-spa.model"
OUTPUT="${Lang}-eval.out"
##################################################
## (2) These ones may need to be changed
##################################################
JAVA="java" #Edit this if you want to use a specific java binary.
MEM="2g" #Memory for the JVM, might need to be increased for large corpora.
CP="srl.jar:lib/liblinear-1.51-with-deps.jar:lib/anna.jar"
JVM_ARGS="-cp $CP -Xmx$MEM"
##################################################
## (3) The following changes the behaviour of the system
##################################################
#RERANKER="-reranker" #Uncomment this if you want to use a reranker too. The model is assumed to contain a reranker. While training, this has to be set appropriately.
NOPI="-nopi" #Uncomment this if you want to skip the predicate identification step. This setting is equivalent to the CoNLL 2009 ST.
CMD="$JAVA $JVM_ARGS se.lth.cs.srl.Parse $Lang $INPUT $MODEL $RERANKER $NOPI $OUTPUT"
echo "Executing: $CMD"
$CMD