forked from kristjan/cjvt-srl-tagging
added some data and updated README.md
parent
2755f4c7ed
commit
0a27168ddf
3
.gitmodules
vendored
Normal file
|
@ -0,0 +1,3 @@
|
|||
[submodule "bilateral-srl"]
|
||||
path = bilateral-srl
|
||||
url = https://github.com/clarinsi/bilateral-srl.git
|
90
README.md
|
@ -1,9 +1,95 @@
|
|||
# cjvt-srl-tagging
|
||||
We'll be using mate-tools to perform SRL on Kres.
|
||||
We'll be using mate-tools to perform SRL on Kres.
|
||||
|
||||
## workspace
|
||||
The tools require Java.
|
||||
See `./dockerfiles/mate-tool-env/README.md` for environment preparation.
|
||||
|
||||
## mate-tools
|
||||
Using **Full srl pipeline (including anna-3.3)** from the Downloads section.
|
||||
Benchmarking the tool for slo and hr: [2].
|
||||
Benchmarking the tool for slo and hr: [2] (submodule of this repo).
|
||||
|
||||
Mate-tool for srl tagging can be found in `./tools/srl-20131216/`.
|
||||
|
||||
### train
|
||||
Create the `model-file`:
|
||||
|
||||
`--help` output:
|
||||
```bash
|
||||
java -cp srl.jar se.lth.cs.srl.Learn --help
|
||||
Not enough arguments, aborting.
|
||||
Usage:
|
||||
java -cp <classpath> se.lth.cs.srl.Learn <lang> <input-corpus> <model-file> [options]
|
||||
|
||||
Example:
|
||||
java -cp srl.jar:lib/liblinear-1.51-with-deps.jar se.lth.cs.srl.Learn eng ~/corpora/eng/CoNLL2009-ST-English-train.txt eng-srl.mdl -reranker -fdir ~/features/eng -llbinary ~/liblinear-1.6/train
|
||||
|
||||
trains a complete pipeline and reranker based on the corpus and saves it to eng-srl.mdl
|
||||
|
||||
<lang> corresponds to the language and is one of
|
||||
chi, eng, ger
|
||||
|
||||
Options:
|
||||
-aibeam <int> the size of the ai-beam for the reranker
|
||||
-acbeam <int> the size of the ac-beam for the reranker
|
||||
-help prints this message
|
||||
|
||||
Learning-specific options:
|
||||
-fdir <dir> the directory with feature files (see below)
|
||||
-reranker trains a reranker also (not done by default)
|
||||
-llbinary <file> a reference to a precompiled version of liblinear,
|
||||
makes training much faster than the java version.
|
||||
-partitions <int> number of partitions used for the reranker
|
||||
-dontInsertGold don't insert the gold standard proposition during
|
||||
training of the reranker.
|
||||
-skipUnknownPredicates skips predicates not matching any POS-tags from
|
||||
the feature files.
|
||||
-dontDeleteTrainData doesn't delete the temporary files from training
|
||||
on exit. (For debug purposes)
|
||||
-ndPipeline Causes the training data and feature mappings to be
|
||||
derived in a non-deterministic way. I.e. training the pipeline
|
||||
on the same corpus twice does not yield the exact same models.
|
||||
This is however slightly faster.
|
||||
|
||||
The feature file dir needs to contain four files with feature sets. See
|
||||
the website for further documentation. The files are called
|
||||
pi.feats, pd.feats, ai.feats, and ac.feats
|
||||
All need to be in the feature file dir, otherwise you will get an error.
|
||||
```
|
||||
Input: `lang`, `input-corpus`.
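A possible training command for the Slovene data, as a minimal sketch. The argument order follows the usage string above and the classpath is copied from the `--help` example. The language code `ger`, the model file name `sl-srl.mdl`, and the relative paths are assumptions for illustration, not settings confirmed by this repo. The helper `build_learn_cmd` is hypothetical and only prints the command so it can be inspected before running.

```bash
#!/bin/sh
# Hypothetical helper: print (do not run) a se.lth.cs.srl.Learn command line.
# Argument order follows the usage string: <lang> <input-corpus> <model-file>.
build_learn_cmd() {
    lang="$1"; corpus="$2"; model="$3"
    # Classpath copied from the --help example.
    cp="srl.jar:lib/liblinear-1.51-with-deps.jar"
    echo "java -cp $cp se.lth.cs.srl.Learn $lang $corpus $model"
}

# Assumed values: "ger" as language code, sl-srl.mdl as model file name.
build_learn_cmd ger data/sl.train.mate sl-srl.mdl
```

Options such as `-fdir` or `-reranker` would go after the model file, as in the `--help` example.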
|
||||
|
||||
### parse
|
||||
`--help` output:
|
||||
```bash
|
||||
$ java -cp srl.jar se.lth.cs.srl.Parse --help
|
||||
Not enough arguments, aborting.
|
||||
Usage:
|
||||
java -cp <classpath> se.lth.cs.srl.Parse <lang> <input-corpus> <model-file> [options] <output>
|
||||
|
||||
Example:
|
||||
java -cp srl.jar:lib/liblinear-1.51-with-deps.jar se.lth.cs.srl.Parse eng ~/corpora/eng/CoNLL2009-ST-English-evaluation-SRLonly.txt eng-srl.mdl -reranker -nopi -alfa 1.0 eng-eval.out
|
||||
|
||||
parses in the input corpus using the model eng-srl.mdl and saves it to eng-eval.out, using a reranker and skipping the predicate identification step
|
||||
|
||||
<lang> corresponds to the language and is one of
|
||||
chi, eng, ger
|
||||
|
||||
Options:
|
||||
-aibeam <int> the size of the ai-beam for the reranker
|
||||
-acbeam <int> the size of the ac-beam for the reranker
|
||||
-help prints this message
|
||||
|
||||
Parsing-specific options:
|
||||
-nopi skips the predicate identification. This is equivalent to the
|
||||
setting in the CoNLL 2009 ST.
|
||||
-reranker uses a reranker (assumed to be included in the model)
|
||||
-alfa <double> the alfa used by the reranker. (default 1.0)
|
||||
```
|
||||
We need to provide `lang` (`ger` for German feature functions?), `input-corpus` and `model` (see train).
|
||||
|
||||
### input data:
|
||||
* `ssj500k` data found in `./bilateral-srl/data/sl/sl.{test,train}`;
|
||||
formatted for mate-tools usage in `./bilateral-srl/tools/mate-tools/sl.{test,train}.mate` (line counts match);
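
The line-count claim can be re-checked with a short sketch like the one below. The helper name `same_line_count` is hypothetical; the paths are the ones listed above and are assumed to be relative to the repo root with the submodule checked out. Splits whose files are missing are skipped.

```bash
#!/bin/sh
# Hypothetical helper: succeed only if both files have the same number of lines.
same_line_count() {
    a=$(( $(wc -l < "$1") ))
    b=$(( $(wc -l < "$2") ))
    [ "$a" -eq "$b" ]
}

for split in test train; do
    src="bilateral-srl/data/sl/sl.$split"
    mate="bilateral-srl/tools/mate-tools/sl.$split.mate"
    # Skip silently when the submodule data is not available.
    [ -f "$src" ] && [ -f "$mate" ] || continue
    if same_line_count "$src" "$mate"; then
        echo "$split: line counts match"
    else
        echo "$split: line counts differ"
    fi
done
```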
|
||||
|
||||
## Sources
|
||||
[1] (mate-tools) https://code.google.com/archive/p/mate-tools/

[2] (bilateral-srl) https://github.com/clarinsi/bilateral-srl
|
||||
|
|
1
bilateral-srl
Submodule
|
@ -0,0 +1 @@
|
|||
Subproject commit 86642e18665ce5304f57b3e38ef18057807eaae8
|
107152
data/sl.all.mate
Normal file
File diff suppressed because it is too large
21409
data/sl.test.mate
Normal file
File diff suppressed because it is too large
85743
data/sl.train.mate
Normal file
File diff suppressed because it is too large
|
@ -1 +1,5 @@
|
|||
FROM java
|
||||
|
||||
RUN apt-get update
|
||||
RUN apt-get install -y \
|
||||
vim
|
||||
|
|
37
tools/srl-20131216/scripts/parse_srl_only_mod.sh
Normal file
|
@ -0,0 +1,37 @@
|
|||
#!/bin/sh
|
||||
|
||||
## There are three sets of options that need, may need to, and could be changed.
|
||||
## (1) deals with input and output. You have to set these (in particular, you need to provide an input corpus and a trained model)
|
||||
## (2) deals with the jvm parameters and may need to be changed
|
||||
## (3) deals with the behaviour of the system
|
||||
|
||||
## For further information on switches, see the source code, or run
|
||||
## java -cp srl.jar se.lth.cs.srl.Parse --help
|
||||
|
||||
##################################################
|
||||
## (1) The following needs to be set appropriately
|
||||
##################################################
|
||||
INPUT=~/corpora/conll09/spa/CoNLL2009-ST-evaluation-Spanish-SRLonly.txt
|
||||
Lang="spa"
|
||||
MODEL="./srl-spa.model"
|
||||
OUTPUT="${Lang}-eval.out"
|
||||
|
||||
##################################################
|
||||
## (2) These ones may need to be changed
|
||||
##################################################
|
||||
JAVA="java" #Edit this if you want to use a specific java binary.
|
||||
MEM="2g" #Memory for the JVM, might need to be increased for large corpora.
|
||||
CP="srl.jar:lib/liblinear-1.51-with-deps.jar:lib/anna.jar"
|
||||
JVM_ARGS="-cp $CP -Xmx$MEM"
|
||||
|
||||
##################################################
|
||||
## (3) The following changes the behaviour of the system
|
||||
##################################################
|
||||
#RERANKER="-reranker" #Uncomment this if you want to use a reranker too. The model is assumed to contain a reranker. While training, this has to be set appropriately.
|
||||
NOPI="-nopi" #Skips the predicate identification step. This setting is equivalent to the CoNLL 2009 ST. Comment this out if you want predicate identification to run.
|
||||
|
||||
|
||||
CMD="$JAVA $JVM_ARGS se.lth.cs.srl.Parse $Lang $INPUT $MODEL $RERANKER $NOPI $OUTPUT"
|
||||
echo "Executing: $CMD"
|
||||
|
||||
$CMD
|
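The script above launches the JVM without checking that `$INPUT` and `$MODEL` exist. A small pre-flight check could be added before `CMD` is built. This is only a sketch: `require_file` is a hypothetical helper that is not part of the original script or of mate-tools.

```bash
#!/bin/sh
# Hypothetical helper (not in the original script): fail early if a required file is missing.
require_file() {
    if [ ! -f "$1" ]; then
        echo "Missing file: $1" >&2
        return 1
    fi
}

# Possible use before building CMD:
# require_file "$INPUT" && require_file "$MODEL" || exit 1
```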