forked from kristjan/cjvt-srl-tagging

# cjvt-srl-tagging
We'll be using mate-tools [1] to perform semantic role labeling (SRL) on the Kres corpus.

## workspace
The tools require Java.
See `./dockerfiles/mate-tool-env/README.md` for environment preparation.

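Since everything below depends on Java, it can help to confirm that a `java` executable is on the `PATH` before going further. The following is a minimal sketch; `require_cmd` is a hypothetical helper name and not part of this repo.

```bash
#!/usr/bin/env bash
# require_cmd is a hypothetical helper: it succeeds if the named
# command can be found on PATH and fails otherwise.
require_cmd() {
    if command -v "$1" >/dev/null 2>&1; then
        echo "found: $1"
    else
        echo "missing: $1" >&2
        return 1
    fi
}

# mate-tools needs a Java runtime.
require_cmd java || echo "Java not found, see ./dockerfiles/mate-tool-env/README.md"
```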
## mate-tools
We use the **Full srl pipeline (including anna-3.3)** package from the Downloads section of [1].
The tool has been benchmarked for Slovene (slo) and Croatian (hr) in [2], which is a submodule of this repo.

The mate-tools SRL tagger can be found in `./tools/srl-20131216/`.

### train
Create the `model-file`:

`--help` output:
```bash
$ java -cp srl.jar se.lth.cs.srl.Learn --help
Not enough arguments, aborting.
Usage:
 java -cp <classpath> se.lth.cs.srl.Learn <lang> <input-corpus> <model-file> [options]

Example:
 java -cp srl.jar:lib/liblinear-1.51-with-deps.jar se.lth.cs.srl.Learn eng ~/corpora/eng/CoNLL2009-ST-English-train.txt eng-srl.mdl -reranker -fdir ~/features/eng -llbinary ~/liblinear-1.6/train

 trains a complete pipeline and reranker based on the corpus and saves it to eng-srl.mdl

<lang> corresponds to the language and is one of
 chi, eng, ger

Options:
 -aibeam <int>     the size of the ai-beam for the reranker
 -acbeam <int>     the size of the ac-beam for the reranker
 -help             prints this message

Learning-specific options:
 -fdir <dir>             the directory with feature files (see below)
 -reranker               trains a reranker also (not done by default)
 -llbinary <file>        a reference to a precompiled version of liblinear,
                         makes training much faster than the java version.
 -partitions <int>       number of partitions used for the reranker
 -dontInsertGold         don't insert the gold standard proposition during
                         training of the reranker.
 -skipUnknownPredicates  skips predicates not matching any POS-tags from
                         the feature files.
 -dontDeleteTrainData    doesn't delete the temporary files from training
                         on exit. (For debug purposes)
 -ndPipeline             Causes the training data and feature mappings to be
                         derived in a non-deterministic way. I.e. training the pipeline
                         on the same corpus twice does not yield the exact same models.
                         This is however slightly faster.

The feature file dir needs to contain four files with feature sets. See
the website for further documentation. The files are called
pi.feats, pd.feats, ai.feats, and ac.feats
All need to be in the feature file dir, otherwise you will get an error.
```
Inputs: `lang` and `input-corpus`; the trained model is saved to `model-file`.

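Putting the usage text together, a training run on the Slovene data might look like the sketch below. It rests on several assumptions that this repo does not confirm: that `ger` is a workable `lang` setting for Slovene, that the model is saved as `sl-srl.mdl`, and that the four feature files live in `./features/sl` (both names are hypothetical). The script only checks the feature directory and prints the command, so it can be inspected before anything is run.

```bash
#!/usr/bin/env bash
# Sketch only. Assumptions (not confirmed by this repo):
#   - lang "ger" is an acceptable stand-in for Slovene
#   - sl-srl.mdl and ./features/sl are hypothetical names
LANG_CODE="ger"
CORPUS="./bilateral-srl/tools/mate-tools/sl.train.mate"
MODEL="sl-srl.mdl"
FDIR="./features/sl"

# Print the name of each required feature file missing from directory $1.
missing_feats() {
    local dir="$1" f
    for f in pi.feats pd.feats ai.feats ac.feats; do
        [ -f "$dir/$f" ] || echo "$f"
    done
}

# Print a Learn command line following the usage text:
#   <lang> <input-corpus> <model-file> [options]
learn_cmd() {
    echo "java -cp srl.jar:lib/liblinear-1.51-with-deps.jar se.lth.cs.srl.Learn $1 $2 $3 -fdir $4"
}

missing="$(missing_feats "$FDIR")"
if [ -n "$missing" ]; then
    echo "missing feature files in $FDIR:" $missing
fi
learn_cmd "$LANG_CODE" "$CORPUS" "$MODEL" "$FDIR"
```

The classpath is copied from the example in the help text and is relative to the tool directory, so the corpus and feature paths would need adjusting to wherever the command is actually run from.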
### parse
`--help` output:
```bash
$ java -cp srl.jar se.lth.cs.srl.Parse --help
Not enough arguments, aborting.
Usage:
 java -cp <classpath> se.lth.cs.srl.Parse <lang> <input-corpus> <model-file> [options] <output>

Example:
 java -cp srl.jar:lib/liblinear-1.51-with-deps.jarse.lth.cs.srl.Parse eng ~/corpora/eng/CoNLL2009-ST-English-evaluation-SRLonly.txt eng-srl.mdl -reranker -nopi -alfa 1.0 eng-eval.out

 parses in the input corpus using the model eng-srl.mdl and saves it to eng-eval.out, using a reranker and skipping the predicate identification step

<lang> corresponds to the language and is one of
 chi, eng, ger

Options:
 -aibeam <int>     the size of the ai-beam for the reranker
 -acbeam <int>     the size of the ac-beam for the reranker
 -help             prints this message

Parsing-specific options:
 -nopi           skips the predicate identification. This is equivalent to the
                 setting in the CoNLL 2009 ST.
 -reranker       uses a reranker (assumed to be included in the model)
 -alfa <double>  the alfa used by the reranker. (default 1.0)
```
We need to provide `lang` (`ger` for German feature functions?), `input-corpus`, `model-file` (see train), and the `output` path.

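As an illustration of the argument order in the usage text, the sketch below assembles a parse command for the Slovene test set. The `ger` setting, the model name `sl-srl.mdl`, and the output name `sl-test.out` are assumptions made for the example, and `parse_cmd` is a hypothetical helper; it only prints the command.

```bash
#!/usr/bin/env bash
# Sketch only: "ger", sl-srl.mdl and sl-test.out are assumed values.

# Print a Parse command line following the usage text:
#   <lang> <input-corpus> <model-file> [options] <output>
# Usage: parse_cmd <lang> <corpus> <model> <output> [options...]
parse_cmd() {
    local lang="$1" corpus="$2" model="$3" output="$4"
    shift 4
    local opts="$*"   # any remaining arguments are passed through as options
    echo "java -cp srl.jar:lib/liblinear-1.51-with-deps.jar se.lth.cs.srl.Parse $lang $corpus $model ${opts:+$opts }$output"
}

parse_cmd ger ./bilateral-srl/tools/mate-tools/sl.test.mate sl-srl.mdl sl-test.out
```

Options such as `-nopi` can be appended after the output argument of the helper; it places them before `<output>` in the printed command, as the usage text requires.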
### input data:
* `ssj500k` data is found in `./bilateral-srl/data/sl/sl.{test,train}`; the same data formatted for mate-tools usage is in `./bilateral-srl/tools/mate-tools/sl.{test,train}.mate` (line counts match).

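The line-count claim can be re-checked with a few lines of shell. This is a minimal sketch; `same_line_count` is a hypothetical helper, and the paths are the ones listed above, relative to the repo root.

```bash
#!/usr/bin/env bash
# Hypothetical helper: succeeds only if both files are readable
# and contain the same number of lines.
same_line_count() {
    [ -r "$1" ] && [ -r "$2" ] || return 2
    local a b
    a=$(wc -l < "$1")
    b=$(wc -l < "$2")
    [ "$a" -eq "$b" ]
}

for split in test train; do
    raw="./bilateral-srl/data/sl/sl.$split"
    mate="./bilateral-srl/tools/mate-tools/sl.$split.mate"
    if same_line_count "$raw" "$mate"; then
        echo "$split: line counts match"
    else
        echo "$split: line counts differ or a file is missing"
    fi
done
```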
## Sources
* [1] (mate-tools) https://code.google.com/archive/p/mate-tools/
* [2] (benchmarking) https://github.com/clarinsi/bilateral-srl
* [3] (CoNLL 2008 paper) http://www.aclweb.org/anthology/W08-2121.pdf
* [4] (CoNLL 2009 format) https://wiki.ufal.ms.mff.cuni.cz/format-conll