added mate-tools srl parser

This commit is contained in:
voje
2019-01-25 07:29:52 +01:00
parent e19d5f4599
commit 587d2d9fdd
38 changed files with 1670 additions and 1 deletions


@@ -0,0 +1,83 @@
This folder contains some scripts to execute the system. They all need
to be edited slightly before use, with proper paths to corpora and/or
models. They are meant to be executed from the parent directory, e.g.
cd $DIST_ROOT
sh scripts/learn.sh
There are some comments in the scripts on switches etc. The amount of
memory used is typically what is required for the English CoNLL 2009
corpora; it might be possible to push it down a bit.
Since the system grew out of the CoNLL 2009 ST, there are a couple of
different ways to parse a corpus:
(i) parse_full.sh - parses a complete corpus using all steps of the
pipeline except tokenization. It pretty much
assumes that the file contains tokens in the
second column and disregards the rest.
If the -nopi switch is used, the input needs to have the
IsPred column from the CoNLL 2009 data format.
(ii) parse_srl_only.sh - parses semantic roles only. The input is
expected to be the CoNLL 2009 data format
with proper dependency trees (i.e. the
SRLonly evaluation corpus).
In order to replicate the setting of the 2009
ST, one can use the -nopi switch to skip the
predicate identification step.
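The choice between the two entry points can be sketched as follows. This is purely illustrative: the script names are the ones described above, but the decision variable is hypothetical and not part of the distribution.

```shell
#!/bin/sh
# Illustrative sketch: pick the entry point based on the input format.
# "yes" if the input is CoNLL 2009 format with proper dependency trees.
INPUT_HAS_DEPS="yes"
if [ "$INPUT_HAS_DEPS" = "yes" ]; then
  SCRIPT="scripts/parse_srl_only.sh"   # semantic roles only
else
  SCRIPT="scripts/parse_full.sh"       # full pipeline (except tokenization)
fi
echo "Would run: sh $SCRIPT"
```

Remember that the -nopi switch is toggled inside the scripts themselves, not on their command line.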
Then there is also the HTTP interface. This is started by the
run_http_server.sh script. Again, edit the file with proper paths
before executing it.
NOTE: the http server depends on the java package
com.sun.net.httpserver (cf.
http://download.oracle.com/javase/6/docs/jre/api/net/httpserver
/spec/com/sun/net/httpserver/package-summary.html
and
http://blogs.sun.com/michaelmcm/entry/http_server_api_in_java
), which is not part of the official Java API specification, but comes with
most (or at least some) JRE distributions. In my own experience,
it is included in the Sun Java 6 distribution* as well as in the OpenJDK
Java 6**.
[[
*:
On a Mac:
% java -version
java version "1.6.0_17"
Java(TM) SE Runtime Environment (build 1.6.0_17-b04-248-10M3025)
Java HotSpot(TM) 64-Bit Server VM (build 14.3-b01-101, mixed mode)
%
and
On pc, mandriva linux:
% /usr/lib/jvm/java-sun/bin/java -version
java version "1.6.0_15"
Java(TM) SE Runtime Environment (build 1.6.0_15-b03)
Java HotSpot(TM) 64-Bit Server VM (build 14.1-b02, mixed mode)
%
**:
On pc, mandriva linux:
% java -version
java version "1.6.0_18"
OpenJDK Runtime Environment (IcedTea6 1.8) (mandriva-2.b18.2mdv2009.1-x86_64)
OpenJDK 64-Bit Server VM (build 14.0-b16, mixed mode)
%
]]
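A quick way to check whether com.sun.net.httpserver is available on a given machine is to ask javap for the class. This is a sketch; javap ships with the JDK, so it only works where a JDK (not just a JRE) is installed.

```shell
#!/bin/sh
# Probe for the com.sun.net.httpserver package by looking up its main class.
# javap exits non-zero if the class cannot be found on the bootclasspath.
if javap com.sun.net.httpserver.HttpServer >/dev/null 2>&1; then
  HAVE_HTTPSERVER="yes"
else
  HAVE_HTTPSERVER="no"
fi
echo "com.sun.net.httpserver available: $HAVE_HTTPSERVER"
```

If this reports "no", run_http_server.sh will fail at startup with a NoClassDefFoundError for the httpserver classes.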
The graphical dependency graph output of the HTTP interface relies on
having Chinese fonts installed properly locally. On Linux I had some issues
with this, but resolved them according to
http://blog.lizhao.net/2007/03/java-chinese-fonts-on-ubuntu.html
The graphical dependency graph output also seems to work less well
when using OpenJDK: the images get strange lines through them. The Sun
JRE seems to work fine, though.
Feedback and questions are appreciated: anders@ims.uni-stuttgart.de


@@ -0,0 +1,38 @@
#!/bin/sh
## There are three sets of options: ones that need to, may need to, or could be changed.
## (1) deals with input and output. You have to set these (in particular, you need to provide a training corpus)
## (2) deals with the jvm parameters and may need to be changed
## (3) deals with the behaviour of the system
## For further information on switches, see the source code, or run
## java -cp srl.jar se.lth.cs.srl.Learn --help
##################################################
## (1) The following needs to be set appropriately
##################################################
CORPUS=~/corpora/conll09/spa/CoNLL2009-ST-Spanish-train.txt.pdeps #training corpus
Lang="spa"
MODEL="srl-$Lang.model"
##################################################
## (2) These ones may need to be changed
##################################################
JAVA="java" #Edit this if you want to use a specific java binary.
MEM="4g" #Memory for the JVM, might need to be increased for large corpora.
CP="srl.jar:lib/liblinear-1.51-with-deps.jar"
JVM_ARGS="-cp $CP -Xmx$MEM"
##################################################
## (3) The following changes the behaviour of the system
##################################################
#LLBINARY="-llbinary /home/anders/liblinear-1.6/train" #Path to locally compiled liblinear. Uncomment this and correct the path if you have it. This will make training models faster (30-40%). The models come out slightly different compared to the Java version, though, due to floating-point arithmetic.
#RERANKER="-reranker" #Uncomment this if you want to train a reranker too. This takes about 8 times longer than the simple pipeline.
#Execute
CMD="$JAVA $JVM_ARGS se.lth.cs.srl.Learn $Lang $CORPUS $MODEL $RERANKER $LLBINARY"
echo "Executing: $CMD"
$CMD


@@ -0,0 +1,59 @@
#!/bin/sh
## There are three sets of options: ones that need to, may need to, or could be changed.
## (1) deals with input and output. You have to set these (in particular, you need to provide models)
## (2) deals with the jvm parameters and may need to be changed
## (3) deals with the behaviour of the system
## For further information on switches, see the source code, or run
## java -cp srl.jar se.lth.cs.srl.Parse --help
##################################################
## (1) The following needs to be set appropriately
##################################################
#INPUT="/home/anders/corpora/conll09/eng/CoNLL2009-evaluation-English-SRLonly.txt" #evaluation corpus
INPUT=/home/anders/corpora/conll09/chi/CoNLL2009-ST-evaluation-Chinese-SRLonly.txt
LANG="chi"
##TOKENIZER_MODEL="models/eng/EnglishTok.bin.gz" #This is not used here anyway. The input is assumed to be segmented/tokenized already.
##LEMMATIZER_MODEL="models/chi/lemma-eng.model"
POS_MODEL="models/chi/tag-chn.model"
#MORPH_MODEL="models/ger/morph-ger.model" #Morphological tagger is not applicable to English. Fix the path and uncomment if you are running german.
PARSER_MODEL="models/chi/prs-chn.model"
SRL_MODEL="models/chi/srl-chn.model"
OUTPUT="$LANG.out"
##################################################
## (2) These ones may need to be changed
##################################################
JAVA="java" #Edit this if you want to use a specific JRE.
MEM="4g" #Memory for the JVM, might need to be increased for large corpora.
CP="srl.jar:lib/anna.jar:lib/liblinear-1.51-with-deps.jar:lib/opennlp-tools-1.4.3.jar:lib/maxent-2.5.2.jar:lib/trove.jar:lib/seg.jar"
JVM_ARGS="-cp $CP -Xmx$MEM"
##################################################
## (3) The following changes the behaviour of the system
##################################################
#RERANKER="-reranker" #Uncomment this if you want to use a reranker too. The model is assumed to contain a reranker. While training, the corresponding parameter has to be provided.
#NOPI="-nopi" #Uncomment this if you want to skip the predicate identification step.
##################################################
CMD="$JAVA $JVM_ARGS se.lth.cs.srl.CompletePipeline $LANG $NOPI $RERANKER -tagger $POS_MODEL -parser $PARSER_MODEL -srl $SRL_MODEL -test $INPUT -out $OUTPUT"
if [ "$TOKENIZER_MODEL" != "" ]; then
CMD="$CMD -token $TOKENIZER_MODEL"
fi
if [ "$LEMMATIZER_MODEL" != "" ]; then
CMD="$CMD -lemma $LEMMATIZER_MODEL"
fi
if [ "$MORPH_MODEL" != "" ]; then
CMD="$CMD -morph $MORPH_MODEL"
fi
echo "Executing: $CMD"
$CMD
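The three if-blocks above append a model flag only when the corresponding variable is set. An equivalent, more compact pattern uses POSIX ${VAR:+...} parameter expansion; the sketch below uses the same flag names as the script, with example model paths that are not part of the distribution.

```shell
#!/bin/sh
# Sketch: append optional flags via ${VAR:+...} instead of if-blocks.
# When the variable is empty or unset, the whole expansion vanishes.
LEMMATIZER_MODEL="models/chi/lemma-chn.model"  # example value
MORPH_MODEL=""                                 # empty -> flag omitted entirely
CMD="java se.lth.cs.srl.CompletePipeline chi"
CMD="$CMD${LEMMATIZER_MODEL:+ -lemma $LEMMATIZER_MODEL}"
CMD="$CMD${MORPH_MODEL:+ -morph $MORPH_MODEL}"
echo "$CMD"
```

Both styles behave identically; the if-blocks are arguably easier to comment out per flag.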


@@ -0,0 +1,37 @@
#!/bin/sh
## There are three sets of options: ones that need to, may need to, or could be changed.
## (1) deals with input and output. You have to set these (in particular, you need to provide a training corpus)
## (2) deals with the jvm parameters and may need to be changed
## (3) deals with the behaviour of the system
## For further information on switches, see the source code, or run
## java -cp srl.jar se.lth.cs.srl.Parse --help
##################################################
## (1) The following needs to be set appropriately
##################################################
INPUT=~/corpora/conll09/spa/CoNLL2009-ST-evaluation-Spanish-SRLonly.txt
Lang="spa"
MODEL="./srl-spa.model"
OUTPUT="${Lang}-eval.out"
##################################################
## (2) These ones may need to be changed
##################################################
JAVA="java" #Edit this if you want to use a specific java binary.
MEM="2g" #Memory for the JVM, might need to be increased for large corpora.
CP="srl.jar:lib/liblinear-1.51-with-deps.jar:lib/anna.jar"
JVM_ARGS="-cp $CP -Xmx$MEM"
##################################################
## (3) The following changes the behaviour of the system
##################################################
#RERANKER="-reranker" #Uncomment this if you want to use a reranker too. The model is assumed to contain a reranker. While training, this has to be set appropriately.
NOPI="-nopi" #Comment this out if you want to include the predicate identification step. With -nopi set, this is equivalent to the CoNLL 2009 ST setting.
CMD="$JAVA $JVM_ARGS se.lth.cs.srl.Parse $Lang $INPUT $MODEL $RERANKER $NOPI $OUTPUT"
echo "Executing: $CMD"
$CMD


@@ -0,0 +1,57 @@
#!/bin/sh
## There are three sets of options: ones that need to, may need to, or could be changed.
## (1) deals with input and output. You have to set these (in particular, you need to provide models)
## (2) deals with the jvm parameters and may need to be changed
## (3) deals with the behaviour of the system
##################################################
## (1) The following needs to be set appropriately
##################################################
Lang="eng"
MODELDIR=`dirname $0`/../../models/eng/
#TOKENIZER_MODEL=${MODELDIR}/en-token.bin #If tokenizer is blank, it will use some default (Stanford for English, Exner for Swedish, and whitespace otherwise)
#TOKENIZER_MODEL="models/chi/stanford-chinese-segmenter-2008-05-21/data" #Use this for chinese.
LEMMATIZER_MODEL=${MODELDIR}/CoNLL2009-ST-English-ALL.anna-3.3.lemmatizer.model
POS_MODEL=${MODELDIR}/CoNLL2009-ST-English-ALL.anna-3.3.postagger.model
#MORPH_MODEL=${MODELDIR}/ #No morph model for English.
PARSER_MODEL=${MODELDIR}/CoNLL2009-ST-English-ALL.anna-3.3.parser.model
PORT=8073 #The port to listen on
##################################################
## (2) These ones may need to be changed
##################################################
JAVA="java" #Edit this if you want to use a specific java binary.
MEM="4g" #Memory for the JVM, might need to be increased for large corpora.
DIST_ROOT=`dirname $0`/..
CP=${DIST_ROOT}/srl.jar
for jar in ${DIST_ROOT}/lib/*.jar; do
# echo $jar
CP=${CP}:$jar
done
#exit 0;
JVM_ARGS="-Djava.awt.headless=true -cp $CP -Xmx$MEM"
# The java.awt.headless property is needed to render the images of dependency graphs if the server is executed remotely (and there is no GUI stuff involved anyway)
##################################################
## (3) The following changes the behaviour of the system
##################################################
#RERANKER="-reranker" #Uncomment this if you want to use a reranker too. The model is assumed to contain a reranker. While training, the corresponding parameter has to be provided.
CMD="$JAVA $JVM_ARGS se.lth.cs.srl.http.AnnaHttpPipeline $Lang $RERANKER -tagger $POS_MODEL -parser $PARSER_MODEL -port $PORT"
if [ "$TOKENIZER_MODEL" != "" ]; then
CMD="$CMD -token $TOKENIZER_MODEL"
fi
if [ "$LEMMATIZER_MODEL" != "" ]; then
CMD="$CMD -lemma $LEMMATIZER_MODEL"
fi
if [ "$MORPH_MODEL" != "" ]; then
CMD="$CMD -morph $MORPH_MODEL"
fi
echo "Executing: $CMD"
$CMD


@@ -0,0 +1,61 @@
#!/bin/sh
## There are three sets of options: ones that need to, may need to, or could be changed.
## (1) deals with input and output. You have to set these (in particular, you need to provide models)
## (2) deals with the jvm parameters and may need to be changed
## (3) deals with the behaviour of the system
## For further information on switches, see the source code, or run
## java -cp srl-20100902.jar se.lth.cs.srl.http.HttpPipeline
##################################################
## (1) The following needs to be set appropriately
##################################################
Lang="eng"
MODELDIR=`dirname $0`/../../models/eng/
#TOKENIZER_MODEL=${MODELDIR}/en-token.bin #If tokenizer is blank, it will use some default (Stanford for English, Exner for Swedish, and whitespace otherwise)
#TOKENIZER_MODEL="models/chi/stanford-chinese-segmenter-2008-05-21/data" #Use this for chinese.
LEMMATIZER_MODEL=${MODELDIR}/CoNLL2009-ST-English-ALL.anna-3.3.lemmatizer.model
POS_MODEL=${MODELDIR}/CoNLL2009-ST-English-ALL.anna-3.3.postagger.model
#MORPH_MODEL=${MODELDIR}/ #No morph model for English.
PARSER_MODEL=${MODELDIR}/CoNLL2009-ST-English-ALL.anna-3.3.parser.model
SRL_MODEL=${MODELDIR}/CoNLL2009-ST-English-ALL.anna-3.3.srl.model
PORT=8072 #The port to listen on
##################################################
## (2) These ones may need to be changed
##################################################
JAVA="java" #Edit this if you want to use a specific java binary.
MEM="4g" #Memory for the JVM, might need to be increased for large corpora.
DIST_ROOT=`dirname $0`/..
CP=${DIST_ROOT}/srl.jar
for jar in ${DIST_ROOT}/lib/*.jar; do
# echo $jar
CP=${CP}:$jar
done
#exit 0;
JVM_ARGS="-Djava.awt.headless=true -cp $CP -Xmx$MEM"
# The java.awt.headless property is needed to render the images of dependency graphs if the server is executed remotely (and there is no GUI stuff involved anyway)
##################################################
## (3) The following changes the behaviour of the system
##################################################
#RERANKER="-reranker" #Uncomment this if you want to use a reranker too. The model is assumed to contain a reranker. While training, the corresponding parameter has to be provided.
CMD="$JAVA $JVM_ARGS se.lth.cs.srl.http.SRLHttpPipeline $Lang $RERANKER -tagger $POS_MODEL -parser $PARSER_MODEL -srl $SRL_MODEL -port $PORT"
if [ "$TOKENIZER_MODEL" != "" ]; then
CMD="$CMD -token $TOKENIZER_MODEL"
fi
if [ "$LEMMATIZER_MODEL" != "" ]; then
CMD="$CMD -lemma $LEMMATIZER_MODEL"
fi
if [ "$MORPH_MODEL" != "" ]; then
CMD="$CMD -morph $MORPH_MODEL"
fi
echo "Executing: $CMD"
$CMD