forked from kristjan/cjvt-srl-tagging
added mate-tools srl parser
83
tools/srl-20131216/scripts/README
Normal file
@@ -0,0 +1,83 @@
This folder contains some scripts to execute the system. They all need
to be edited slightly before use with proper paths to corpora and/or
models. They are meant to be executed from the parent directory, e.g.

  cd $DIST_ROOT
  sh scripts/learn.sh

There are some comments in the scripts on switches etc. The amount of
memory used is typically what is required for the English CoNLL 2009
corpora. It might be possible to push it down a bit.


Since the system grew out of the CoNLL 2009 ST, there are a couple of
different ways to parse a corpus:

(i) parse_full.sh - parses a complete corpus using all steps of the
                    pipeline except tokenization. It pretty much
                    assumes that the file contains tokens in the
                    second column and disregards the rest.
                    If the -nopi switch is used, it needs to have the
                    IsPred column from the CoNLL 2009 data format.

(ii) parse_srl_only.sh - parses semantic roles only. The input is
                         expected to be the CoNLL 2009 data format
                         with proper dependency trees (i.e. the
                         SRLonly evaluation corpus).
                         In order to replicate the setting of the 2009
                         ST, one can use the -nopi switch to skip the
                         predicate identification step.


Then there is also the HTTP interface. This is started by the
run_http_server.sh script. Again, edit the file with proper paths
before executing it.

NOTE: the HTTP server depends on the Java package
com.sun.net.httpserver (cf.
http://download.oracle.com/javase/6/docs/jre/api/net/httpserver/spec/com/sun/net/httpserver/package-summary.html
and
http://blogs.sun.com/michaelmcm/entry/http_server_api_in_java
), which is not part of the official Java specification, but comes with
most (or at least some) JRE distributions. From my own experience,
it is included in the Sun Java 6 distribution* as well as the OpenJDK
Java 6**.

[[
*:
On a Mac:
% java -version
java version "1.6.0_17"
Java(TM) SE Runtime Environment (build 1.6.0_17-b04-248-10M3025)
Java HotSpot(TM) 64-Bit Server VM (build 14.3-b01-101, mixed mode)
%

and
On PC, Mandriva Linux:
% /usr/lib/jvm/java-sun/bin/java -version
java version "1.6.0_15"
Java(TM) SE Runtime Environment (build 1.6.0_15-b03)
Java HotSpot(TM) 64-Bit Server VM (build 14.1-b02, mixed mode)
%

**:
On PC, Mandriva Linux:
% java -version
java version "1.6.0_18"
OpenJDK Runtime Environment (IcedTea6 1.8) (mandriva-2.b18.2mdv2009.1-x86_64)
OpenJDK 64-Bit Server VM (build 14.0-b16, mixed mode)
%
]]

The graphical dependency graph output of the HTTP interface relies on
having Chinese fonts installed properly locally. On Linux I had some
issues with this, but resolved them according to
http://blog.lizhao.net/2007/03/java-chinese-fonts-on-ubuntu.html

The graphical dependency graph output also seems to work less well
when using OpenJDK. The images get strange lines through them. The Sun
JRE seems to work fine though.


Feedback and questions are appreciated: anders@ims.uni-stuttgart.de
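Because the README reports that graph rendering differs between the Sun JRE and OpenJDK, it can help to check which one the java binary resolves to before starting the HTTP server. The following is a minimal sketch and not part of the distribution: jre_flavour is a hypothetical helper name, and it only looks for the strings that appear in the `java -version` listings quoted in the README.

```shell
#!/bin/sh
# Hypothetical helper (not part of the distribution): classify the text
# printed by `java -version`, read from stdin, as "openjdk", "sun" or "unknown".
jre_flavour() {
    awk '
        /OpenJDK/    { found = "openjdk" }
        /Java\(TM\)/ { if (found == "") found = "sun" }
        END {
            if (found == "") found = "unknown"
            print found
        }
    '
}
```

Since `java -version` writes to standard error, a call would look like `"$JAVA" -version 2>&1 | jre_flavour`, where `$JAVA` is the variable already set in the scripts.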
38
tools/srl-20131216/scripts/learn.sh
Normal file
@@ -0,0 +1,38 @@
#!/bin/sh

## There are three sets of options that need, may need to, and could be changed.
## (1) deals with input and output. You have to set these (in particular, you need to provide a training corpus)
## (2) deals with the jvm parameters and may need to be changed
## (3) deals with the behaviour of the system

## For further information on switches, see the source code, or run
## java -cp srl.jar se.lth.cs.srl.Learn --help

##################################################
## (1) The following needs to be set appropriately
##################################################
CORPUS=~/corpora/conll09/spa/CoNLL2009-ST-Spanish-train.txt.pdeps #training corpus
Lang="spa"
MODEL="srl-$Lang.model"

##################################################
## (2) These ones may need to be changed
##################################################
JAVA="java" #Edit this if you want to use a specific java binary.
MEM="4g" #Memory for the JVM, might need to be increased for large corpora.
CP="srl.jar:lib/liblinear-1.51-with-deps.jar"
JVM_ARGS="-cp $CP -Xmx$MEM"

##################################################
## (3) The following changes the behaviour of the system
##################################################
#LLBINARY="-llbinary /home/anders/liblinear-1.6/train" #Path to locally compiled liblinear. Uncomment this and correct the path if you have it. This will make training models faster (30-40%). The models come out slightly different from those of the java version though, due to floating point arithmetic.
#RERANKER="-reranker" #Uncomment this if you want to train a reranker too. This takes about 8 times longer than the simple pipeline.


#Execute
CMD="$JAVA $JVM_ARGS se.lth.cs.srl.Learn $Lang $CORPUS $MODEL $RERANKER $LLBINARY"
echo "Executing: $CMD"

$CMD
59
tools/srl-20131216/scripts/parse_full.sh
Normal file
@@ -0,0 +1,59 @@
#!/bin/sh

## There are three sets of options that need, may need to, and could be changed.
## (1) deals with input and output. You have to set these (in particular, you need to provide models)
## (2) deals with the jvm parameters and may need to be changed
## (3) deals with the behaviour of the system

## For further information on switches, see the source code, or run
## java -cp srl.jar se.lth.cs.srl.Parse --help

##################################################
## (1) The following needs to be set appropriately
##################################################
#INPUT="/home/anders/corpora/conll09/eng/CoNLL2009-evaluation-English-SRLonly.txt" #evaluation corpus
INPUT=/home/anders/corpora/conll09/chi/CoNLL2009-ST-evaluation-Chinese-SRLonly.txt
Lang="chi" #Named Lang rather than LANG so that the locale environment variable is not overwritten.
##TOKENIZER_MODEL="models/eng/EnglishTok.bin.gz" #This is not used here anyway. The input is assumed to be segmented/tokenized already.
##LEMMATIZER_MODEL="models/chi/lemma-eng.model"
POS_MODEL="models/chi/tag-chn.model"
#MORPH_MODEL="models/ger/morph-ger.model" #Morphological tagger is not applicable to English. Fix the path and uncomment if you are running German.
PARSER_MODEL="models/chi/prs-chn.model"
SRL_MODEL="models/chi/srl-chn.model"
OUTPUT="${Lang}.out"

##################################################
## (2) These ones may need to be changed
##################################################
JAVA="java" #Edit this if you want to use a specific JRE.
MEM="4g" #Memory for the JVM, might need to be increased for large corpora.
CP="srl.jar:lib/anna.jar:lib/liblinear-1.51-with-deps.jar:lib/opennlp-tools-1.4.3.jar:lib/maxent-2.5.2.jar:lib/trove.jar:lib/seg.jar"
JVM_ARGS="-cp $CP -Xmx$MEM"

##################################################
## (3) The following changes the behaviour of the system
##################################################
#RERANKER="-reranker" #Uncomment this if you want to use a reranker too. The model is assumed to contain a reranker. While training, the corresponding parameter has to be provided.
#NOPI="-nopi" #Uncomment this if you want to skip the predicate identification step.



##################################################

CMD="$JAVA $JVM_ARGS se.lth.cs.srl.CompletePipeline $Lang $NOPI $RERANKER -tagger $POS_MODEL -parser $PARSER_MODEL -srl $SRL_MODEL -test $INPUT -out $OUTPUT"

if [ "$TOKENIZER_MODEL" != "" ]; then
    CMD="$CMD -token $TOKENIZER_MODEL"
fi

if [ "$LEMMATIZER_MODEL" != "" ]; then
    CMD="$CMD -lemma $LEMMATIZER_MODEL"
fi

if [ "$MORPH_MODEL" != "" ]; then
    CMD="$CMD -morph $MORPH_MODEL"
fi

echo "Executing: $CMD"

$CMD
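parse_full.sh assumes input that is already tokenized, and the README says it reads the tokens from the second column and disregards the rest. As a rough sketch of how such a file could be produced from plain text with one sentence per line and whitespace-separated tokens: to_conll_columns is a hypothetical helper name, and padding the remaining fields with underscores up to the 14 fixed CoNLL 2009 columns is an assumption, not something the scripts document.

```shell
#!/bin/sh
# Hypothetical helper (not part of the distribution): convert text with one
# whitespace-tokenized sentence per line into tab-separated columns with the
# token index in column 1 and the token in column 2. Columns 3-14 are filled
# with "_" placeholders (assumed padding). Sentences end with an empty line.
to_conll_columns() {
    awk '{
        for (i = 1; i <= NF; i++) {
            line = i "\t" $i
            for (c = 3; c <= 14; c++) line = line "\t_"
            print line
        }
        print ""
    }' "$@"
}
```

A call such as `to_conll_columns tokenized.txt > input.conll` (both file names are placeholders) would then give a file that INPUT in the script can point to.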
37
tools/srl-20131216/scripts/parse_srl_only.sh
Normal file
@@ -0,0 +1,37 @@
#!/bin/sh

## There are three sets of options that need, may need to, and could be changed.
## (1) deals with input and output. You have to set these (in particular, you need to provide an input corpus and a model)
## (2) deals with the jvm parameters and may need to be changed
## (3) deals with the behaviour of the system

## For further information on switches, see the source code, or run
## java -cp srl.jar se.lth.cs.srl.Parse --help

##################################################
## (1) The following needs to be set appropriately
##################################################
INPUT=~/corpora/conll09/spa/CoNLL2009-ST-evaluation-Spanish-SRLonly.txt
Lang="spa"
MODEL="./srl-spa.model"
OUTPUT="${Lang}-eval.out"

##################################################
## (2) These ones may need to be changed
##################################################
JAVA="java" #Edit this if you want to use a specific java binary.
MEM="2g" #Memory for the JVM, might need to be increased for large corpora.
CP="srl.jar:lib/liblinear-1.51-with-deps.jar:lib/anna.jar"
JVM_ARGS="-cp $CP -Xmx$MEM"

##################################################
## (3) The following changes the behaviour of the system
##################################################
#RERANKER="-reranker" #Uncomment this if you want to use a reranker too. The model is assumed to contain a reranker. While training, this has to be set appropriately.
NOPI="-nopi" #Skips the predicate identification step, which is equivalent to the CoNLL 2009 ST setting. Comment this out to run predicate identification.


CMD="$JAVA $JVM_ARGS se.lth.cs.srl.Parse $Lang $INPUT $MODEL $RERANKER $NOPI $OUTPUT"
echo "Executing: $CMD"

$CMD
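With -nopi set as in parse_srl_only.sh, predicate identification is skipped, so the input has to mark its predicates already. A small pre-flight check could count the marked predicates before starting the JVM. This is only a sketch: count_marked_predicates is a hypothetical helper name, and it assumes the standard CoNLL 2009 layout, in which column 13 (FILLPRED) holds "Y" for predicate tokens.

```shell
#!/bin/sh
# Hypothetical helper (not part of the distribution): count tokens whose
# 13th tab-separated column is "Y", i.e. tokens flagged as predicates in
# the assumed CoNLL 2009 layout. Reads the given files, or stdin if none.
count_marked_predicates() {
    awk -F'\t' 'NF >= 13 && $13 == "Y" { n++ } END { print n + 0 }' "$@"
}
```

If `count_marked_predicates "$INPUT"` prints 0, the file most likely lacks predicate marks and running with -nopi would leave the parser with nothing to label.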
57
tools/srl-20131216/scripts/run_anna_http_server.sh
Normal file
@@ -0,0 +1,57 @@
#!/bin/sh

## There are three sets of options that need, may need to, and could be changed.
## (1) deals with input and output. You have to set these (in particular, you need to provide models)
## (2) deals with the jvm parameters and may need to be changed
## (3) deals with the behaviour of the system

##################################################
## (1) The following needs to be set appropriately
##################################################
Lang="eng"
MODELDIR=`dirname $0`/../../models/eng/
#TOKENIZER_MODEL=${MODELDIR}/en-token.bin #If tokenizer is blank, it will use some default (Stanford for English, Exner for Swedish, and whitespace otherwise)
#TOKENIZER_MODEL="models/chi/stanford-chinese-segmenter-2008-05-21/data" #Use this for Chinese.
LEMMATIZER_MODEL=${MODELDIR}/CoNLL2009-ST-English-ALL.anna-3.3.lemmatizer.model
POS_MODEL=${MODELDIR}/CoNLL2009-ST-English-ALL.anna-3.3.postagger.model
#MORPH_MODEL=${MODELDIR}/ #No morph model for English.
PARSER_MODEL=${MODELDIR}/CoNLL2009-ST-English-ALL.anna-3.3.parser.model

PORT=8073 #The port to listen on

##################################################
## (2) These ones may need to be changed
##################################################
JAVA="java" #Edit this if you want to use a specific java binary.
MEM="4g" #Memory for the JVM, might need to be increased for large corpora.
DIST_ROOT=`dirname $0`/..
CP=${DIST_ROOT}/srl.jar
for jar in ${DIST_ROOT}/lib/*.jar; do
    # echo $jar
    CP=${CP}:$jar
done
#exit 0;
JVM_ARGS="-Djava.awt.headless=true -cp $CP -Xmx$MEM"
# The java.awt.headless property is needed to render the images of dependency graphs if the server is executed remotely (and there is no GUI stuff involved anyway)

##################################################
## (3) The following changes the behaviour of the system
##################################################
#RERANKER="-reranker" #Uncomment this if you want to use a reranker too. The model is assumed to contain a reranker. While training, the corresponding parameter has to be provided.

CMD="$JAVA $JVM_ARGS se.lth.cs.srl.http.AnnaHttpPipeline $Lang $RERANKER -tagger $POS_MODEL -parser $PARSER_MODEL -port $PORT"

if [ "$TOKENIZER_MODEL" != "" ]; then
    CMD="$CMD -token $TOKENIZER_MODEL"
fi

if [ "$LEMMATIZER_MODEL" != "" ]; then
    CMD="$CMD -lemma $LEMMATIZER_MODEL"
fi

if [ "$MORPH_MODEL" != "" ]; then
    CMD="$CMD -morph $MORPH_MODEL"
fi

echo "Executing: $CMD"
$CMD
61
tools/srl-20131216/scripts/run_http_server.sh
Normal file
@@ -0,0 +1,61 @@
#!/bin/sh

## There are three sets of options that need, may need to, and could be changed.
## (1) deals with input and output. You have to set these (in particular, you need to provide models)
## (2) deals with the jvm parameters and may need to be changed
## (3) deals with the behaviour of the system

## For further information on switches, see the source code, or run
## java -cp srl-20100902.jar se.lth.cs.srl.http.HttpPipeline

##################################################
## (1) The following needs to be set appropriately
##################################################
Lang="eng"
MODELDIR=`dirname $0`/../../models/eng/
#TOKENIZER_MODEL=${MODELDIR}/en-token.bin #If tokenizer is blank, it will use some default (Stanford for English, Exner for Swedish, and whitespace otherwise)
#TOKENIZER_MODEL="models/chi/stanford-chinese-segmenter-2008-05-21/data" #Use this for Chinese.
LEMMATIZER_MODEL=${MODELDIR}/CoNLL2009-ST-English-ALL.anna-3.3.lemmatizer.model
POS_MODEL=${MODELDIR}/CoNLL2009-ST-English-ALL.anna-3.3.postagger.model
#MORPH_MODEL=${MODELDIR}/ #No morph model for English.
PARSER_MODEL=${MODELDIR}/CoNLL2009-ST-English-ALL.anna-3.3.parser.model
SRL_MODEL=${MODELDIR}/CoNLL2009-ST-English-ALL.anna-3.3.srl.model

PORT=8072 #The port to listen on

##################################################
## (2) These ones may need to be changed
##################################################
JAVA="java" #Edit this if you want to use a specific java binary.
MEM="4g" #Memory for the JVM, might need to be increased for large corpora.
DIST_ROOT=`dirname $0`/..
CP=${DIST_ROOT}/srl.jar
for jar in ${DIST_ROOT}/lib/*.jar; do
    # echo $jar
    CP=${CP}:$jar
done
#exit 0;
JVM_ARGS="-Djava.awt.headless=true -cp $CP -Xmx$MEM"
# The java.awt.headless property is needed to render the images of dependency graphs if the server is executed remotely (and there is no GUI stuff involved anyway)

##################################################
## (3) The following changes the behaviour of the system
##################################################
#RERANKER="-reranker" #Uncomment this if you want to use a reranker too. The model is assumed to contain a reranker. While training, the corresponding parameter has to be provided.

CMD="$JAVA $JVM_ARGS se.lth.cs.srl.http.SRLHttpPipeline $Lang $RERANKER -tagger $POS_MODEL -parser $PARSER_MODEL -srl $SRL_MODEL -port $PORT"

if [ "$TOKENIZER_MODEL" != "" ]; then
    CMD="$CMD -token $TOKENIZER_MODEL"
fi

if [ "$LEMMATIZER_MODEL" != "" ]; then
    CMD="$CMD -lemma $LEMMATIZER_MODEL"
fi

if [ "$MORPH_MODEL" != "" ]; then
    CMD="$CMD -morph $MORPH_MODEL"
fi

echo "Executing: $CMD"
$CMD