added mate-tools srl parser

This commit is contained in:
voje
2019-01-25 07:29:52 +01:00
parent e19d5f4599
commit 587d2d9fdd
38 changed files with 1670 additions and 1 deletions


@@ -0,0 +1,83 @@
This folder contains some scripts to execute the system. They all need
to be edited slightly before use, with proper paths to corpora and/or
models. They are meant to be executed from the parent directory, e.g.
cd $DIST_ROOT
sh scripts/learn.sh
There are some comments in the scripts on switches etc. The amount of
memory used is typically what is required for the English CoNLL 2009
corpora; it might be possible to push it down a bit.
Since the system grew out of the CoNLL 2009 ST, there are a couple of
different ways to parse a corpus:
(i) parse_full.sh - parses a complete corpus using all steps of the
pipeline except tokenization. It pretty much
assumes that the file contains tokens in the
second column and disregards the rest.
If the -nopi switch is used, the input needs to have the
IsPred column from the CoNLL 2009 data format.
(ii) parse_srl_only.sh - parses semantic roles only. The input is
expected to be the CoNLL 2009 data format
with proper dependency trees (i.e. the
SRLonly evaluation corpus).
In order to replicate the setting of the 2009
ST, one can use the -nopi switch to skip the
predicate identification step.
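The choice between the two entry points can be sketched as follows. This is purely illustrative: the script names are the ones described above, but the decision variable is hypothetical and not part of the distribution.

```shell
#!/bin/sh
# Illustrative sketch: pick the entry point based on the input format.
# "yes" if the input is CoNLL 2009 format with proper dependency trees.
INPUT_HAS_DEPS="yes"
if [ "$INPUT_HAS_DEPS" = "yes" ]; then
  SCRIPT="scripts/parse_srl_only.sh"   # semantic roles only
else
  SCRIPT="scripts/parse_full.sh"       # full pipeline (except tokenization)
fi
echo "Would run: sh $SCRIPT"
```

Remember that the -nopi switch is toggled inside the scripts themselves, not on their command line.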
Then there is also the HTTP interface. This is started by the
run_http_server.sh script. Again, edit the file with proper paths
before executing it.
NOTE: the http server depends on the java package
com.sun.net.httpserver (cf.
http://download.oracle.com/javase/6/docs/jre/api/net/httpserver
/spec/com/sun/net/httpserver/package-summary.html
and
http://blogs.sun.com/michaelmcm/entry/http_server_api_in_java
), which is not part of the official Java API specification, but comes with
most (or at least some) JRE distributions. In my own experience,
it is included in the Sun Java 6 distribution* as well as in the OpenJDK
Java 6**.
[[
*:
On a Mac:
% java -version
java version "1.6.0_17"
Java(TM) SE Runtime Environment (build 1.6.0_17-b04-248-10M3025)
Java HotSpot(TM) 64-Bit Server VM (build 14.3-b01-101, mixed mode)
%
and
On pc, mandriva linux:
% /usr/lib/jvm/java-sun/bin/java -version
java version "1.6.0_15"
Java(TM) SE Runtime Environment (build 1.6.0_15-b03)
Java HotSpot(TM) 64-Bit Server VM (build 14.1-b02, mixed mode)
%
**:
On pc, mandriva linux:
% java -version
java version "1.6.0_18"
OpenJDK Runtime Environment (IcedTea6 1.8) (mandriva-2.b18.2mdv2009.1-x86_64)
OpenJDK 64-Bit Server VM (build 14.0-b16, mixed mode)
%
]]
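A quick way to check whether com.sun.net.httpserver is available on a given machine is to ask javap for the class. This is a sketch; javap ships with the JDK, so it only works where a JDK (not just a JRE) is installed.

```shell
#!/bin/sh
# Probe for the com.sun.net.httpserver package by looking up its main class.
# javap exits non-zero if the class cannot be found on the bootclasspath.
if javap com.sun.net.httpserver.HttpServer >/dev/null 2>&1; then
  HAVE_HTTPSERVER="yes"
else
  HAVE_HTTPSERVER="no"
fi
echo "com.sun.net.httpserver available: $HAVE_HTTPSERVER"
```

If this reports "no", run_http_server.sh will fail at startup with a NoClassDefFoundError for the httpserver classes.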
The graphical dependency graph output of the HTTP interface relies on
having Chinese fonts installed properly locally. On Linux I had some issues
with this, but resolved them according to
http://blog.lizhao.net/2007/03/java-chinese-fonts-on-ubuntu.html
The graphical dependency graph output also seems to work less well
when using OpenJDK: the images get strange lines through them. The Sun
JRE seems to work fine, though.
Feedback and questions are appreciated: anders@ims.uni-stuttgart.de


@@ -0,0 +1,38 @@
#!/bin/sh
## There are three sets of options: ones that need to, may need to, or could be changed.
## (1) deals with input and output. You have to set these (in particular, you need to provide a training corpus)
## (2) deals with the jvm parameters and may need to be changed
## (3) deals with the behaviour of the system
## For further information on switches, see the source code, or run
## java -cp srl.jar se.lth.cs.srl.Learn --help
##################################################
## (1) The following needs to be set appropriately
##################################################
CORPUS=~/corpora/conll09/spa/CoNLL2009-ST-Spanish-train.txt.pdeps #training corpus
Lang="spa"
MODEL="srl-$Lang.model"
##################################################
## (2) These ones may need to be changed
##################################################
JAVA="java" #Edit this if you want to use a specific java binary.
MEM="4g" #Memory for the JVM, might need to be increased for large corpora.
CP="srl.jar:lib/liblinear-1.51-with-deps.jar"
JVM_ARGS="-cp $CP -Xmx$MEM"
##################################################
## (3) The following changes the behaviour of the system
##################################################
#LLBINARY="-llbinary /home/anders/liblinear-1.6/train" #Path to locally compiled liblinear. Uncomment this and correct the path if you have it. This will make training models faster (30-40%). The models come out slightly different compared to the Java version, though, due to floating-point arithmetic.
#RERANKER="-reranker" #Uncomment this if you want to train a reranker too. This takes about 8 times longer than the simple pipeline.
#Execute
CMD="$JAVA $JVM_ARGS se.lth.cs.srl.Learn $Lang $CORPUS $MODEL $RERANKER $LLBINARY"
echo "Executing: $CMD"
$CMD


@@ -0,0 +1,59 @@
#!/bin/sh
## There are three sets of options: ones that need to, may need to, or could be changed.
## (1) deals with input and output. You have to set these (in particular, you need to provide models)
## (2) deals with the jvm parameters and may need to be changed
## (3) deals with the behaviour of the system
## For further information on switches, see the source code, or run
## java -cp srl.jar se.lth.cs.srl.Parse --help
##################################################
## (1) The following needs to be set appropriately
##################################################
#INPUT="/home/anders/corpora/conll09/eng/CoNLL2009-evaluation-English-SRLonly.txt" #evaluation corpus
INPUT=/home/anders/corpora/conll09/chi/CoNLL2009-ST-evaluation-Chinese-SRLonly.txt
LANG="chi"
##TOKENIZER_MODEL="models/eng/EnglishTok.bin.gz" #This is not used here anyway. The input is assumed to be segmented/tokenized already.
##LEMMATIZER_MODEL="models/chi/lemma-eng.model"
POS_MODEL="models/chi/tag-chn.model"
#MORPH_MODEL="models/ger/morph-ger.model" #Morphological tagger is not applicable to English. Fix the path and uncomment if you are running german.
PARSER_MODEL="models/chi/prs-chn.model"
SRL_MODEL="models/chi/srl-chn.model"
OUTPUT="$LANG.out"
##################################################
## (2) These ones may need to be changed
##################################################
JAVA="java" #Edit this if you want to use a specific JRE.
MEM="4g" #Memory for the JVM, might need to be increased for large corpora.
CP="srl.jar:lib/anna.jar:lib/liblinear-1.51-with-deps.jar:lib/opennlp-tools-1.4.3.jar:lib/maxent-2.5.2.jar:lib/trove.jar:lib/seg.jar"
JVM_ARGS="-cp $CP -Xmx$MEM"
##################################################
## (3) The following changes the behaviour of the system
##################################################
#RERANKER="-reranker" #Uncomment this if you want to use a reranker too. The model is assumed to contain a reranker. While training, the corresponding parameter has to be provided.
#NOPI="-nopi" #Uncomment this if you want to skip the predicate identification step.
##################################################
CMD="$JAVA $JVM_ARGS se.lth.cs.srl.CompletePipeline $LANG $NOPI $RERANKER -tagger $POS_MODEL -parser $PARSER_MODEL -srl $SRL_MODEL -test $INPUT -out $OUTPUT"
if [ "$TOKENIZER_MODEL" != "" ]; then
CMD="$CMD -token $TOKENIZER_MODEL"
fi
if [ "$LEMMATIZER_MODEL" != "" ]; then
CMD="$CMD -lemma $LEMMATIZER_MODEL"
fi
if [ "$MORPH_MODEL" != "" ]; then
CMD="$CMD -morph $MORPH_MODEL"
fi
echo "Executing: $CMD"
$CMD
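The three if-blocks above append a model flag only when the corresponding variable is set. An equivalent, more compact pattern uses POSIX ${VAR:+...} parameter expansion; the sketch below uses the same flag names as the script, with example model paths that are not part of the distribution.

```shell
#!/bin/sh
# Sketch: append optional flags via ${VAR:+...} instead of if-blocks.
# When the variable is empty or unset, the whole expansion vanishes.
LEMMATIZER_MODEL="models/chi/lemma-chn.model"  # example value
MORPH_MODEL=""                                 # empty -> flag omitted entirely
CMD="java se.lth.cs.srl.CompletePipeline chi"
CMD="$CMD${LEMMATIZER_MODEL:+ -lemma $LEMMATIZER_MODEL}"
CMD="$CMD${MORPH_MODEL:+ -morph $MORPH_MODEL}"
echo "$CMD"
```

Both styles behave identically; the if-blocks are arguably easier to comment out per flag.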


@@ -0,0 +1,37 @@
#!/bin/sh
## There are three sets of options: ones that need to, may need to, or could be changed.
## (1) deals with input and output. You have to set these (in particular, you need to provide a training corpus)
## (2) deals with the jvm parameters and may need to be changed
## (3) deals with the behaviour of the system
## For further information on switches, see the source code, or run
## java -cp srl.jar se.lth.cs.srl.Parse --help
##################################################
## (1) The following needs to be set appropriately
##################################################
INPUT=~/corpora/conll09/spa/CoNLL2009-ST-evaluation-Spanish-SRLonly.txt
Lang="spa"
MODEL="./srl-spa.model"
OUTPUT="${Lang}-eval.out"
##################################################
## (2) These ones may need to be changed
##################################################
JAVA="java" #Edit this if you want to use a specific java binary.
MEM="2g" #Memory for the JVM, might need to be increased for large corpora.
CP="srl.jar:lib/liblinear-1.51-with-deps.jar:lib/anna.jar"
JVM_ARGS="-cp $CP -Xmx$MEM"
##################################################
## (3) The following changes the behaviour of the system
##################################################
#RERANKER="-reranker" #Uncomment this if you want to use a reranker too. The model is assumed to contain a reranker. While training, this has to be set appropriately.
NOPI="-nopi" #Comment this out if you want to include the predicate identification step. With -nopi set, this is equivalent to the CoNLL 2009 ST setting.
CMD="$JAVA $JVM_ARGS se.lth.cs.srl.Parse $Lang $INPUT $MODEL $RERANKER $NOPI $OUTPUT"
echo "Executing: $CMD"
$CMD


@@ -0,0 +1,57 @@
#!/bin/sh
## There are three sets of options: ones that need to, may need to, or could be changed.
## (1) deals with input and output. You have to set these (in particular, you need to provide models)
## (2) deals with the jvm parameters and may need to be changed
## (3) deals with the behaviour of the system
##################################################
## (1) The following needs to be set appropriately
##################################################
Lang="eng"
MODELDIR=`dirname $0`/../../models/eng/
#TOKENIZER_MODEL=${MODELDIR}/en-token.bin #If tokenizer is blank, it will use some default (Stanford for English, Exner for Swedish, and whitespace otherwise)
#TOKENIZER_MODEL="models/chi/stanford-chinese-segmenter-2008-05-21/data" #Use this for chinese.
LEMMATIZER_MODEL=${MODELDIR}/CoNLL2009-ST-English-ALL.anna-3.3.lemmatizer.model
POS_MODEL=${MODELDIR}/CoNLL2009-ST-English-ALL.anna-3.3.postagger.model
#MORPH_MODEL=${MODELDIR}/ #No morph model for English.
PARSER_MODEL=${MODELDIR}/CoNLL2009-ST-English-ALL.anna-3.3.parser.model
PORT=8073 #The port to listen on
##################################################
## (2) These ones may need to be changed
##################################################
JAVA="java" #Edit this if you want to use a specific java binary.
MEM="4g" #Memory for the JVM, might need to be increased for large corpora.
DIST_ROOT=`dirname $0`/..
CP=${DIST_ROOT}/srl.jar
for jar in ${DIST_ROOT}/lib/*.jar; do
# echo $jar
CP=${CP}:$jar
done
#exit 0;
JVM_ARGS="-Djava.awt.headless=true -cp $CP -Xmx$MEM"
# The java.awt.headless property is needed to render the images of dependency graphs if the server is executed remotely (and there is no GUI stuff involved anyway)
##################################################
## (3) The following changes the behaviour of the system
##################################################
#RERANKER="-reranker" #Uncomment this if you want to use a reranker too. The model is assumed to contain a reranker. While training, the corresponding parameter has to be provided.
CMD="$JAVA $JVM_ARGS se.lth.cs.srl.http.AnnaHttpPipeline $Lang $RERANKER -tagger $POS_MODEL -parser $PARSER_MODEL -port $PORT"
if [ "$TOKENIZER_MODEL" != "" ]; then
CMD="$CMD -token $TOKENIZER_MODEL"
fi
if [ "$LEMMATIZER_MODEL" != "" ]; then
CMD="$CMD -lemma $LEMMATIZER_MODEL"
fi
if [ "$MORPH_MODEL" != "" ]; then
CMD="$CMD -morph $MORPH_MODEL"
fi
echo "Executing: $CMD"
$CMD


@@ -0,0 +1,61 @@
#!/bin/sh
## There are three sets of options: ones that need to, may need to, or could be changed.
## (1) deals with input and output. You have to set these (in particular, you need to provide models)
## (2) deals with the jvm parameters and may need to be changed
## (3) deals with the behaviour of the system
## For further information on switches, see the source code, or run
## java -cp srl-20100902.jar se.lth.cs.srl.http.HttpPipeline
##################################################
## (1) The following needs to be set appropriately
##################################################
Lang="eng"
MODELDIR=`dirname $0`/../../models/eng/
#TOKENIZER_MODEL=${MODELDIR}/en-token.bin #If tokenizer is blank, it will use some default (Stanford for English, Exner for Swedish, and whitespace otherwise)
#TOKENIZER_MODEL="models/chi/stanford-chinese-segmenter-2008-05-21/data" #Use this for chinese.
LEMMATIZER_MODEL=${MODELDIR}/CoNLL2009-ST-English-ALL.anna-3.3.lemmatizer.model
POS_MODEL=${MODELDIR}/CoNLL2009-ST-English-ALL.anna-3.3.postagger.model
#MORPH_MODEL=${MODELDIR}/ #No morph model for English.
PARSER_MODEL=${MODELDIR}/CoNLL2009-ST-English-ALL.anna-3.3.parser.model
SRL_MODEL=${MODELDIR}/CoNLL2009-ST-English-ALL.anna-3.3.srl.model
PORT=8072 #The port to listen on
##################################################
## (2) These ones may need to be changed
##################################################
JAVA="java" #Edit this if you want to use a specific java binary.
MEM="4g" #Memory for the JVM, might need to be increased for large corpora.
DIST_ROOT=`dirname $0`/..
CP=${DIST_ROOT}/srl.jar
for jar in ${DIST_ROOT}/lib/*.jar; do
# echo $jar
CP=${CP}:$jar
done
#exit 0;
JVM_ARGS="-Djava.awt.headless=true -cp $CP -Xmx$MEM"
# The java.awt.headless property is needed to render the images of dependency graphs if the server is executed remotely (and there is no GUI stuff involved anyway)
##################################################
## (3) The following changes the behaviour of the system
##################################################
#RERANKER="-reranker" #Uncomment this if you want to use a reranker too. The model is assumed to contain a reranker. While training, the corresponding parameter has to be provided.
CMD="$JAVA $JVM_ARGS se.lth.cs.srl.http.SRLHttpPipeline $Lang $RERANKER -tagger $POS_MODEL -parser $PARSER_MODEL -srl $SRL_MODEL -port $PORT"
if [ "$TOKENIZER_MODEL" != "" ]; then
CMD="$CMD -token $TOKENIZER_MODEL"
fi
if [ "$LEMMATIZER_MODEL" != "" ]; then
CMD="$CMD -lemma $LEMMATIZER_MODEL"
fi
if [ "$MORPH_MODEL" != "" ]; then
CMD="$CMD -morph $MORPH_MODEL"
fi
echo "Executing: $CMD"
$CMD