parent
648f4e53d2
commit
d1ba56be37
@@ -0,0 +1,3 @@
FROM python

RUN pip install lxml
@@ -0,0 +1,13 @@
You might want to mount this whole repo into the Docker container.
Also mount the data locations.

Example container:
```bash
$ docker build . -t my_python
$ docker run \
    -it \
    -v /home/kristjan/git/cjvt-srl-tagging:/cjvt-srl-tagging \
    -v /home/kristjan/some_corpus_data:/some_corpus_data \
    my_python \
    /bin/bash
```
@@ -0,0 +1,23 @@
from parser import parser
import os
from os.path import join
import sys

SSJ500K_2_1 = 27829  # number of sentences

if __name__ == "__main__":
    # make sure you sanitize every input into unicode

    print("parsing ssj")
    # ssj_file = "/home/kristjan/git/diploma/data/ssj500k-sl.TEI/ssj500k-sl.body.xml"
    # ssj_file = "/dipldata/ssj500k-sl.TEI/ssj500k-sl.body.xml"
    ssj_file = "/dipldata/ssj500k-sl.TEI/ssj500k-sl.body.sample.xml"  # smaller file
    ssj_dict = parser.parse_tei(ssj_file)
    # assert (len(ssj_dict) == 27829), "Parsed wrong number of sentences."

    print("parsing kres")
    # kres_file = "../data/kres_example/F0019343.xml.parsed.xml"
    kres_dir = "../data/kres_example/"
    for kres_file in os.listdir(kres_dir):
        parser.parse_tei(join(kres_dir, kres_file))
    print("end parsing kres")
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
@@ -0,0 +1,70 @@
import xml.etree.ElementTree as ET
import random
random.seed(42)
# Split a parsed Kres TEI file into train/dev/test portions.
# NOTE: originally written with Python 2 file handling; updated for Python 3
# (files opened with an explicit encoding, no manual .encode('utf8') calls).
tree = ET.parse('../../data/kres_example/F0006347.xml.parsed.xml')
root = tree.getroot()
print(ET.tostring(root, encoding='unicode'))
train = []
dev = []
test = []
train_text = open('train.txt', 'w', encoding='utf8')
dev_text = open('dev.txt', 'w', encoding='utf8')
test_text = open('test.txt', 'w', encoding='utf8')
for doc in root.iter('{http://www.tei-c.org/ns/1.0}div'):
    # assign each whole document to train/dev/test (80/10/10)
    rand = random.random()
    if rand < 0.8:
        pointer = train
        pointer_text = train_text
    elif rand < 0.9:
        pointer = dev
        pointer_text = dev_text
    else:
        pointer = test
        pointer_text = test_text
    for p in doc.iter('{http://www.tei-c.org/ns/1.0}p'):
        for element in p:
            if element.tag.endswith('s'):
                sentence = element
                text = ''
                tokens = []
                for element in sentence:
                    if element.tag[-3:] == 'seg':
                        for subelement in element:
                            text += subelement.text
                            if not subelement.tag.endswith('}c'):
                                if subelement.tag.endswith('w'):
                                    lemma = subelement.attrib['lemma']
                                else:
                                    lemma = subelement.text
                                tokens.append((subelement.text, lemma, subelement.attrib['ana'].split(':')[1]))
                    if element.tag[-2:] not in ('pc', '}w', '}c'):
                        continue
                    text += element.text
                    if not element.tag.endswith('}c'):
                        if element.tag.endswith('w'):
                            lemma = element.attrib['lemma']
                        else:
                            lemma = element.text
                        tokens.append((element.text, lemma, element.attrib['ana'].split(':')[1]))
                pointer.append((text, tokens))
                pointer_text.write(text)
            else:
                pointer_text.write(element.text)
            pointer_text.write('\n')
    #pointer_text.write('\n')


def write_list(lst, fname):
    f = open(fname, 'w', encoding='utf8')
    for text, tokens in lst:
        f.write('# text = ' + text + '\n')
        for idx, token in enumerate(tokens):
            f.write(str(idx + 1) + '\t' + token[0] + '\t' + token[1] + '\t_\t' + token[2] + '\t_\t_\t_\t_\t_\n')
        f.write('\n')
    f.close()


write_list(train, 'train.conllu')
write_list(dev, 'dev.conllu')
write_list(test, 'test.conllu')
train_text.close()
dev_text.close()
test_text.close()
@@ -0,0 +1,144 @@
#!/usr/bin/python3

from __future__ import print_function, unicode_literals, division
import sys
import os
import re
import pickle
from pathlib import Path

try:
    from lxml import etree as ElementTree
except ImportError:
    import xml.etree.ElementTree as ElementTree


# attributes
ID_ATTR = "id"
LEMMA_ATTR = "lemma"
ANA_ATTR = "ana"


# tags
SENTENCE_TAG = 's'
BIBL_TAG = 'bibl'
PARAGRAPH_TAG = 'p'
PC_TAG = 'pc'
WORD_TAG = 'w'
C_TAG = 'c'
S_TAG = 'S'
SEG_TAG = 'seg'


class Sentence:

    def __init__(self, sentence, s_id):
        self.id = s_id
        self.words = []
        self.text = ""

        for word in sentence:
            self.handle_word(word)

    def handle_word(self, word):
        # handle space after
        if word.tag == S_TAG:
            assert word.text is None
            self.text += ' '
            return

        # ASK am I handling this correctly?
        elif word.tag == SEG_TAG:
            for segword in word:
                self.handle_word(segword)
            return

        # ASK handle unknown tags (are there others?)
        elif word.tag not in (WORD_TAG, C_TAG):
            return

        # ID
        idx = str(len(self.words) + 1)

        # TOKEN
        token = word.text

        # LEMMA
        if word.tag == WORD_TAG:
            lemma = word.get(LEMMA_ATTR)
            assert lemma is not None
        else:
            lemma = token

        # XPOS
        xpos = word.get('msd')
        if word.tag == C_TAG:
            xpos = "Z"
        elif xpos in ("Gp-ppdzn", "Gp-spmzd"):
            xpos = "N"
        elif xpos is None:
            print(self.id)

        # save word entry
        self.words.append(['F{}.{}'.format(self.id, idx), token, lemma, xpos])

        # save for text
        self.text += word.text

    def to_conllu(self):
        lines = []
        # lines.append('# sent_id = ' + self.id)
        # CONLLu does not like spaces at the end of # text
        # lines.append('# text = ' + self.text.strip())
        for word in self.words:
            lines.append('\t'.join('_' if data is None else data for data in word))

        return lines


def convert_file(in_file, out_file):
    print("Loading xml: {}".format(in_file))
    with open(str(in_file), 'r') as fp:
        # strip the default namespace and xml: prefixes so tags match the bare names above
        xmlstring = re.sub(' xmlns="[^"]+"', '', fp.read(), count=1)
        xmlstring = xmlstring.replace(' xml:', ' ')
        xml_tree = ElementTree.XML(xmlstring)

    print("Converting TEI -> TSV-U ...")
    lines = []

    for pidx, paragraph in enumerate(xml_tree.iterfind('.//body/p')):
        sidx = 1
        for sentence in paragraph:
            if sentence.tag != SENTENCE_TAG:
                continue

            sentence = Sentence(sentence, "{}.{}".format(pidx + 1, sidx))
            lines.extend(sentence.to_conllu())
            lines.append('')  # ASK newline between sentences
            sidx += 1

    if len(lines) == 0:
        raise RuntimeError("No sentences found")

    print("Writing output file: {}".format(out_file))
    with open(out_file, 'w') as fp:
        for line in lines:
            if sys.version_info < (3, 0):
                line = line.encode('utf-8')
            print(line, file=fp)


if __name__ == "__main__":
    """
    Input: folder of TEI files, msds are encoded as msd="Z"
    Output: just a folder
    """

    in_folder = sys.argv[1]
    out_folder = sys.argv[2]
    num_processes = int(sys.argv[3])  # currently unused

    files = Path(in_folder).rglob("*.xml")
    in_out = []  # currently unused
    for filename in files:
        out_file = out_folder + "/" + filename.name[:-4] + ".txt"
        convert_file(filename, out_file)
@@ -0,0 +1,86 @@
from lxml import etree
import re

W_TAGS = ['w']
C_TAGS = ['c']
S_TAGS = ['S', 'pc']


# reads a TEI xml file and returns a dictionary:
# { <sentence_id>: {
#     sid: <sentence_id>,  # serves as index in MongoDB
#     text: ,
#     tokens: ,
# }}
def parse_tei(filepath):
    guess_corpus = None  # SSJ | KRES
    res_dict = {}
    with open(filepath, "r") as fp:
        # remove namespaces
        xmlstr = fp.read()
        xmlstr = re.sub('\\sxmlns="[^"]+"', '', xmlstr, count=1)
        xmlstr = re.sub(' xml:', ' ', xmlstr)

        root = etree.XML(xmlstr.encode("utf-8"))

        divs = []  # in ssj, there are divs, in Kres, there are separate files
        if "id" in root.keys():
            # Kres files start with <TEI id=...>
            guess_corpus = "KRES"
            divs = [root]
        else:
            guess_corpus = "SSJ"
            divs = root.findall(".//div")

        # parse divs
        for div in divs:
            f_id = div.get("id")

            # parse paragraphs
            for p in div.findall(".//p"):
                p_id = p.get("id").split(".")[-1]

                # parse sentences
                for s in p.findall(".//s"):
                    s_id = s.get("id").split(".")[-1]
                    sentence_text = ""
                    sentence_tokens = []

                    # parse tokens
                    for el in s.iter():
                        if el.tag in W_TAGS:
                            el_id = el.get("id").split(".")[-1]
                            if el_id[0] == 't':
                                el_id = el_id[1:]  # ssj W_TAG ids start with t
                            sentence_text += el.text
                            sentence_tokens += [(
                                "w",
                                el_id,
                                el.text,
                                el.get("lemma"),
                                (el.get("msd") if guess_corpus == "KRES" else el.get("ana").split(":")[-1]),
                            )]
                        elif el.tag in C_TAGS:
                            el_id = el.get("id") or "none"  # only Kres' C_TAGS have ids
                            el_id = el_id.split(".")[-1]
                            sentence_text += el.text
                            sentence_tokens += [("c", el_id, el.text,)]
                        elif el.tag in S_TAGS:
                            sentence_text += " "  # Kres' <S /> doesn't contain .text
                        else:
                            # pass links and linkGroups
                            # print(el.tag)
                            pass

                    sentence_id = "{}.{}.{}".format(f_id, p_id, s_id)
                    """
                    print(sentence_id)
                    print(sentence_text)
                    print(sentence_tokens)
                    """
                    if sentence_id in res_dict:
                        raise KeyError("duplicated id: {}".format(sentence_id))
                    res_dict[sentence_id] = {
                        "sid": sentence_id,
                        "text": sentence_text,
                        "tokens": sentence_tokens
                    }
    return res_dict
Binary file not shown.
@@ -0,0 +1,91 @@
# mate-tools

Using **Full srl pipeline (including anna-3.3)** from the Downloads section.
Benchmarking the tool for slo and hr: [2] (a submodule of this repo).

The mate-tool for SRL tagging can be found in `./tools/srl-20131216/`.

## train

Create the `model-file`.

`--help` output:
```bash
java -cp srl.jar se.lth.cs.srl.Learn --help
Not enough arguments, aborting.
Usage:
 java -cp <classpath> se.lth.cs.srl.Learn <lang> <input-corpus> <model-file> [options]

 Example:
 java -cp srl.jar:lib/liblinear-1.51-with-deps.jar se.lth.cs.srl.Learn eng ~/corpora/eng/CoNLL2009-ST-English-train.txt eng-srl.mdl -reranker -fdir ~/features/eng -llbinary ~/liblinear-1.6/train

 trains a complete pipeline and reranker based on the corpus and saves it to eng-srl.mdl

 <lang> corresponds to the language and is one of
 chi, eng, ger

 Options:
 -aibeam <int>          the size of the ai-beam for the reranker
 -acbeam <int>          the size of the ac-beam for the reranker
 -help                  prints this message

 Learning-specific options:
 -fdir <dir>            the directory with feature files (see below)
 -reranker              trains a reranker also (not done by default)
 -llbinary <file>       a reference to a precompiled version of liblinear,
                        makes training much faster than the java version.
 -partitions <int>      number of partitions used for the reranker
 -dontInsertGold        don't insert the gold standard proposition during
                        training of the reranker.
 -skipUnknownPredicates skips predicates not matching any POS-tags from
                        the feature files.
 -dontDeleteTrainData   doesn't delete the temporary files from training
                        on exit. (For debug purposes)
 -ndPipeline            Causes the training data and feature mappings to be
                        derived in a non-deterministic way. I.e. training the pipeline
                        on the same corpus twice does not yield the exact same models.
                        This is however slightly faster.

 The feature file dir needs to contain four files with feature sets. See
 the website for further documentation. The files are called
 pi.feats, pd.feats, ai.feats, and ac.feats
 All need to be in the feature file dir, otherwise you will get an error.
```
Input: `lang`, `input-corpus`.
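
For instance, training a Slovenian model on the ssj500k split from [2] might look like the sketch below. This is an untested illustration, not a verified invocation: the `ger` feature functions, the `sl.train.mate` path, and the `sl-srl.mdl` model name are assumptions.
```bash
# hypothetical: train on the mate-formatted ssj500k training split
java -cp srl.jar:lib/liblinear-1.51-with-deps.jar se.lth.cs.srl.Learn ger \
    ./bilateral-srl/tools/mate-tools/sl.train.mate \
    sl-srl.mdl \
    -reranker
```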

## parse

`--help` output:
```bash
$ java -cp srl.jar se.lth.cs.srl.Parse --help
Not enough arguments, aborting.
Usage:
 java -cp <classpath> se.lth.cs.srl.Parse <lang> <input-corpus> <model-file> [options] <output>

 Example:
 java -cp srl.jar:lib/liblinear-1.51-with-deps.jar se.lth.cs.srl.Parse eng ~/corpora/eng/CoNLL2009-ST-English-evaluation-SRLonly.txt eng-srl.mdl -reranker -nopi -alfa 1.0 eng-eval.out

 parses the input corpus using the model eng-srl.mdl and saves it to eng-eval.out, using a reranker and skipping the predicate identification step

 <lang> corresponds to the language and is one of
 chi, eng, ger

 Options:
 -aibeam <int>   the size of the ai-beam for the reranker
 -acbeam <int>   the size of the ac-beam for the reranker
 -help           prints this message

 Parsing-specific options:
 -nopi           skips the predicate identification. This is equivalent to the
                 setting in the CoNLL 2009 ST.
 -reranker       uses a reranker (assumed to be included in the model)
 -alfa <double>  the alfa used by the reranker. (default 1.0)
```
We need to provide `lang` (`ger` for German feature functions?), `input-corpus`, and `model` (see train).
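
A matching parse call for the model sketched in the train section might be the following (again an untested sketch; the test-split path and the `sl-test.out` output name are assumptions):
```bash
# hypothetical: tag the mate-formatted ssj500k test split with the trained model
java -cp srl.jar:lib/liblinear-1.51-with-deps.jar se.lth.cs.srl.Parse ger \
    ./bilateral-srl/tools/mate-tools/sl.test.mate \
    sl-srl.mdl \
    -reranker \
    sl-test.out
```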

## input data

* `ssj500k` data found in `./bilateral-srl/data/sl/sl.{test,train}`;
  formatted for mate-tools usage in `./bilateral-srl/tools/mate-tools/sl.{test,train}.mate` (line counts match).

## Sources

* [1] (mate-tools) https://code.google.com/archive/p/mate-tools/
* [2] (benchmarking) https://github.com/clarinsi/bilateral-srl
* [3] (CoNLL 2008 paper) http://www.aclweb.org/anthology/W08-2121.pdf
* [4] (CoNLL 2009 format) https://wiki.ufal.ms.mff.cuni.cz/format-conll