Index: branches/weighted-transfer/apertium-weights-learner/README.md =================================================================== --- branches/weighted-transfer/apertium-weights-learner/README.md (revision 72315) +++ branches/weighted-transfer/apertium-weights-learner/README.md (revision 72317) @@ -1,6 +1,6 @@ # apertium-weights-learner -This is a python3 script that can be used for transfer weights training (see http://wiki.apertium.org/wiki/Ideas_for_Google_Summer_of_Code/Weighted_transfer_rules). +This is a python3 script that can be used for transfer weights training (see http://wiki.apertium.org/wiki/Ideas_for_Google_Summer_of_Code/Weighted_transfer_rules). For now, it only allows for fully lexicalized patterns to be extracted (i.e., a sequence of tokens with lemmas and full sets of tags). ## Prerequisites To run this version of transfer weights training for a given language pair, you need: @@ -9,9 +9,6 @@ * language pair of interest with ambiguous rules marked with ids (for an example, see the version of en-es pair from https://svn.code.sf.net/p/apertium/svn/branches/weighted-transfer/) * kenlm (https://kheafield.com/code/kenlm/) -## Get the corpora -The corpora can be obtained from http://www.statmt.org/wmt12/translation-task.html - ## Prepare language model In order to run the training, you need to make a language model for your target language. 
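[Editor's note: the language-model step in the README above boils down to two kenlm commands (estimate an ARPA model, then binarize it). A minimal sketch of how one might script that step; the corpus and model paths are placeholder assumptions, and exact `lmplz` options depend on your kenlm build:]

```python
import shlex

def kenlm_build_commands(corpus_path, model_path, order=5):
    """Return the two shell commands conventionally used with kenlm:
    lmplz estimates an n-gram ARPA model from a tokenized corpus, and
    build_binary converts it to a binary file for fast loading.
    Paths are shell-quoted for safe copy-pasting."""
    arpa_path = model_path + '.arpa'
    estimate = 'lmplz -o {} < {} > {}'.format(
        order, shlex.quote(corpus_path), shlex.quote(arpa_path))
    binarize = 'build_binary {} {}'.format(
        shlex.quote(arpa_path), shlex.quote(model_path))
    return estimate, binarize
```

[Run the two returned commands in order, e.g. from a shell or via subprocess; order 5 is a common default for word-level models.]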
@@ -54,7 +51,7 @@ In order to ensure that everything works fine, you may perform a sample run using prepared corpus: * Download and unpack all parts (years 2007 to 2011) of Spanish news crawl corpora from http://www.statmt.org/wmt12/translation-task.html -* Concatenate them using glue.py script from tools folder (i.e., assuming you are in apertium-weights-learner folder): +* Concatenate them using the glue.py script from the 'tools' folder (i.e., assuming you are in the apertium-weights-learner folder): ``` tools/glue.py path/to/folder/where/corpus/parts/are glued_corpus ``` @@ -121,10 +118,13 @@ This would mean that '-ns' versions of both rules are preferred for each pattern, which tells the transfer module that the translations of 'new' and 'software' should not be swapped (as specified in '-ns' versions of both rules), since in Spanish the adjective 'nuevo' is usually put before the noun as opposed to the fact that most adjectives are put after the noun. ## Pruning -You can also prune the obtained weights file with prune.py script from tools folder. Pruning is a process of eliminating redundant weighted patterns, i.e.: +You can also prune the obtained weights file with the prune.py script from the 'tools' folder. Pruning is the process of eliminating redundant weighted patterns, i.e.: For each rule group: for each pattern that is present in more than one rule: * keep only the entry in the rule with the highest weight, and set the weight to 1 * if the rule with the entry with weight = 1 happens to be the default (first) rule, remove that entry from the weights file altogether, since it will be the rule applied anyway. -The idea behind the pruning process is that in fact, we only want to weight exceptions from the default rule. Pruning doesn't offer any speed advantages with the current realization but might be useful in the future. +The idea behind the pruning process is that, in fact, we only want to weight exceptions from the default rule. 
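[Editor's note: the pruning procedure described in the README hunk above can be sketched in a few lines of Python. This is a toy model of the logic, not the actual prune.py (which operates on w1x files): each rule group is represented as a list of pattern-to-weight dicts, with the default rule first.]

```python
def prune_group(group):
    """Prune one rule group. group is a list of dicts mapping a
    pattern to its weight; group[0] is the default rule.
    Returns a pruned group of the same shape."""
    pruned = [dict() for _ in group]
    # every pattern that occurs anywhere in the group
    all_patterns = set().union(*group) if group else set()
    for pattern in all_patterns:
        holders = [i for i, rule in enumerate(group) if pattern in rule]
        if len(holders) == 1:
            # pattern occurs in a single rule: keep the entry as-is
            pruned[holders[0]][pattern] = group[holders[0]][pattern]
        else:
            # keep only the highest-weighted entry, with weight 1;
            # drop it entirely if the default (first) rule wins,
            # since that rule is applied anyway
            best = max(holders, key=lambda i: group[i][pattern])
            if best != 0:
                pruned[best][pattern] = 1.0
    return pruned
```

[For example, if the default rule weights pattern 'b' at 0.2 and a second rule weights it at 0.8, only the second rule keeps 'b' (with weight 1), while a pattern won by the default rule disappears from the weights file.]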
A pruned weights file doesn't offer any significant speed advantage with the current implementation, but it still reduces the memory footprint at translation time, which allows learning weights from bigger corpora. + +## Testing +Once the weights are obtained, their impact can be tested on a parallel corpus using the 'weights-test.sh' script from the 'testing' folder, which contains a simple config akin to the one in the weights learning script. Index: branches/weighted-transfer/apertium-weights-learner/testing/bleu_test.py =================================================================== --- branches/weighted-transfer/apertium-weights-learner/testing/bleu_test.py (nonexistent) +++ branches/weighted-transfer/apertium-weights-learner/testing/bleu_test.py (revision 72317) @@ -0,0 +1,35 @@ +#! /usr/bin/python3 + +import re, sys +from nltk import bleu_score + +word_re = re.compile(r'\w+') + +def prepare_corpus(fname, ref=False): + text = [] + with open(fname, 'r', encoding='utf-8') as ifile: + for line in ifile: + if ref: + text.append([word_re.findall(line)]) + else: + text.append(word_re.findall(line)) + return text + +if __name__ == "__main__": + if len(sys.argv) != 4: + print("Usage: ./bleu_test.py REFERENCE UNWEIGHTED_TRANSLATION WEIGHTED_TRANSLATION") + sys.exit(1) + + ref_corpus = prepare_corpus(sys.argv[1], ref=True) + unw_corpus = prepare_corpus(sys.argv[2]) + wei_corpus = prepare_corpus(sys.argv[3]) + + print("\nCorpus BLEU") + + print("Unweighted:", bleu_score.corpus_bleu(ref_corpus, unw_corpus)) + print("Weighted:", bleu_score.corpus_bleu(ref_corpus, wei_corpus)) + + print("\nAverage sentence BLEU") + + print("Unweighted:", sum(bleu_score.sentence_bleu(ref, hyp) for ref, hyp in zip(ref_corpus, unw_corpus) if ref != [[]] and hyp != []) / len(ref_corpus)) + print("Weighted:", sum(bleu_score.sentence_bleu(ref, hyp) for ref, hyp in zip(ref_corpus, wei_corpus) if ref != [[]] and hyp != []) / len(ref_corpus)) Property changes on: 
branches/weighted-transfer/apertium-weights-learner/testing/bleu_test.py ___________________________________________________________________ Added: svn:executable ## -0,0 +1 ## +* \ No newline at end of property Index: branches/weighted-transfer/apertium-weights-learner/testing/weights-test.sh =================================================================== --- branches/weighted-transfer/apertium-weights-learner/testing/weights-test.sh (nonexistent) +++ branches/weighted-transfer/apertium-weights-learner/testing/weights-test.sh (revision 72317) @@ -0,0 +1,21 @@ +#! /bin/sh + +PAIR_FOLDER="../../apertium-en-es" +PAIR_PREFIX=$PAIR_FOLDER/apertium-en-es.en-es +WEIGHTS_FILE=$PAIR_FOLDER/2007-en-30000-rule-weights-prunned.w1x +PAIR_NAME="en-es" +SOURCE_CORPUS="nc-v7-100000.es-en.en" +REFERENCE_CORPUS="nc-v7-100000.es-en.es" +UNWEIGHTED_OUTPUT="nc-v7-100000.es-en.en.es.unweighted" +WEIGHTED_OUTPUT="nc-v7-100000.es-en.en.es.weighted" +SCRIPTS_FOLDER="SCRIPTS" + +echo "Translating unweighted" + +time apertium -d $PAIR_FOLDER/ $PAIR_NAME-tagger $SOURCE_CORPUS | apertium-pretransfer | lt-proc -b $PAIR_FOLDER/$PAIR_NAME.autobil.bin | apertium-transfer -b $PAIR_PREFIX.t1x $PAIR_FOLDER/$PAIR_NAME.t1x.bin 2>unweighted.log | apertium-interchunk $PAIR_PREFIX.t2x $PAIR_FOLDER/$PAIR_NAME.t2x.bin | apertium-postchunk $PAIR_PREFIX.t3x $PAIR_FOLDER/$PAIR_NAME.t3x.bin | lt-proc -g $PAIR_FOLDER/$PAIR_NAME.autogen.bin | apertium-retxt | sed 's/[*#@~]//g' > $UNWEIGHTED_OUTPUT + +echo "\nTranslating weighted" + +time apertium -d $PAIR_FOLDER/ $PAIR_NAME-tagger $SOURCE_CORPUS | apertium-pretransfer | lt-proc -b $PAIR_FOLDER/$PAIR_NAME.autobil.bin | apertium-transfer -bw $WEIGHTS_FILE $PAIR_PREFIX.t1x $PAIR_FOLDER/$PAIR_NAME.t1x.bin 2>weighted.log | apertium-interchunk $PAIR_PREFIX.t2x $PAIR_FOLDER/$PAIR_NAME.t2x.bin | apertium-postchunk $PAIR_PREFIX.t3x $PAIR_FOLDER/$PAIR_NAME.t3x.bin | lt-proc -g $PAIR_FOLDER/$PAIR_NAME.autogen.bin | apertium-retxt | sed 's/[*#@~]//g' > $WEIGHTED_OUTPUT 
+ +python3 bleu_test.py $REFERENCE_CORPUS $UNWEIGHTED_OUTPUT $WEIGHTED_OUTPUT Property changes on: branches/weighted-transfer/apertium-weights-learner/testing/weights-test.sh ___________________________________________________________________ Added: svn:executable ## -0,0 +1 ## +* \ No newline at end of property Index: branches/weighted-transfer/apertium-weights-learner/tools/coverage.py =================================================================== --- branches/weighted-transfer/apertium-weights-learner/tools/coverage.py (revision 72315) +++ branches/weighted-transfer/apertium-weights-learner/tools/coverage.py (revision 72317) @@ -69,9 +69,9 @@ root = transtree.getroot() cat_dict = {} for def_cat in root.find('section-def-cats').findall('def-cat'): - for cat_item in def_cat.findall('cat-item'): # make a regex line to recognize lemma-tag pattern - re_line = cat_item_to_re(cat_item) + re_line = '|'.join(cat_item_to_re(cat_item) + for cat_item in def_cat.findall('cat-item')) # add empty category list if there is none cat_dict.setdefault(re_line, []) # add category to the list Index: branches/weighted-transfer/apertium-weights-learner/twlconfig.py =================================================================== --- branches/weighted-transfer/apertium-weights-learner/twlconfig.py (revision 72315) +++ branches/weighted-transfer/apertium-weights-learner/twlconfig.py (revision 72317) @@ -5,9 +5,7 @@ #mode = "parallel" # full path to source language corpus from which to learn the rules -#source_language_corpus = "/home/nm/source/apertium/weighted-transfer/apertium-weights-learner/data/2007-en-100.txt" -source_language_corpus = "/home/nm/source/apertium/weighted-transfer/apertium-weights-learner/data/new-software-sample.txt" -#source_language_corpus = "/home/nm/source/apertium/weighted-transfer/apertium-weights-learner/data/nc-v7.es-en.en.100.txt" +source_language_corpus = "/home/nm/source/apertium/weighted-transfer/apertium-weights-learner/data/2007-en-100000.txt" 
# full path to target language corpus (only for parallel mode) #target_language_corpus = "/home/nm/source/apertium/weighted-transfer/apertium-weights-learner/data/nc-v7.es-en.es.100.txt" Index: branches/weighted-transfer/apertium-weights-learner/twlearner.py =================================================================== --- branches/weighted-transfer/apertium-weights-learner/twlearner.py (revision 72315) +++ branches/weighted-transfer/apertium-weights-learner/twlearner.py (revision 72317) @@ -156,7 +156,7 @@ ambiguous_rules, rule_id_map, translator, weighted_translator, ofile) lines_count += 1 - if lines_count % 100 == 0: + if lines_count % 1000 == 0: print('\n{} total lines\n{} total sentences'.format(lines_count, total_sents_count)) print('{} ambiguous sentences\n{} ambiguous chunks'.format(ambig_sents_count, ambig_chunks_count)) print('{} botched coverages\nanother {:.4f} elapsed'.format(botched_coverages, clock() - lbtime)) @@ -455,7 +455,7 @@ pass lines_count += 1 - if lines_count % 100 == 0: + if lines_count % 1000 == 0: print('\n{} total lines\n{} ambiguous chunks'.format(lines_count, ambig_chunks_count)) print('{} botched coverages\nanother {:.4f} elapsed'.format(botched_coverages, clock() - lbtime)) gc.collect()
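[Editor's note: the coverage.py change in this revision replaces one dictionary entry per cat-item with a single alternation regex per def-cat, so a category with several cat-items gets one combined key in cat_dict. A minimal illustration of the effect; the item regexes below are invented stand-ins for what cat_item_to_re would produce:]

```python
import re

# hypothetical regexes for two cat-items of a single def-cat
item_res = [r'\w+<adj>', r'\w+<adj><sint>']

# one alternation line per def-cat, as coverage.py now builds it
re_line = '|'.join(item_res)

# both lemma-tag shapes now map to the same category entry
cat_dict = {}
cat_dict.setdefault(re_line, []).append('adj')
```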