Index: branches/weighted-transfer/apertium-weights-learner/README.md
===================================================================
--- branches/weighted-transfer/apertium-weights-learner/README.md	(revision 72158)
+++ branches/weighted-transfer/apertium-weights-learner/README.md	(revision 72159)
@@ -5,7 +5,8 @@
 ## Prerequisites
 To run this version of transfer weights training for a given language pair, you need:
 * source and target language corpora (they don't have to be parallel to each other)
-* apertium with the language pair of interest
+* apertium with apertium-transfer modified to use transfer weights (may be checked out from https://svn.code.sf.net/p/apertium/svn/branches/weighted-transfer/)
+* the language pair of interest with ambiguous rules marked with ids (for an example, see the version of the en-es pair at https://svn.code.sf.net/p/apertium/svn/branches/weighted-transfer/)
 * kenlm (https://kheafield.com/code/kenlm/)
 
 ## Get the corpora
@@ -12,7 +13,7 @@
 The corpora can be obtained from http://www.statmt.org/wmt12/translation-task.html
 
 ## Prepare language model
-You need to make a language model for your target language.
+In order to run the training, you need to make a language model for your target language.
 * First, take a big corpus, tokenize and normalize it using tools/simpletok.py script:
 ```
@@ -28,16 +29,83 @@
 ```
 bin/lmplz -o 5 -T FOLDER_FOR_TMP_FILE MODEL_NAME.arpa
 ```
-Be advised that you might need disk space for your language model rougly 15 times of the corpus volume.
+Be advised that you might need disk space roughly 15 times the corpus volume for your language model. You can also use gz to compress the model file during its creation, reducing its size:
+```
+bin/lmplz -o 5 -T FOLDER_FOR_TMP_FILE MODEL_NAME.arpa.gz
+```
 * It is highly recommended that you compile a binary after that as it works significantly faster:
 ```
 bin/build_binary -T FOLDER_FOR_TMP_FILE MODEL_NAME.arpa MODEL_NAME.mmap
 ```
-Be advised that you might need disk space for your language model rougly half of the arpa file volume.
+or, if you used gz in the previous step:
+```
+bin/build_binary -T FOLDER_FOR_TMP_FILE MODEL_NAME.arpa.gz MODEL_NAME.mmap
+```
+Be advised that you might need additional disk space roughly half the arpa file volume for your binary.
 
 ## Run training
-* Edit configuration file twlconfig.py, which is self-explanatory.
+* Edit the configuration file twlconfig.py, which is (hopefully) self-explanatory.
 * Run training script:
 ```
 ./twlearner.py
 ```
+
+## Sample run
+In order to ensure that everything works fine, you may perform a sample run using a prepared corpus:
+
+* Download all parts (years 2007 to 2011) of the Spanish news crawl corpora from http://www.statmt.org/wmt12/translation-task.html
+* Concatenate them using the glue.py script from the tools folder.
+* Train a language model on the resulting corpus.
+* Check out the en-es pair from https://svn.code.sf.net/p/apertium/svn/branches/weighted-transfer/
+* Run weights training on the new-software-sample.txt file located in the data folder, using the en-es pair.
+
+The sample file new-software-sample.txt contains three selected lines with 'new software' and 'this new software' patterns, each of which triggers a pair of ambiguous rules from the apertium-en-es.en-es.t1x file, namely ['adj-nom', 'adj-nom-ns'] and ['det-adj-nom', 'det-adj-nom-ns']. Speaking informally, these rules are used to transfer sequences of (adjective, noun) and (determiner, adjective, noun). The first rule in each ambiguous pair specifies that the translations of the adjective and the noun are to be swapped, which is usual for Spanish, hence these rules are specified before their '-ns' counterparts, indicating that they are the default rules. The second rule in each ambiguous pair specifies that the translations of the adjective and the noun are not to be swapped, which sometimes happens and depends on the lexical units involved.
+
+The contents of the unpruned w1x file should look like the following:
+```
+<!-- XML not shown: two rule-groups, one for adj-nom / adj-nom-ns and one
+     for det-adj-nom / det-adj-nom-ns, each listing the sample patterns,
+     with the '-ns' rule of each pair carrying the larger weight -->
+```
+
+This would mean that the '-ns' versions of both rules are preferred for each pattern, telling the transfer module that the translations of 'new' and 'software' should not be swapped, since in Spanish the adjective 'nuevo' usually precedes the noun, unlike most adjectives, which follow it.
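As a sanity check of the language model step above, the sketch below shows the kind of comparison the learned weights encode: the two outputs of an ambiguous rule pair are scored against the kenlm model, and the scores are normalized into weights. This is a minimal sketch assuming the kenlm Python module is installed; the model path and the two Spanish variants are illustrative placeholders, not output of an actual run.

```python
# Minimal sketch: score both translations of 'the new software' with a
# kenlm model and turn the log10 scores into normalized rule weights.
# MODEL_NAME.mmap and both variant strings are placeholders.
import kenlm

model = kenlm.Model('MODEL_NAME.mmap')  # binary built by build_binary

variants = {
    'adj-nom': 'el software nuevo',     # adjective and noun swapped
    'adj-nom-ns': 'el nuevo software',  # not swapped
}

# kenlm returns log10 probabilities, so exponentiate before normalizing
scores = {rule: 10.0 ** model.score(text, bos=True, eos=True)
          for rule, text in variants.items()}
total = sum(scores.values())

for rule, score in scores.items():
    print(rule, score / total)
```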
Index: branches/weighted-transfer/apertium-weights-learner/coverage.py
===================================================================
--- branches/weighted-transfer/apertium-weights-learner/coverage.py	(revision 72158)
+++ branches/weighted-transfer/apertium-weights-learner/coverage.py	(revision 72159)
@@ -217,11 +217,13 @@
 
         # try to continue each coverage obtained on the previous step
         for coverage, state in zip(coverage_list, state_list):
+            # first, check if we can go further along the current pattern
             if (state, cat) in self.transitions:
                 # current pattern can be made longer: add one more token
                 new_coverage_list.append(coverage + [('w', token)])
                 new_state_list.append(self.transitions[(state, cat)])
+            # if not, check if we can finalize the current pattern
             elif state in self.final_states:
                 # current state is one of the final states: close previous pattern
                 new_coverage = coverage + [('r', self.final_states[state])]
@@ -232,12 +234,13 @@
                     new_state_list.append(self.transitions[(self.start_state, cat)])
                 elif '*' in token:
                     # can not start new pattern because of an unknown word
-                    new_coverage_list.append(new_coverage + [('w', token), ('r', -1)])
+                    new_coverage_list.append(new_coverage + [('w', token), ('r', 'unknown')])
                     new_state_list.append(self.start_state)
+            # if not, check if it is just an unknown word
             elif state == self.start_state and '*' in token:
                 # unknown word at start state: add it to pattern, start new
-                new_coverage_list.append(coverage + [('w', token), ('r', -1)])
+                new_coverage_list.append(coverage + [('w', token), ('r', 'unknown')])
                 new_state_list.append(self.start_state)
             # if nothing worked, just discard this coverage
@@ -250,7 +253,7 @@
             if state in self.final_states:
                 # current state is one of the final states: close the last pattern
                 new_coverage_list.append(coverage + [('r', self.final_states[state])])
-            elif coverage[-1][0] == 'r':
+            elif coverage != [] and coverage[-1][0] == 'r':
                 # the last pattern is already closed
                 new_coverage_list.append(coverage)
             # if nothing worked, just discard this coverage as incomplete
@@ -272,13 +275,16 @@
                     pattern = []
             formatted_coverage_list.append(formatted_coverage)
 
+        # now we filter out some non-LRLM coverages
+        # that still got into the list
+
         # sort coverages by signature, which is a tuple
         # of coverage part lengths
         formatted_coverage_list.sort(key=signature, reverse=True)
         signature_max = signature(formatted_coverage_list[0])
 
-        # keep only those with top signature in terms of signature
-        # they would be LRLM ones
+        # keep only those with the top signature
+        # they would be the LRLM ones
         LRLM_list = []
         for coverage in formatted_coverage_list:
             if signature(coverage) == signature_max:
@@ -300,7 +306,7 @@
     cat_dict, rules, ambiguous_rules, rule_id_map = prepare('../apertium-en-es/apertium-en-es.en-es.t1x')
     pattern_FST = FST(rules)
 
-    coverages = pattern_FST.get_lrlm('^publish$ ^in$ ^the$ ^journal$ ^of$ ^the$ ^american$ ^medical$ ^association$ ^the$ ^study$ ^track$ ^the$ ^mental$ ^health$ ^of$ ^88,000$ ^army$ ^combat$ ^veteran$ ^by$ ^compare$ ^their$ ^response$ ^in$ ^a$ ^mental$ ^health$ ^questionnaire$ ^fill$ ^out$ ^upon$ ^their$ ^return$ ^home$ ^with$ ^a$ ^second$ ^mental$ ^health$ ^screening$ ^three$ ^to$ ^six$ ^month$ ^later$^.$', cat_dict)
+    coverages = pattern_FST.get_lrlm('^prpers$ ^want# to$ ^wait$ ^until$ ^prpers$ ^can$ ^offer$ ^what$ ^would$ ^be$ ^totally$ ^satisfy$ ^for$ ^consumer$^.$', cat_dict)
 
     print('Coverages detected:')
     for coverage in coverages:
         print(coverage)
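The LRLM filtering in coverage.py above is easier to see on toy data. The following is a simplified re-creation (made-up coverages, not the module's actual data structures) of how sorting by signature, the tuple of pattern lengths from left to right, keeps only the left-to-right longest-match coverages:

```python
# Toy LRLM filtering: a coverage is a list of (tokens, rule) patterns;
# its signature is the tuple of pattern lengths, left to right.
def signature(coverage):
    return tuple(len(tokens) for tokens, rule in coverage)

coverages = [
    [(['this'], 'det'), (['new', 'software'], 'adj-nom')],
    [(['this', 'new', 'software'], 'det-adj-nom')],
]

# tuples compare element-wise, so the coverage whose first pattern is
# longest sorts first; ties are broken by the next pattern, and so on
coverages.sort(key=signature, reverse=True)
signature_max = signature(coverages[0])
lrlm = [c for c in coverages if signature(c) == signature_max]
print(lrlm)  # only the single three-token pattern survives
```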
Index: branches/weighted-transfer/apertium-weights-learner/data/new-software-sample.txt
===================================================================
--- branches/weighted-transfer/apertium-weights-learner/data/new-software-sample.txt	(nonexistent)
+++ branches/weighted-transfer/apertium-weights-learner/data/new-software-sample.txt	(revision 72159)
@@ -0,0 +1,3 @@
+Mr Stephen said the council had agreed to consider new software which would make the test more difficult.
+What's Next: Simonyi's new software writes its own code
+This new software makes it easier to get a movie done quickly, though harder to get it done well.
Index: branches/weighted-transfer/apertium-weights-learner/tools/simpletok.py
===================================================================
--- branches/weighted-transfer/apertium-weights-learner/tools/simpletok.py	(revision 72158)
+++ branches/weighted-transfer/apertium-weights-learner/tools/simpletok.py	(revision 72159)
@@ -4,7 +4,7 @@
 
 beforepunc_re = re.compile(r'([¿("/])(\w)')
 afterpunc_re = re.compile(r'(\w)([;:,.!?)"/—])')
-quot_re = re.compile("[«»']")
+quot_re = re.compile("[«»`'“”„‘’‛]")
 numfix_re = re.compile('([0-9]) ([,.:][0-9])')
 beforedash_re = re.compile(r'(\W)-(\w)')
 afterdash_re = re.compile(r'(\w)-(\W)')
Index: branches/weighted-transfer/apertium-weights-learner/twlconfig.py
===================================================================
--- branches/weighted-transfer/apertium-weights-learner/twlconfig.py	(revision 72158)
+++ branches/weighted-transfer/apertium-weights-learner/twlconfig.py	(revision 72159)
@@ -1,7 +1,8 @@
 # full path to source corpus from which to learn the rules
-source_corpus = "/home/nm/source/apertium/weighted-transfer/apertium-weights-learner/data/2007-100-special.txt"
+#source_corpus = "/home/nm/source/apertium/weighted-transfer/apertium-weights-learner/data/2007-100-special.txt"
+source_corpus = "/home/nm/source/apertium/weighted-transfer/apertium-weights-learner/data/new-software-sample.txt"
 
-# name of apertium pair (not direction)
+# name of apertium language pair (not translation direction)
 apertium_pair_name = "en-es"
 
 # full path to apertium language pair data folder
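The widened quot_re appears both in simpletok.py above and in twlearner.py below; its effect is easiest to check interactively. A self-contained comparison with a made-up sample line:

```python
# The widened class also strips backticks, curly quotes and low-9 quotes,
# which otherwise end up as stray tokens in text scored against the LM.
import re

old_quot_re = re.compile("[«»']")
new_quot_re = re.compile("[«»`'“”„‘’‛]")

line = '„This“ new software, it’s `quick'
print(old_quot_re.sub('', line))  # curly quotes and backtick survive
print(new_quot_re.sub('', line))  # -> This new software, its quick
```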
Index: branches/weighted-transfer/apertium-weights-learner/twlearner.py
===================================================================
--- branches/weighted-transfer/apertium-weights-learner/twlearner.py	(revision 72158)
+++ branches/weighted-transfer/apertium-weights-learner/twlearner.py	(revision 72159)
@@ -15,7 +15,7 @@
 tmpweights_fname = 'tmpweights.w1x'
 
 # regular expression to cut out a sentence
-sent_re = re.compile('.*?\$')
+sent_re = re.compile('.*?\$|.+?$')
 
 # anything between $ and ^
 inter_re = re.compile(r'\$.*?\^')
@@ -31,7 +31,7 @@
 # for scoring against language model
 beforepunc_re = re.compile(r'([¿("/])(\w)')
 afterpunc_re = re.compile(r'(\w)([;:,.!?)"/—])')
-quot_re = re.compile("[«»']")
+quot_re = re.compile("[«»`'“”„‘’‛]")
 numfix_re = re.compile('([0-9]) ([,.:][0-9])')
 beforedash_re = re.compile(r'(\W)-(\w)')
 afterdash_re = re.compile(r'(\w)-(\W)')
@@ -144,6 +144,7 @@
 
     # look at each sentence in line
     for sent_match in sent_re.finditer(line.strip()):
+        if sent_match.group(0) != '':
             total_sents_count += 1
 
             # get coverages
@@ -150,16 +151,16 @@
             coverage_list = pattern_FST.get_lrlm(sent_match.group(0), cat_dict)
             if coverage_list == []:
                 botched_coverages += 1
-                print('Botched coverage:', sent_match.group(0))
-                print()
+                #print('Botched coverage:', sent_match.group(0))
+                #print()
             else:
                 # look for ambiguous chunks
                 coverage_item = coverage_list[0]
                 pattern_list = search_ambiguous(ambiguous_rules, coverage_item)
                 if pattern_list != []:
-                    print('Coverage:', coverage_item)
-                    print('Pattern list:', pattern_list)
-                    print()
+                    #print('Coverage:', coverage_item)
+                    #print('Pattern list:', pattern_list)
+                    #print()
                     ambig_sents_count += 1
                     # segment the sentence into parts each containing one ambiguous chunk
                     sentence_segments, prev = [], 0
@@ -314,16 +315,20 @@
     # read and process other lines
     for line in ifile:
         group_number, rule_number, pattern, weight = line.rstrip('\n').split('\t')
-        if pattern != prev_pattern:
-            # pattern changed, flush previous
+        if group_number != prev_group_number:
+            # rule group changed, flush pattern, close previous group, open new
             ofile.write(pattern_to_xml(apertium_token_re.findall(prev_pattern), total_pattern_weight))
             total_pattern_weight = 0.
-        if group_number != prev_group_number:
-            # rule group changed, close previuos, open new
             ofile.write('    </rule>\n  </rule-group>\n  <rule-group>\n    <rule id="{}">\n'.format(rule_map[rule_number]))
         elif rule_number != prev_rule_number:
-            # rule changed, close previuos, open new
+            # rule changed, flush pattern, close previous rule, open new
+            ofile.write(pattern_to_xml(apertium_token_re.findall(prev_pattern), total_pattern_weight))
+            total_pattern_weight = 0.
             ofile.write('    </rule>\n    <rule id="{}">\n'.format(rule_map[rule_number]))
+        elif pattern != prev_pattern:
+            # pattern changed, flush previous
+            ofile.write(pattern_to_xml(apertium_token_re.findall(prev_pattern), total_pattern_weight))
+            total_pattern_weight = 0.
         # add up rule-pattern weights
         total_pattern_weight += float(weight)
         prev_group_number, prev_rule_number, prev_pattern = group_number, rule_number, pattern
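The reordered branches in the last hunk fix a flushing bug: the accumulated pattern weight must be written out whenever the group, the rule, or the pattern changes, whereas previously a rule change with an identical pattern left the weight unflushed. A toy re-creation of the fixed control flow, with made-up rows and plain printing instead of the XML output:

```python
# Rows of (group, rule, pattern, weight), as in the sorted weights file.
rows = [
    ('1', 'adj-nom',    'new software', 0.3),
    ('1', 'adj-nom',    'new software', 0.1),
    ('1', 'adj-nom-ns', 'new software', 0.6),  # rule changes, same pattern
]

prev_key = None
total_pattern_weight = 0.0
for group, rule, pattern, weight in rows:
    key = (group, rule, pattern)
    if prev_key is not None and key != prev_key:
        # group, rule or pattern changed: flush the accumulated weight
        print(prev_key, total_pattern_weight)
        total_pattern_weight = 0.0
    total_pattern_weight += weight
    prev_key = key
print(prev_key, total_pattern_weight)  # flush the last pattern
```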