TODO:

optimization:
- reduce_lexical_lattice() shows up on the profile at 2% for ERG parsing; for GG it may be much worse, given the higher level of lexical ambiguity. the algorithm is O(n^2) in the number of lexemes activated, and calls dg_path() n^2 times. we only need to call it n times; that is worth fixing.

fully automated regression testing; two cases:
1. changes to ACE
   - compare ACE results before to ACE results after
   - profile for before, profile for after
   - expect no diffs, except when changes dictate
   - should be able to automatically run for all kinds of test suites
2. changes to a grammar
   - in general, not expected to turn up any new ACE bugs, but might turn up an old ACE bug. finding that is hard.
   - compare ACE to PET and ACE to LKB before, write down diffs
   - compare ACE to PET and ACE to LKB *after*, write down diffs
   - compare the "before" diffs to the "after" diffs

case 1 above we should be able to fully automate. may as well look at both parsing and generation:
- ERG: mrs, csli, hike, maybe something else too
- HaG: hausa.items
- GG: mrs, babel

berthold wants:
- possibly commandline access to ACE QC generator
- add post-generation token mapping phase (and presumably an inverse REPP phase)
  + started playing with this with ERG. I can extract a lattice of lexemes from a generation result and apply a suite of lattice-mapping rules to it (my test case was affixing possessive markers: John 's -> John's)
  - should also be able to do a rule to uppercase the first word in a sentence...
  - and a rule to join words when the first ends in a hyphen
  - how about changing underscores to spaces? best done in a REPP phase?
- add post-parsing and post-generation transfer phase (like fixup)
- ace as a generation server a-la-LKB
+ ace parses 34 readings vs pet/lkb 20 for GG mrs item 95: a MWE aligned two different ways... second token generic vs second token native (not blocked by grammar)
- *updating* an ace tsdb profile -- doesn't work, apparently?
- LUI support
  - generator trees
  - full chart
  - partial chart
  - simple MRS

ideas:
- automatic time profile of a grammar
  parsing: pretty good, but add
    - orthographemic analysis
    - each rule for lexical parsing
    - MRS extraction
    - idiom checking
  generation:
    + fixup rule application
    + each rule
    - semantic index lookup
    - trigger rule application
    - each rule
    - main generation
    - unpacking
    - MRS extraction
    - subsumption checking
    - idiom checking?

DONE:

unreleased 0.9.4pre1
- bug fixes in fixup
- moving towards configurability for transfer-only grammars

0.9.3
- basic version of output-enabled transfer rules, used for mrs fixup for generation
- changed post-generation subsumption test to operate on *external* MRSes

0.9.3pre2
- bigger TDL buffer: JaCY's QC skeleton needed it
- allow lexemes to have STEM parts that aren't strings

0.9.3pre1
- better parsing of MRS string constants
- config option to specify whether LTOP is extracted or invented
- fix QC-from-instance loader
- changes that speed up lattice mapping a lot in some cases
- profiling mode "-i", for parsing
- improved filtering of orthographemic rule chain hypotheses

0.9.2
- fix TDL reader/dagify to not conflate coreferences with the same name in different :+ addenda
- fix a unicode bug in token mapping
- fixed an obscure bug wherein GLB types could have incorrect constraints
- don't load token mapping rules when token mapping is disabled
- support more than 256 features
- support ^ and $ in token mapping positional constraints
- support spaces showing up in more unexpected places in TDL syntax
- the path to the label within parse-node instances is now configurable (LNAME feature)
- semantic indexing now pays attention to the lex-rels-path and rule-rels-path configurations
- quickcheck can be loaded from a PET QC instance
- new configuration options for limiting the number of orthographemic rules to apply
- new configuration option to specify how much room to preallocate in the freezer.
- new configuration option to specify what file[s] to load irregular forms from
- new configuration option to specify a suffix for rule names given in irregulars tables

0.9.1
- remove dependence on #include
- change ~/logon/ to ${LOGONROOT} in Makefile

0.9
-----

potentially use POS tagger to prune lexical ambiguity
- but POS tagger makes mistakes...

preprocessor characterization problem:
  ... hole for 'n'
  ... hole for '''
  ... hole for 't'
  ... lost 'n'=>4
  filling hole 't' with 'n'
  ... lost '''=>5
  ... lost 't'=>6
  `(.) +n't' -> `\1n't' yields ` I don't know. '

have a unicode.c file with wide-char and mbs routines in it

+ debug: I believe there are 15 747 724 136 275 002 577 605 653 961 181 555 468 044 717 914 527 116 709 366 231 425 076 185 631 031 296 296 protons in the universe and the same number of electrons.
  + currently, that results in a "too much RAM" error. that seems reasonable.

delete deleted daughters from daughters of rule, since they can't ever be present in unified-in daughters

figure out why unpacking is so much slower than PET (sometimes 10x slower!)
- mrs extraction is the big bottleneck
  - approach one: memoization; most bits and pieces of MRS are reused many times
  - tried this on mrs_var's with good results; should also do ep's and hcon's
- type names perhaps should be pointers to types, since much time is spent looking up and comparing types
- maybe some other aspects of unpacking are slow? not clear yet.

learn something from head-corner strategy

[x] freeze qc settings into grammar file
  - cool: we compile it as a dynamic library at grammar load time, and then dlopen() it at parse time.

ignore punctuation-chars for jacy

try to prove/analyze interesting things at grammar compile time
- *intelligently* auto pick quickcheck paths
  - make quickcheck used for packing ignore packing restrictor
  - and the shameless hack(tm) in qc.c with 0-1-lists
- e.g.
  the INFLECTD feature is monotonic: '-' becomes '+'
  - this would help rule out applying inflectional rules like non3sg_fin_verb when 3sg_fin_verb is on the orth-agenda
  - might be lots of other such features we could find
- auto pick packing restrictor
- for each type of rule, precompute segments of the rule which are non-reentrant with ARGS[k]
  - then on filling in ARGS[k], copy() can know to structure-share without looking at those segments

optimize lexicon storage; maybe use provided lexdb schema
maybe support lexdb

make STEM actually be updated by ortho rules; apparently some grammars depend on it.

[x] make unpacking support 2+-ary
  - the latest ERG versions have several 4-ary rules and one 5-ary!

loading: use an instance to define configuration paths?

worthwhile experiment: check to be sure we're actually doing maxent correctly!
- we weren't, quite: we were ignoring scores for surface strings
- we get the gold tree for CSLI 79.2% of the time now
- we get the gold tree for the first 70 trees in WS01 67.1% of the time
- looks like it's working properly.