BrainStorm

This is a working list of open tasks, a wishlist, a set of feature requests, or just a place to collect stuff and ideas.

  • CorpusWork: Development of WeightedConstraintDependencyGrammars in order to parse large-scale corpora like the PennTreebank and the NegraCorpus, as well as special-purpose grammars for partial parsing and information retrieval. WeightedConstraintDependencyGrammars should be developed in a reusable way, that is, by differentiating the common and the application-specific parts of a grammar. To achieve this, grammars have to be structured in a new way that allows the needed parts of one grammar to be included in another. Grammars form a dependency hierarchy starting from language-specific roots, such as a German and an English grammar, which collect and establish the basic needs and settings for derived grammars. As a consequence, all derived grammars use the same tagset, e.g. the German SttsTagset. So far the following cdg grammars exist or are planned:
    • StellingenGrammar: a small test corpus in the domain of Verbmobil using an STTS-like tagset and a rather restricted set of dependency labels. Work on that grammar has basically stopped and will be continued in the HeiseGrammar and the NegraGrammar, which shall both be based on a DeutschGrammar.
    • HeiseGrammar: a corpus intended to implement partial parsing and information retrieval, derived from the StellingenGrammar, with the tagset reworked to use the official SttsTagset. The HeiseGrammar will serve as a testbed for how far a DeutschGrammar can be extracted. This grammar shall be used for information retrieval from German newspaper articles from http://www.heise.de. Because the source is plain text, corpus annotation is a bottleneck. However, programming work on the cdg system seems to be needed in order to allow a SkimmingParser, that is, building an easy cdgp first which might be reused for a deeper analysis. Skimming might be realized by implementing a two-stage grammar consisting of a sketchy mode and a deep mode. How far the parser reuses data is an open question.
    • NegraGrammar: a large-scale German newspaper corpus offering syntactically annotated data. Dependency annotations have to be generated from phrase structures. The kind of data has to be investigated urgently in order to find a common root for the NegraGrammar and the HeiseGrammar. The latter still uses the rather restricted set of dependency labels, which might not be enough for the NegraGrammar to cover the annotated material in an accurate and comparable way. So this undertaking might heavily influence the development of the HeiseGrammar.
    • PennGrammar: a large-scale English corpus, the corpus for publications and comparability with other linguistic work.
  • GrammarModelling: there are some general issues around grammar modelling besides the already mentioned point of
    • GrammarReusage
    • modelling suboptimal solutions: it might be interesting for several reasons to support the second-best solutions found as well, and even to consider the landscape of the complete search space. In general the grammar writer is not interested in suboptimal structures, and one might argue that they never should be. The problem here is that transformational solution methods like frobbing and gls want to provide an anytime property, that is, to offer the best solution found so far when interrupted. But among suboptimal solution candidates, the penalty score and the accuracy are two measures that are not orthogonal most of the time. So while the transformation solver tries to optimize the penalty, the accuracy might vary.
    • AuxiliaryLevels:
      • Can mirroring dependency edges be kept in sync in an automated way, so that no extensive propagation/search is needed? (investigated)
      • Does the mirror property have a status comparable to the tree property?
      • What happens when OBL1 is switched off?
      • Idea: enrich level-features with something like "mirrorlevel=SYN mirrorlabel=C1" for the OBL1 level. This would couple the OBL1-non-root edges to the corresponding reverse edge labelled C1 on the SYN level (done)
    • Wcdg language extensions:
      1. what about an if-then-else
      2. what about an existence constraint (done)
      3. what about accessing information on connected parts of the dependency tree?
  • SystemCoding
    • RegularExpressions: templates might be specified using regexps as a word form. Basically this works. Further use has to be tested (done)
    • transformational solvers (gls, frobbing) could easily provide all ambiguous solutions of the same quality
    • TestCases: large parts of the system (arc consistency, pruning, isearch, backmarking, ...) are out of order or untested to some degree, so at some point we should decide whether we want to test, fix or deprecate them
    • YadaTracker: yada should be established as a unified evaluation.
    • XcdgCoding:
      1. allow to display cyclic structures (done)
      2. add a zoom functionality in order to display and edit large structures (done)
    • GlsTermination: enhance the termination behaviour:
      1. enhance phase detection
      2. ignore high costs in the init phase (done)
      3. only compute state utility in termination phase
      4. deprecate the current number one termination criterion "costs" in favour of a better state utility measure (see above)
    • SkimmingParsing
    • IncrementalGuidedLocalSearch
    • ChunkerIntegration
    • GermaNetIntegration
    • WordNetIntegration
    • Parallelity: a framework to integrate different solution methods cooperatively, using pvm
    • ternary constraints: make propagation come true
    • Frobbing: What if we could analyse the source of penalties and switch it off temporarily? Basically, this might help the transformation planner to choose different violations to be attacked next, but it might also raise the penalty of possibly promising solution candidates. Concretely, one might want to switch off the tagger at a certain point.
  • Documentation
    • large parts of the system are still undocumented:
      1. gls: technical docu
      2. compiler: technical docu
      3. yada: technical docu (done)
      4. yada: usermanual
      5. xcdg: technical docu
      6. isearch: technical docu after a rework of that module
    • ease technical documentation using doxygen, à la javadoc
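
The anytime property mentioned under "modelling suboptimal solutions" can be illustrated with a small sketch. It is not the gls or frobbing implementation: the penalty function, the neighbourhood and the gold structure below are all made-up assumptions, chosen only to show a solver that can be interrupted after any step and still return its best solution so far, and to make it possible to compare penalty and accuracy along the way.

```python
from typing import List, Tuple

# A dependency structure as a tuple of heads: heads[i] is the head of
# word i+1, and 0 marks the root. This encoding is an assumption of the sketch.
Heads = Tuple[int, ...]


def penalty(heads: Heads) -> float:
    """Hypothetical toy penalty: self-loops, wrong root count, long edges."""
    roots = sum(1 for h in heads if h == 0)
    loops = sum(1 for i, h in enumerate(heads, start=1) if h == i)
    dist = sum(abs(h - i) for i, h in enumerate(heads, start=1) if h != 0)
    return 1.0 * loops + 0.5 * abs(roots - 1) + 0.1 * dist


def accuracy(heads: Heads, gold: Heads) -> float:
    """Fraction of words whose head agrees with the gold annotation."""
    return sum(h == g for h, g in zip(heads, gold)) / len(gold)


def neighbours(heads: Heads) -> List[Heads]:
    """All structures that differ from `heads` in exactly one head."""
    n = len(heads)
    return [
        heads[:i] + (h,) + heads[i + 1:]
        for i in range(n)
        for h in range(n + 1)
        if h != heads[i]
    ]


def anytime_search(start: Heads, max_steps: int) -> List[Tuple[float, Heads]]:
    """Steepest descent on the penalty.

    Returns the trace of accepted structures. Its last entry is the best
    solution found so far, whatever step budget was given, which is the
    anytime property in miniature.
    """
    current = start
    trace = [(penalty(current), current)]
    for _ in range(max_steps):
        best_next = min(neighbours(current), key=penalty)
        if penalty(best_next) >= penalty(current):
            break  # local optimum with respect to the penalty
        current = best_next
        trace.append((penalty(current), current))
    return trace


GOLD: Heads = (2, 0, 2, 3)  # made-up gold structure for four words
trace = anytime_search((1, 2, 3, 4), max_steps=10)
for cost, heads in trace:
    print(f"penalty={cost:.2f} accuracy={accuracy(heads, GOLD):.2f} heads={heads}")
```

The penalty decreases at every accepted step by construction, while nothing forces the accuracy to move in the same direction. Printing both columns side by side is one simple way to inspect how the two measures relate for a given grammar and solver.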

-- MichaelDaum - 06 May 2002