Dumb Czech Grammar

This page briefly describes a very simple grammar that I tried to develop semi-automatically by collecting statistical information from the PragueDependencyTreebank (PDT).

Context Description

So far, there are three parsers for Czech: Collins's, Zeman's and Zabokrtsky's. Collins's and Zeman's are statistical, both trained on PDT. (Collins converts the dependency structures to phrase structures and trains on those; after analysis, the phrase structure is converted back to a dependency tree. Zeman learns the dependencies directly and builds the dependency structure directly.) The parser by Zdenek Zabokrtsky has not been published yet. It is a hand-made parser: deterministic rules are used to "hang" nodes at appropriate positions. The parser never backtracks, although some rules are used to "correct" mistakes made by previous rules. Collins achieves ca. 83%, Zeman and Zabokrtsky ca. 70% unlabelled dependency accuracy.

I'm not a linguist and I've never been trained to annotate sentences of PDT. My knowledge of Czech grammar is based on my high-school education. This allows me to understand the "gold annotation" in PDT and to see precisely where an error is. On the other hand, I'm aware of only the most basic grammatical rules.

Characteristic of Czech and PDT

Czech is a "free word order" language, and indeed many word order permutations are acceptable. The syntactic structure of a sentence is revealed to the recipient mostly by means of rich inflection and several agreement (congruence) rules.

However, as documented in PDT, most sentences are projective. (I didn't have enough time to measure this exactly, sorry.)
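As a minimal sketch of what "projective" means here: an edge between a head and its dependent is projective if every token lying between them is a (transitive) descendant of the head. The encoding below (a head-index array, 0 for the artificial root) is my assumption, not the PDT file format.

```python
def is_projective(heads):
    """Check projectivity of a dependency tree.

    heads[i] is the index of the head of token i (0 = artificial root);
    tokens are numbered from 1, heads[0] is an unused placeholder.
    """
    n = len(heads) - 1
    for dep in range(1, n + 1):
        head = heads[dep]
        lo, hi = min(head, dep), max(head, dep)
        for mid in range(lo + 1, hi):
            # climb from the intervening token towards the root;
            # we must pass through the head of the edge under test
            node = mid
            while node != 0 and node != head:
                node = heads[node]
            if node != head:
                return False
    return True

print(is_projective([0, 2, 0, 2]))     # True: a simple projective tree
print(is_projective([0, 3, 4, 0, 3]))  # False: edges (3,1) and (4,2) cross
```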

The Czech morphological tagset is rich (xxx tags defined, ~xxx observed in the Czech National Corpus). The tags encode: part of speech (fine-grained), number, gender, case, tense, grade (comparative/superlative), and the possessor's number and gender (differentiating "father's", "fathers'" and "mother's", for instance). Where a feature is not applicable, no value is given.

Experiments

Here I describe several distinct approaches to deriving a constraint-based grammar for Czech from PDT.

For easier analysis, I first converted PDT to a list of observed edges, one edge description per line. Then I used "sort | uniq -c | sort -r -n" to count the distinct (detailed) types of edges. These are the most common edges:

xxx
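The counting step can be sketched in Python as well; the field layout of an edge line (head tag, dependent tag, label) and the example tags are my assumptions, the counting itself is equivalent to the `sort | uniq -c | sort -r -n` pipeline above.

```python
from collections import Counter

# One observed edge per line, e.g. "head_tag<TAB>dep_tag<TAB>label"
# (this layout is an illustrative assumption, not the actual PDT export).
edges = [
    "N\tA\tAtr",   # noun governs adjective, attribute
    "V\tN\tObj",
    "N\tA\tAtr",
    "V\tN\tSb",
    "N\tA\tAtr",
]

# Equivalent of `sort | uniq -c | sort -r -n`:
counts = Counter(edges)
for edge, freq in counts.most_common():
    print(freq, edge.replace("\t", " "))
```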

The negative table

The negative table is a set of hard constraints of this type:

xxx

The constraints forbid any edge labelled X between two nodes with categories (parts of speech) Cat1 and Cat2 if no such edge was seen between two words of those categories anywhere in PDT.

However, due to the relatively rich set of labels available and their varied usage, this set of constraints still leaves hundreds of solutions open for a sentence.
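The negative table can be sketched as a set membership test; the tags and labels below are invented for illustration, not taken from the actual table.

```python
# Hard constraint: allow only (Cat1, label, Cat2) edge types that were
# observed somewhere in the treebank; forbid everything else.
observed = {
    ("V", "Sb", "N"),   # verb -- subject --> noun
    ("V", "Obj", "N"),  # verb -- object --> noun
    ("N", "Atr", "A"),  # noun -- attribute --> adjective
}

def edge_allowed(head_cat, label, dep_cat):
    """Return True iff this edge type was seen in the treebank."""
    return (head_cat, label, dep_cat) in observed

print(edge_allowed("V", "Sb", "N"))   # True: observed
print(edge_allowed("A", "Sb", "V"))   # False: never observed, forbidden
```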

Preferred edges and preferred roots

  • roots added later
  • many many modelling errors

Complex prepositions

There are expressions in Czech that consist of more than one word but are treated as a single preposition in PDT. Generally, a preposition node is placed between the noun and its governor: sit\on\table

Two-word complex prepositions, such as "na zaklade" ("on basis_of"), are rendered in PDT this way: decide\(on/basis_of\experience). The important difference is that the simple preposition "on" is no longer placed between two nodes but is a leaf.

There are also complex prepositions of three words: "na rozdil od", "in contrast to", xxx
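The two encodings above can be sketched as child-to-parent maps (English glosses only; the representation is my illustrative assumption, not a PDT format):

```python
# Simple preposition: sit \ on \ table -- "on" is an inner node
simple = {"on": "sit", "table": "on"}

# Complex preposition "na zaklade" ("on basis_of"):
# decide \ (on / basis_of \ experience) -- "on" hangs as a leaf
complex_prep = {
    "on": "basis_of",
    "basis_of": "decide",
    "experience": "basis_of",
}

def is_leaf(tree, node):
    """A node is a leaf if nothing in the tree has it as a parent."""
    return node not in tree.values()

print(is_leaf(simple, "on"))        # False: governs "table"
print(is_leaf(complex_prep, "on"))  # True: the key structural difference
```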

Halfedges of varying granularity

Summary

I tried several methods to extract "grammatical knowledge" from PDT and to express it by means of constraints for the ConstraitsDependencyParser. I didn't try to design any constraints matching grammatical rules by hand, except for projectivity constraints.

...wait to learn more.

Handmade grouping of edges

To find the most common and most frequent rules, I tried to classify all observed edges by incrementally deciding which types of edges to join.

Observations:

  • The most frequent edges deal with punctuation, apposition, indirect speech, abbreviations, coordination of various types, and multi-word prepositions. All these phenomena are either "arbitrary encoding standards" (punctuation, indirect speech, etc.) or their complexity exceeds unary constraints (coordination, etc.), so I can cover them only by this clustering approach.

  • All the "examples of typical linguistic phenomena", such as adjective-noun agreement, are observed only very sparsely, so one has to look at many infrequent edges to get a feel for the phenomenon.

Summary

Given a treebank and "no knowledge" of grammar, it is hard even for a human to summarize the positive evidence. It would be even harder to handle negative evidence ("a noun in the genitive did not attach to the preceding verb, although it generally does"). For a treebank of 10^6 words, summing up the positive evidence is bearable, although time-consuming. Searching for negative evidence takes xxx times longer.

-- OndrejBojar - 03 Mar 2003
 