EACL 2003

11th Conference of the European Chapter of the Association for Computational Linguistics


Conference Dates

  • Date: April 12-17, 2003
  • Location: Budapest, Hungary
  • Registration deadline: 10 November 2002
  • Submission deadline: 15 November 2002
  • Notification of acceptance: January 15, 2003
  • Camera ready papers due: February 10-15, 2003 (exact date TBA)
  • EACL-03 Conference: April 12-17, 2003
  • URLs: EACL, EACL03
  • Paper Title: "Constraint Based Integration of Deep and Shallow Parsing Techniques"
  • Authors: MichaelDaum, KilianAFoth, WolfgangMenzel
  • Maximum Number of Pages: 8 (two columns, single sided)
  • Summary: "The contribution of different shallow processing components like taggers and chunkers to the performance of a deep syntactic parser has been investigated. This is done by extending Weighted Constraint Dependency Grammars to also take information from external sources into consideration."
  • Submission ID: 178
  • Submission to eacl03@ufal.mff.cuni.cz, attach paper as 178.ps, subject line "178"
  • Source access: cvs co ecal2003

Acceptance Email

EACL paper 178
From: Jan Hajic <hajic@ufal.ms.mff.cuni.cz>
To: micha@nats.informatik.uni-hamburg.de

Dear Michael Daum,

Congratulations! We are pleased to let you know that your paper 

178

titled

Constraint Based Integration of Deep and Shallow Parsing Techniques

has been accepted by the EACL'03 Programme Committee to be presented
at this year's EACL. Attached below are comments to the author(s) 
from the reviewers that read your paper. Please read them carefully 
to make the final version of your paper perfect in every respect.

The competition for EACL'03 was unusually high - there were 181
submissions accepted for review, and only 48 could be accepted
to be presented in two parallel sessions.

If your paper has been accepted elsewhere in the meantime, and you
prefer the other venue to publish the paper, please let us know
immediately (at hajic@ufal.mff.cuni.cz).

The final version of your paper must be no more than 8 pages
long. Exact instructions for camera-ready copy of your paper and for
uploading an electronic version of it will be sent to you shortly.

Once more, congratulations, and see you soon at EACL'03 in Budapest,

-- Jan Hajic & Ann Copestake
EACL'03 program committee co-chairs

Ratings

Review | Appropriateness | Correctness | Implications | Originality | Empirical Grounding | Clarity | References | In or Out
First  | 5 | 4 | 4 | 4 | 3 | 3 | 5 | 4
Second | 5 | 4 | 3 | 3 | 4 | 5 | 3 | 3
Third  | 5 | 3 | 2 | 3 | 4 | 3 | 3 | 2-3

Explanation of categories

Appropriateness

Appropriateness: Does the paper fit in EACL-03?
        5: Definitely
        4: Probably
        3: Uncertain
        2: Probably not
        1: Certainly not

Correctness

Correctness: Does the paper appear to be flawed technically and/or methodologically?
        5: Impeccable
        4: The paper is OK
        3: Only trivial flaws
        2: Minor flaws that must be corrected
        1: Major flaws that make the paper unsound/inconsistent

Implications

How important is the work?
        5: Will change the future
        4: People will read and cite this paper
        3: Restricted interest
        2: Not of compelling interest
        1: Will have no impact on the field

Originality

How novel is the approach? 
        5: A radically new approach
        4: An innovative use
        3: A new application of well known techniques  
        2: Yet another application of well worn techniques 
        1: Entirely derivative

Empirical Grounding

Does this paper contain information about evaluation?
        5: Excellent evaluation
        4: Good evaluation
        3: Some evaluation
        2: Evaluation is weak
        1: Should have contained some evaluation, but it didn't; 
           or it did but the evaluation was bogus
      N/A: Does not apply

Clarity

Is it clear what was done?
        5: Presentation is very clear
        4: Difficult, but understandable 
        3: Some parts were not clear to me 
        2: Most of the paper is unclear
        1: Presentation is very confusing

References

Is the bibliography relevant and exhaustive? 
        5: Thorough
        4: Pretty good, but a few missing
        3: Some citations, but some missing
        2: Scrappy citations; a lot missing
        1: Virtually no relevant references cited

Out or In

Should the paper be rejected or accepted? 
        5: I would fight to have this paper accepted
        4: I would like this paper accepted
        3: I am undecided
        2: I would like this paper rejected
        1: I would fight to have this paper rejected

First Review

The paper describes the combination of a part-of-speech tagger,
chunker, and constraint-based dependency parser. The authors nicely
show the benefits of bringing these three components
together. However, some points need clarification and the evaluation
requires extension.

1) is the corpus especially created for this investigation? How were
   the sentences selected? Statistics about the corpus would be
   useful: average sentence length, vocabulary size, ...
   The percentage of 29% unknown tokens seems very high.

2) The authors use smoothing for their constraints (section 4).
   How is the smoothing done?

3) The sample constraints in the paper, e.g., 
   {X:SYN} : tagger : [ tag_score(X@id) ] : false; 
   are not interpretable without further description. Since it is
   clear what was done from the text, the authors might also remove
   the examples to gain space for other parts

4) 92.3% for pos tagging and 87.4/82.3% for chunking German seem
   low. The explanations that the STTS makes difficult distinctions
   and that German is difficult in general are not convincing, since
   other publications report higher accuracies for German and STTS.
   
5) The details of ``multi-tagging'' and ``single-tagging'' were not
   clear to me. E.g., in single-tagging, does the parser see the prob
   that is assigned to a tag or is it forced to be 100%? How many tags
   are passed to the parser? What are the recall rates in
   multi-tagging for the tagger, i.e. when including the 2nd, 3rd,
   ..., tag?

Second Review

page (p) 1, beginning 2nd column (c):
" ... in case or ordering preferences." -->
" ... in case of ordering preferences."

p. 1, 2nd c., dots 3 and 4:
arguments also hold for other constraint-/unification-based formalisms!?

p. 1, 2nd c., at the end (weak integration argument):
this can be realized in a procedural framework, but perhaps not that
easily

p. 2, 3, 4/5 (notation of constraints):
I have problems when reading the constraints, please be more verbose
here; you sometimes write {X:SYN}, but later I found {X!SYN}; you say
(first example) subjects TYPICALLY precedes their finite verb, but then
write the constraint penalty (?) as 0.1, hmmm ...

p. 2, section (s) 3:
I would like to see the runtime performance with respect to sentence
length (as in table 1)---the general remark that a sentence is cut
after three minutes is not so significant

beginning p. 3: 
parsing process would benefit from additional information as long as
information can be produced QUICKLY (your text): again, in what time

(sideline question here:
can you foresee that WCDG can be applied in the near future to `real
world problems' or will it play the role of a more experimental
framework?)

p. 3, s. 4, 2nd c:
you say that the integration of the tagger results in a speed up of
3---in single or multi-tagger mode??

p. 4, 1st c:
you say that errors of the tagger are compensated through the combined
evidence of other constraints---how can you control this??

p. 4, s. 5:
I think the trend now rather is that a chunker not only brackets
a structure but also assigns `shallow' information to its substructure

p. 4/5, s. 5:
does the 0.0 in the constraint mean absolute certainty??

p. 5, s 5:
again, I would like to see how much the chunker can contribute to the
runtime of the overall system (I guess, a lot)

s. 6: I'm missing other frameworks which also have tried to integrate
additional `shallow' sources, e.g., LFG and HPSG (see, e.g., last
Proc. ACL2002: Crysmann et al., Riezler et al.)

Third Review

  
The paper describes the integration of tagger and chunker information
into a dependency grammar with weighted constraints. The additional
information prunes the search space, increasing the probability of the
parser to find the optimal parse.

What I am missing in the paper is a discussion of how the weights are
chosen. Presumably, they could be automatically learnt from training
data.

Further comments:

The formal language used in the constraint examples is probably unknown
to most readers.

last paragraph of chapter 2: Introducing constraint weights turns the
CSP into a ... problem, which is ... harder to solve.

If all constraints are local (i.e. depend only on a node and its
daughters), you could apply the Viterbi algorithm to select the best
parse.
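
A quick illustration of the reviewer's suggestion (a generic Viterbi sketch,
not part of the review and not the WCDG implementation; the score functions
and label set are placeholders):

    def viterbi(tokens, labels, local_score, pair_score):
        """Return the label sequence maximising the product of local scores."""
        # best[i][l]: best score of any labelling of tokens[0..i] ending in label l
        best = [{l: local_score(tokens[0], l) for l in labels}]
        back = [{}]
        for i in range(1, len(tokens)):
            best.append({})
            back.append({})
            for l in labels:
                prev, score = max(
                    ((p, best[i - 1][p] * pair_score(p, l)) for p in labels),
                    key=lambda x: x[1],
                )
                best[i][l] = score * local_score(tokens[i], l)
                back[i][l] = prev
        # follow the back-pointers from the best final label
        last = max(labels, key=lambda l: best[-1][l])
        path = [last]
        for i in range(len(tokens) - 1, 0, -1):
            path.append(back[i][path[-1]])
        return list(reversed(path))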

Chapter 3:
How did you choose the test data?
How do you deal with unknown words?
Which result does the parser return when it is stopped because of a time-out?
How do you choose the weights and what are the effects of different
weighting schemes?

Instructions for the camera-ready version

(Email from Patrick Paroubek eacl03@limsi.fr on Fri, 24 Jan 2003)

Final Versions of EACL-2003 Conference Papers

NB. These instructions are for authors of papers accepted to the EACL-2003 conference. Authors of papers for workshops, please contact the corresponding organizers.

  1. Authors must print the attached Copyright Transfer Agreement (available at URL http://www.limsi.fr/Recherche/CORVAL/eacl03/copyright.html), complete it and return it along with their final camera-ready copies to the address given below.
  2. Papers should be prepared according to the following guidelines (also available in the style sample files at http://www.limsi.fr/Recherche/CORVAL/eacl03):
    • All authors are required to adhere to these specifications. Since the proceedings will appear in hardcopy and electronic form, authors are required to provide a camera-ready and a Portable Document Format (PDF) version of their papers. The hardcopy must be produced with a laser printer at 300 dpi resolution or better, printed on A4 (210mm x 297mm) paper.

    • General Instructions. Manuscripts must be in two-column format. Exceptions to the two-column format include the title, authors' names and complete addresses, which must be centered at the top of the first page, and any full-width figures or tables. Type single-spaced. Use only one side of the page. Start all pages directly under the top margin. See the guidelines later regarding formatting the first page (see "First page" below). If the paper is produced by a printer, make sure that the quality of the output is dark enough to photocopy well. It may be necessary to have your laser printer adjusted for this purpose.

      Do not print page numbers on the manuscript. Write them lightly on the back of each page in the upper left corner along with the (first) author's name. The maximum length of a manuscript is eight (8) pages for the main conference, printed single-sided.

    • Electronically-available resources. ACL provides this description in LaTeX2e (eacl2003.tex) and PDF format (eacl2003.pdf), along with the LaTeX2e style file used to format it (eacl2003.sty) and an ACL bibliography style (acl.bst). These files are all available at http://www.limsi.fr/Recherche/CORVAL/eacl03. A Microsoft Word template file (eacl2003.dot) is also available at the same URL. We strongly recommend the use of these style files, which have been appropriately tailored for the EACL-2003 proceedings. For reasons of uniformity, Adobe's Times Roman font should be used. In LaTeX2e this is accomplished by putting
      \usepackage{times} 
      \usepackage{latexsym}
      
      in the preamble.

    • For the production of the electronic manuscript you must use Adobe's Portable Document Format (PDF). This format can be generated from postscript files: on Unix systems, you can use ps2pdf for this purpose; under Microsoft Windows, Adobe's Distiller can be used. Note that some word processing programs generate PDF which may not include all the necessary fonts (esp. tree diagrams, symbols). When you print or create the PDF file, there is usually an option in your printer setup to include none, all or just nonstandard fonts. Please make sure that you select the option of including ALL the fonts. Before sending it, test your PDF by printing it from a computer different from the one where it was created. Moreover, some word processors may generate very large postscript/PDF files, where each page is rendered as an image. Such images may reproduce poorly. In this case, try alternative ways to obtain the postscript and/or PDF. One way on some systems is to install a driver for a postscript printer, send your document to the printer specifying "Output to a file", then convert this file to PDF. Additionally, it is of utmost importance to specify the A4 format (210mm x 297mm) when formatting the paper. When working with dvips, for instance, one should specify -t a4. Print-outs of the PDF file on A4 paper should be identical to the hardcopy version. If you cannot meet the above requirements about the production of your electronic submission, please contact the publication chair Patrick Paroubek (eacl03@limsi.fr) as soon as possible.

    • Layout. Print the manuscript two columns to a page, in the manner these instructions are printed. The exact dimensions for a page on A4 paper are:
      • Left margin of left page: 15mm
      • Right margin of left page: 35mm
      • Left margin of right page: 35mm
      • Right margin of right page: 15mm
      • Top margin: 31mm
      • Bottom margin: 36mm
      • Column width: 77mm
      • Column height: 230mm
      • Gap between columns: 6mm
      Exceptionally, authors for whom it is impossible to print on A4 paper may use US Letter paper. In this case, they should keep the top and left margins as given above, use the same column width, height and gap, and modify the bottom and right margins as necessary. Note that the text will no longer be centered.

    • The First Page. Center the title, author's name(s) and affiliation(s) across both columns. Do not use footnotes for affiliations. Do not include the paper ID number that was assigned during the submission process. Use the two-column format only when you begin the abstract.
      • Title
      • Place the title centered at the top of the first page, in a 15 point bold font. Long titles should be typed on two lines without a blank line intervening. Put the title at 25mm from the top of the page, followed by a blank line, then the author's name(s), and the affiliation on the following line. Do not use only initials for given names (middle initials are allowed). The affiliation should contain the author's complete address, and if possible an electronic mail address. Leave about 20mm between the affiliation and the body of the first page.
      • Abstract: Type the abstract at the beginning of the first column. The width of the abstract text should be smaller than the width of the columns for the text in the body of the paper by about 6mm on each side. Center the word Abstract in a 12 point bold font above the body of the abstract. The abstract should be a concise summary of the general thesis and conclusions of the paper. It should be no longer than 200 words.
      • Text: Begin typing the main body of the text immediately after the abstract, observing the two-column format as shown in the present document. Type single spaced. Use standard fonts such as Times Roman or Computer Modern Roman, 11 to 12 points for text, 14 to 16 points for headings and title.
      • Indent when starting a new paragraph. For reasons of uniformity, use Adobe's Times Roman fonts, with 11 points for text and subsection headings, 12 points for section headings and 15 points for the title. If Times Roman is unavailable, use Computer Modern Roman (LATEX2e's default). Note that the latter is about 10% less dense than Adobe's Times Roman font.

    • Sections
      • Headings: Type and label section and subsection headings in the style shown on the present document. Use numbered sections (Arabic numerals) in order to facilitate cross references. Number subsections with the section number and the subsection number separated by a dot, in Arabic numerals. Do not number subsubsections.
      • Citations: Follow the "Guidelines for Formatting Submissions" to Computational Linguistics that appears in the first issue of each volume, if possible. That is, citations within the text appear in parentheses as (Gusfield, 1997) or, if the author's name appears in the text itself, as Gusfield (1997). Append lowercase letters to the year in cases of ambiguities. Treat double authors as in (Aho and Ullman, 1972), but write as in (Chandra et al., 1981) when more than two authors are involved. Collapse multiple citations as in (Gusfield, 1997; Aho and Ullman, 1972).

      • References: Gather the full set of references together under the heading References; place the section before any Appendices, unless they contain references. Arrange the references alphabetically by first author, rather than by order of occurrence in the text. Provide as complete a citation as possible, using a consistent format, such as the one for Computational Linguistics or the one in the Publication Manual of the American Psychological Association (Association, 1983). Use of full names for authors rather than initials is preferred. A list of abbreviations for common computer science journals can be found in the ACM Computing Reviews (for Computing Machinery, 1983). The provided LATEX and BibTEX style files roughly fit the American Psychological Association format, allowing regular citations, short citations and multiple citations as described above.

      • Appendices: Appendices, if any, directly follow the text and the references. Letter them in sequence and provide an informative title: Appendix A. Title of Appendix.

      • Acknowledgement sections should go as a last section immediately before the references. Do not number the acknowledgement section.
    • Graphics
      • Illustrations: Place figures, tables, and pictures in the paper near where they are first discussed, rather than at the end, if possible. Wide illustrations may run across both columns. Do not use color illustrations as they may reproduce poorly.

      • Captions: Provide a caption for every illustration; number each one sequentially in the form: "Figure 1. Caption of the Figure." "Table 1. Caption of the Table." Type the captions of the figures and tables below the body, using 11 point text.

      • Length of submission: eight pages (8) is the maximum length of papers for the EACL-2003 main conference. All illustrations, references, and appendices must be accommodated within these page limits, observing the formatting instructions given in the present document. Papers that do not conform to the specified length and formatting requirements run the risk of not being included in the proceedings.
        Up to two (2) additional pages may be purchased from ACL at the price of 250 EUR per page; please include a check payable to the Association for Computational Linguistics along with your camera ready copies.

  3. Be warned that use of the style file provided by EACL03, on some systems, may produce incorrect results. Therefore, verify that the margins and font size of your output conform to the guidelines. If they do not, you are responsible for making the appropriate adjustments.

  4. Three copies of your final, camera-ready paper must reach us no later than February 15th, 2003. In order to guarantee that your paper is included in the final proceedings, we ask that you take care to ensure that the paper is RECEIVED by February 15th, 2003. The mailing address:

    EACL-2003 Paper c/o Patrick Paroubek
    LIMSI - CNRS
    Bâtiment 508
    Université Paris XI
    BP 133
    91403 ORSAY Cedex
    FRANCE
    You can use (33) (0)1 69 85 80 58 if the courier service you are using requires a phone number to be listed.

  5. The EACL proceedings will also be available on a CDROM. Therefore, in addition to the three hardcopies, you are also required to provide a pdf file of your paper. The authors are responsible for ensuring that the suitable fonts are included, when necessary, in preparing the pdf file (see format instructions above). If you are unable to produce a pdf file, please let us know immediately (eacl03@limsi.fr).

    The pdf file for the CD-ROM proceedings must also be submitted on or before February 15th, 2003. The submission of the pdf file can be done by (in order of preference):

    • Sending an email to eacl03@limsi.fr with the URL of the site from which we can download the pdf file.
    • Sending the pdf file via email (addressed to eacl03@limsi.fr).

  6. N.B. Please send an email message to eacl03@limsi.fr giving us the name and email address of the author we can contact during the period February 15th, 2003 through March 10, 2003. We may need to contact you in case there are any problems we notice with the submission of the hardcopies or the pdf file.
     
     

    Most recent author: Patrick Paroubek (made from K. Vijay-Shanker's ACL 2000 version).
     
     

    Kilian's Feedback from the Conference

    Subject: I'm back
    Date: Today 08:40:20
    From: "Kilian A. Foth" 
    

    I have returned from the 10th EACL (read the preface of the proceedings for full details on the occasionally disputed numbering) and brought with me the proceedings, which reside here.

    Our own contribution had a friendly hearing. Here is the Q&A session, as well as I remember it:

    QUESTIONER 1: I would like to know, how do you determine the weights of your different constraints?

    ME: Well, the short answer is, We make them up. The long answer is, We tried optimizing the weights with genetic algorithms, and that gives about the same performance, but it takes very long to compute weights that way, so we stopped doing that.

    QUESTIONER 2: I wanted to ask exactly the same thing...

    QUESTIONER 3: Have you considered statistic means instead?

    ME: You mean, count them from a corpus? Well, so far we had no big annotated reference corpus, so the question did not arise. [I actually think we should try this, however.]

    QUESTIONER 4: I was a bit surprised about the tagger performance you cited. I think the authors of TnT report about 97% correctness...

    ME: That's true - we were surprised as well. Could be that our corpus was too systematically different from the training corpus - more unknown company names, etc. Anyway, our goal was not having great tagging performance, but showing how useful it is even if it isn't great.

    SESSION CHAIR: I was intrigued by your saying that you did genetic experiments. What kind of reference corpus did you have for that, then?

    ME: At the time we were using only 220 sentences from Verbmobil, and for that we easily created the annotations ourselves.

    Annotations on other talks

    Many of the other contributions were thought-provoking. I refer you in particular to the contributions of Messieurs Schiehlen and Kepser.

    Here are some annotations of talks I heard:

    Curin, Cmejrek, Havelka: Czech-English Dependency Tree-based Machine Translation

    This describes a fully automatic Czech-to-English text translation system that achieves BLEU scores of up to 0.20 (a human translator achieves ~0.55). An interesting aspect is the use of both `analytic' (surface-oriented) dependency trees and `tectogrammatical' structure, which aims to abstract from surface phenomena and appears to be a very fashionable device at the moment.

    James R. Curran, Stephen Clark: Investigating GIS and Smoothing for Maximum Entropy Taggers

    The authors apply ME models to POS tagging with both the standard PTB tagset and `lexical types' (from Combinatory Categorial Grammar). The problem with ME models is that the expected value required for determining the model parameters is usually too expensive to calculate, so that approximations must be used. This contribution shows that the popular Generalised Iterative Scaling algorithm often used for that purpose can be made substantially simpler with no quality loss. It also compares different methods of smoothing the objective function and recommends a Gaussian prior over the usual simple cutoff.
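
    For orientation, the textbook GIS parameter update that this work speeds up looks roughly like this (a generic sketch, not the authors' improved variant; all names are made up):

      import math

      def gis_step(lambdas, empirical_exp, model_exp, C):
          """One classic Generalised Iterative Scaling update.

          GIS requires the features of every event to sum to the constant C
          (usually enforced with an extra 'slack' feature); empirical_exp and
          model_exp are the feature expectations under the training data and
          under the current model, respectively.
          """
          return [lam + (1.0 / C) * math.log(emp / mod)
                  for lam, emp, mod in zip(lambdas, empirical_exp, model_exp)]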

    Markus Dickinson, Detmar Meurers: Detecting Errors in Part-of-Speech Annotation

    Even allegedly `gold-standard' POS annotations are known to have many errors in them. Methods are proposed for detecting some classes of errors:

    1. if the same stretch of text occurs several times in a tree bank with different tags for a particular word, at least one of the instances is likely wrong (a toy version of this check is sketched after this list). 4417 errors in the WSJ corpus were found this way.
    2. closed classes only contain a finite number of words - every other word bearing this tag is an error. 94 errors of this extremely obvious kind were found in the WSJ corpus.
    3. when the annotator's guide gives explicit examples of how to tag particular patterns, one can easily check whether they were, in fact, tagged like this. As an example, 2466 instances of hyphenated pre-noun modifiers were mistagged as NN when the guide explicitly says they are always JJ.
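
    A toy rendering of the first heuristic (my own much-reduced sketch, not the authors' variation-n-gram implementation):

      from collections import defaultdict

      def variation_candidates(tagged_sents, context=2):
          """Flag words that carry different tags in identical surrounding text.

          tagged_sents: sentences as lists of (word, tag) pairs.  A word that
          occurs with the same `context` words on both sides but with different
          tags is a candidate annotation error (heuristic 1 above).
          """
          seen = defaultdict(set)  # (left context, word, right context) -> tags
          for sent in tagged_sents:
              words = [w for w, _ in sent]
              for i, (word, tag) in enumerate(sent):
                  left = tuple(words[max(0, i - context):i])
                  right = tuple(words[i + 1:i + 1 + context])
                  seen[(left, word, right)].add(tag)
          return {key: tags for key, tags in seen.items() if len(tags) > 1}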

    More interesting is a reference to work by Kveton & Oliva (2002) who are reported to have found 2661 errors in the much smaller NEGRA corpus. We definitely should try to find out whether these corrections are available.

    James Henderson: Neural Network Probability Estimation for Broad Coverage Parsing

    This describes yet another statistical parser of English with an F-measure of 89% on the Penn Treebank. (The author explicitly says that part of the achievement is to `add to the diversity of available broad coverage parsing methods'.) A statistical left-corner parser is used, with the innovation that the derivation history is not represented by hand-crafted features; instead a neural network automatically induces features from the unbounded parse history.

    Kehagias, Fragkou, Petridis: Linear Text Segmentation using a Dynamic Programming Algorithm

    A simple sentence similarity measure and a dynamic programming algorithm are used to segment text automatically, i.e. to compute both the number and the extent of cohesive segments. The most difficult part seems to be defining a good evaluation measure, since a `near miss' should be regarded as better than a complete miss.
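
    The overall shape of such an approach, as I read it (a toy sketch, not the authors' algorithm or their evaluation measure; seg_cost and penalty are placeholders):

      def segment(sentences, seg_cost, penalty):
          """Split sentences into contiguous segments by dynamic programming.

          seg_cost(i, j) says how badly sentences[i:j] hang together (e.g. one
          minus their average pairwise similarity); penalty is a per-segment
          charge that indirectly controls how many segments are chosen.
          Returns the end index of each segment.
          """
          n = len(sentences)
          best = [0.0] + [float("inf")] * n  # best[j]: cost of segmenting sentences[:j]
          prev = [0] * (n + 1)
          for j in range(1, n + 1):
              for i in range(j):
                  cost = best[i] + seg_cost(i, j) + penalty
                  if cost < best[j]:
                      best[j], prev[j] = cost, i
          # walk back through the recorded choices to recover the boundaries
          bounds, j = [], n
          while j > 0:
              bounds.append(j)
              j = prev[j]
          return sorted(bounds)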

    Stephan Kepser: Finite Structure Query: A Tool for Querying Syntactically Annotated Corpora

    We absolutely must have this tool. Please download it from tcl.sfs.uni-tuebingen.de/fsq immediately. Basically, it is a Java application that supports full first-order logic for querying tree banks in the NEGRA export format -- pretty much exactly what we need. The TIGERSearch language does not even allow you to ask, `How many verb phrases don't have a subject?' This tool does. There is no integrated tree viewer, though.

    Manios, Nenadic, Spasic, Ananiadou: An Integrated Term-Based Corpus Query System

    The authors claim these advances over previous automatic retrieval systems:
    • dynamic acquisition of terminological knowledge (i.e. technical terms)
    • XML representation of terminology processing results
    • query composition GUI with the particular operations requested by domain experts

    Paola Merlo: Generalised PP-attachment Disambiguation Using Corpus-based Linguistic Diagnostics

    The classical experiment in PP attachment poses a somewhat artificial binary problem for the classifier: distinguishing between verb attachment and noun attachment. This paper also models the much harder, but important distinction between adjunct PPs and argument PPs (like our `PP' vs. `OBJP'). A surprising number of linguistic indicators of `argumenthood' can be directly detected in a large corpus, e.g. optionality, repeatability, etc.

    The direct four-way classification works better than successive binary classification, which suggests that the current formulation of the task is not adequate for useful application (my interpretation). The four-way distinction is made with an accuracy of 74%.

    Peng, Schuurmans, Keselj, Wang: Language Independent Authorship Attribution with Character Level N-Grams

    Authorship attribution is a highly speculative and somewhat doubtful branch of CL, since where it really would count, i.e. in court, it usually deals with specimens where the author is actively working against the experimenter. However, good successes have been achieved `in the lab', i.e. on samples of different authors' normal writing. Usually, particular linguistic style markers are extracted from a disputed text and then used for some classification algorithm, but the choice of features is somewhat of a black art. This paper proposes a simple n-gram similarity model on the level of characters and shows that it works well on undisputed texts.

    (So far, all related work was on the level of words, but for ideographic languages the tokenization of running text is already a very difficult problem. Thus, when the authors claim to be language-independent, they really mean `it also works for Chinese'.)
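
    The character-level idea itself is simple; a toy sketch (plain L1 distance between raw n-gram profiles, only to illustrate the idea, not the paper's actual model):

      from collections import Counter

      def char_ngram_profile(text, n=3):
          """Relative frequencies of character n-grams in text."""
          grams = Counter(text[i:i + n] for i in range(len(text) - n + 1))
          total = sum(grams.values()) or 1
          return {g: c / total for g, c in grams.items()}

      def attribute(disputed, candidates, n=3):
          """Pick the candidate author whose sample is closest to the disputed text.

          candidates maps author name -> sample text.
          """
          target = char_ngram_profile(disputed, n)
          def dist(sample):
              prof = char_ngram_profile(sample, n)
              keys = set(target) | set(prof)
              return sum(abs(target.get(k, 0.0) - prof.get(k, 0.0)) for k in keys)
          return min(candidates, key=lambda author: dist(candidates[author]))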

    Gerald Penn: AVM Description Compilation using Types as Modes

    This entire paper is about techniques for compiling efficient code from descriptions in typed feature logic (e.g. for HPSG). There is an ongoing struggle between proponents of readable or appropriate formalism elements and detractors who say that they cannot possibly justify the extra cost, and that you should never ever use, say, disjunctive feature structures.

    Much more informative than the written version was the talk itself, and yet more informative was the exchange afterwards during question time, which went about as follows:

    LISTENER: You've talked about how to compile typed feature structures into more efficient code, but isn't the typing itself the source of much of the inefficiency you are battling? Now, I work in lexical-functional grammar...

    PENN: Yes, I know...

    LISTENER: ...and there you simply don't have these problems.

    PENN: That's true, but I consider the enormous advantage in understandability to amply outweigh that disadvantage. If you want to build a realistic grammar of a whole language you absolutely need to make it as easy for the grammar writer as possible...

    LISTENER: But there are two HPSG grammars that aim to model all of English, and one LFG.

    PENN: Yes, I know, and it's kind of a thorn in my side that such things can exist at all... but the real question is, Do you want only me and you and Dan Flickinger to be able to write a real-life grammar, or any well-trained computational linguist?

    Gerald Penn & Mohammed Haji-Abdolhosseini: Topological Parsing

    Topological parsing is another very active item at the moment. Penn himself says it best in his abstract: "...these grammars contain two primitive notions of constituency that are used to preserve the semantic or interpretational aspects of phrase structure while at the same time providing a more efficient backbone for parsing based on word-order and contiguency constraints."

    The contrast is with normal CFG, where constituency and order must be expressed simultaneously, which is not appropriate when the word order is freer than in English. (One wonders how much damage has been done to computational linguistics by the simple fact that Chomsky was American.)

    Here a particular formalization of tectogrammatical vs. phenogrammatical structure is proposed, including a special parsing algorithm that proceeds top-down as well as bottom-up.

    Judita Preiss: Using Grammatical Relations to Compare Parsers

    It is argued that comparing different parsers is difficult because their output trees may look very different even for the same sentence. Instead, the evaluation measures how well they find particular grammatical relations (subj, dobj, etc.) in the Susanne corpus. Following Lin (1995), all parser output is mapped into (GR/head/dependent) triples that are exactly equivalent to labelled dependency edges, although that term is not used. Surprisingly, Buchholz's shallow GR finder does better than the state-of-the-art statistical parsers (Carroll, Charniak, Collins etc.). When using this information for anaphora resolution, the gap narrows quite a bit, apparently because the error is more often with the resolution algorithm than with the parsed input.
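
    For concreteness, once all outputs are mapped to such triples, comparing parsers boils down to set precision/recall (a trivial sketch, not Preiss's evaluation code):

      def gr_scores(gold, predicted):
          """Precision/recall over sets of (relation, head, dependent) triples,
          e.g. ("subj", "likes", "John")."""
          correct = len(gold & predicted)
          precision = correct / len(predicted) if predicted else 0.0
          recall = correct / len(gold) if gold else 0.0
          return precision, recall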

    Karl-Michael Schneider: A Comparison of Event Models for Naive Bayes Anti-Spam E-Mail Filtering

    This is nothing more than the testing of two Bayesian spam filters: a multivariate event model and a multinomial event model. Actually, this only means that you either distinguish the presence or absence of each word in a message, or the exact number of occurrences, and the latter method is very slightly more effective.
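
    The difference between the two event models is only in what a feature records; roughly (a toy sketch with a made-up vocabulary and whitespace tokenisation):

      from collections import Counter

      def bernoulli_features(message, vocabulary):
          """Multivariate event model: only presence/absence of each vocabulary word."""
          present = set(message.split())
          return {w: int(w in present) for w in vocabulary}

      def multinomial_features(message, vocabulary):
          """Multinomial event model: how often each vocabulary word occurs."""
          counts = Counter(message.split())
          return {w: counts[w] for w in vocabulary}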

    Sabine Schulte im Walde: Experiments on the Choice of Features for Learning Verb Classes

    This work tries to cluster German verbs into semantic groups automatically by three different types of features: syntactic frames, preposition preference, and selectional preference. GermaNet's top semantic classes are used for the last method. Although some plausible groupings can be achieved, the final result can only serve as the basis of manual classification. It is hypothesized that the idiosyncratic features of some verbs impose a limit on the usefulness of any set of features for automatic classification.

    Steedman, Sarkar, Osborne, Hwa, Clark, Hockenmaier, Ruhlen, Baker, Crim: Bootstrapping statistical parsers from small datasets

    The idea of co-training is investigated, where two statistical parsers are used to generate additional training material for each other. This method achieves better results than self-training, where the parser's own output structures are used as further training material. However, the output from a rival parser is still an order of magnitude less useful for training than a proper treebank, i.e. 100 `golden' trees help training more than 1000 co-trained ones.
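
    In outline, the co-training loop looks something like this (a sketch only; parser.train/parser.parse are assumed interfaces rather than any real parser's API, and the actual paper is far more careful about selecting which automatically produced parses to add):

      def co_train(parser_a, parser_b, seed_trees, raw_sentences, rounds=10, batch=30):
          """Each parser labels raw sentences as extra training material for the other."""
          extra_a, extra_b = [], []
          for _ in range(rounds):
              parser_a.train(seed_trees + extra_a)
              parser_b.train(seed_trees + extra_b)
              chunk, raw_sentences = raw_sentences[:batch], raw_sentences[batch:]
              # A's output trains B and vice versa; self-training would feed
              # each parser its own output instead
              extra_b += [parser_a.parse(s) for s in chunk]
              extra_a += [parser_b.parse(s) for s in chunk]
          return parser_a, parser_b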

    Research notes

    John Barnden, Sheila Glasbey, Mark Lee, Alan Wallington: Domain-transcending mappings in a system for metaphorical reasoning

    Metaphor involves the mapping of a set of concepts from a source domain to a target domain. Example: "We're driving in the fast lane on the freeway of love." This common metaphor establishes the mappings love->journey, lovers->travellers, relationship->vehicle. But what does the excitement of being in the fast lane map to? The authors argue that it maps to itself as a view-neutral mapping adjunct, namely that of mental state. They discuss several of these VNMAs and implement some of them in a system for defeasible metaphorical reasoning.

    Luisa Bentivogli, Emanuele Pianta: Beyond Lexical Units: Enriching WordNets with Phrasets

    WordNet can only represent lexical items (car), idiomatic expressions (kick the bucket) and restricted collocations (criminal record) in its synsets. Free combinations (fast car) are not stored because their meanings are purely compositional.

    But many of these combinations are very frequent and very helpful e.g. for word-sense disambiguation (in `fast car', `car' can only mean `automobile', not `railway carriage'). Also, lexical gaps (there is no Italian word for `paperboy') are frequently filled with such recurrent free phrases. Therefore the authors augment the WordNet formalism with phrasets to store them.

    Beate Dorow, Dominic Widdows: Discovering Corpus-Specific Word Senses

    Conjunctions of nouns in free text are used to group words into clusters with similar meanings. The employed algorithm is able to detect that `mouse' is ambiguous between `rodent' and `input device' and correctly assigns both meanings to the correct WordNet superclass.

    Toshiaki Fujiki, Hidetsugu Nanba, Manabu Okumura: Automatic Acquisition of Script Knowledge from a Text Collection

    Scripts are the descriptions of typical event sequences in a particular domain (e.g., arrive--choose--receive--eat--pay--leave). Here, script knowledge is extracted from newscasts by noting successive sentences, coordinated phrases, or subclauses with the same subject and object. The authors extract sequences such as find--arrest--prosecute. Although the paper does not mention this, the accuracy is really only about 50%.

    Birte Lönneker, Primo Jakopin: Contents and evaluation of the first Slovenian-German online dictionary

    This is simply the contents of the textbook that Birte learnt Slovenian from, annotated and digitized. Altogether about 70% of the more frequent tokens in typical newspaper text are covered.

    James McCracken, Adam Kilgarriff: Oxford Dictionary of English - current developments

    "In order for a non-formalised, natural-language dictionary like ODE to become properly accessible to computational processing, the dictionary content must be positioned within a formalism which explicitly enumerates and classifies all the information that the dictionary content itself merely assumes, implies, or refers to." The Oxford dictionary is the one that professional writers use to get the whole story, and then some. In preparation to making their definitions available to selected CL projects, the editors are adding supplementary data such as explicit morphology, collocations and domain labels semi-automatically.

    Horacio Saggion, Katerina Pastra, Yorick Wilks: NLP for Indexing and Retrieval of Captioned Photographs

    The criminal profiler, hero of much current TV fare, is usually represented as sitting in an office, patiently browsing through album after album of evidence from crime scenes, searching for patterns that will allow him to connect past and present cases, and unfortunately, this picture is largely accurate. This work aims to ease his job by parsing and analysing the handwritten photo captions, and adding limited inferring capability.

    Michael Schiehlen: A Cascaded Finite-State Parser for German

    This disturbing work cites a pure FST parser that achieves much the same results on the NEGRA corpus as we do.

    Zdenek Zabokrtsky, Otakar Smrz: Arabic Syntactic Trees: from Constituency to Dependency

    The first large-scale dependency tree bank of Arabic is being created, largely by converting existing phrase structure annotations. The authors describe work in progress that achieves about 60% structural correctness.

    Related Topics: ChunkerExperiments, TreeChunkerReport, HeiseGrammar, http://nats-www.informatik.uni-hamburg.de/intern/proceedings/2003/EACL

    -- MichaelDaum - 01 Nov 2002
 