From: Jan Hajic <hajic@ufal.ms.mff.cuni.cz>
To: micha@nats.informatik.uni-hamburg.de
Subject: EACL paper 178

Dear Michael Daum,

Congratulations! We are pleased to let you know that your paper 178, titled "Constraint Based Integration of Deep and Shallow Parsing Techniques", has been accepted by the EACL'03 Programme Committee to be presented at this year's EACL. Attached below are comments to the author(s) from the reviewers that read your paper. Please read them carefully to make the final version of your paper perfect in every respect.

The competition for EACL'03 was unusually high: there were 181 submissions accepted for review, and only 48 could be accepted to be presented in two parallel sessions.

If your paper has been accepted elsewhere in the meantime, and you prefer the other venue to publish the paper, please let us know immediately (at hajic@ufal.mff.cuni.cz).

The final version of your paper must be no more than 8 pages long. Exact instructions for the camera-ready copy of your paper and for uploading an electronic version of it will be sent to you shortly.

Once more, congratulations, and see you soon at EACL'03 in Budapest,

-- Jan Hajic & Ann Copestake
EACL'03 programme committee co-chairs
| Review | Appropriateness | Correctness | Implications | Originality | Empirical Grounding | Clarity | References | In or Out |
|---|---|---|---|---|---|---|---|---|
| First | 5 | 4 | 4 | 4 | 3 | 3 | 5 | 4 |
| Second | 5 | 4 | 3 | 3 | 4 | 5 | 3 | 3 |
| Third | 5 | 3 | 2 | 3 | 4 | 3 | 3 | 2-3 |
Appropriateness: Does the paper fit in EACL-03? 5: Definitely 4: Probably 3: Uncertain 2: Probably not 1: Certainly not
Correctness: Does the paper appear to be flawed technically and/or methodologically? 5: Impeccable 4: The paper is OK 3: Only trivial flaws 2: Minor flaws that must be corrected 1: Major flaws that make the paper unsound/inconsistent
Implications: How important is the work? 5: Will change the future 4: People will read and cite this paper 3: Restricted interest 2: Not of compelling interest 1: Will have no impact on the field
Originality: How novel is the approach? 5: A radically new approach 4: An innovative use 3: A new application of well known techniques 2: Yet another application of well worn techniques 1: Entirely derivative
Empirical Grounding: Does this paper contain information about evaluation? 5: Excellent evaluation 4: Good evaluation 3: Some evaluation 2: Evaluation is weak 1: Should have contained some evaluation, but it didn't; or it did but the evaluation was bogus N/A: Does not apply
Clarity: Is it clear what was done? 5: Presentation is very clear 4: Difficult, but understandable 3: Some parts were not clear to me 2: Most of the paper is unclear 1: Presentation is very confusing
References: Is the bibliography relevant and exhaustive? 5: Thorough 4: Pretty good, but a few missing 3: Some citations, but some missing 2: Scrappy citations; a lot missing 1: Virtually no relevant references cited
In or Out: Should the paper be rejected or accepted? 5: I would fight to have this paper accepted 4: I would like this paper accepted 3: I am undecided 2: I would like this paper rejected 1: I would fight to have this paper rejected
The paper describes the combination of a part-of-speech tagger, chunker, and constraint-based dependency parser. The authors nicely show the benefits of bringing these three components together. However, some points need clarification and the evaluation requires extension.

1. Is the corpus especially created for this investigation? How were the sentences selected? Statistics about the corpus would be useful: average sentence length, vocabulary size, etc. The percentage of 29% unknown tokens seems very high.
2. The authors use smoothing for their constraints (section 4). How is the smoothing done?
3. The sample constraints in the paper, e.g., {X:SYN} : tagger : [ tag_score(X@id) ] : false; are not interpretable without further description. Since what was done is clear from the text, the authors might also remove the examples to gain space for other parts.
4. 92.3% for POS tagging and 87.4/82.3% for chunking German seem low. The explanations that the STTS makes difficult distinctions and that German is difficult in general are not convincing, since other publications report higher accuracies for German and the STTS.
5. The details of "multi-tagging" and "single-tagging" were not clear to me. E.g., in single-tagging, does the parser see the probability that is assigned to a tag, or is it forced to be 100%? How many tags are passed to the parser? What are the recall rates in multi-tagging for the tagger, i.e. when including the 2nd, 3rd, ..., tag?
- p. 1, beginning of 2nd column: "... in case or ordering preferences." --> "... in case of ordering preferences."
- p. 1, 2nd column, dots 3 and 4: the arguments also hold for other constraint-/unification-based formalisms!?
- p. 1, 2nd column, at the end (weak integration argument): this can be realized in a procedural framework, but perhaps not that easily.
- pp. 2, 3, 4/5 (notation of constraints): I have problems when reading the constraints; please be more verbose here. You sometimes write {X:SYN}, but later I found {X!SYN}; you say (first example) that subjects TYPICALLY precede their finite verb, but then write the constraint penalty (?) as 0.1, hmmm ...
- p. 2, section 3: I would like to see the runtime performance with respect to sentence length (as in table 1); the general remark that a sentence is cut off after three minutes is not so significant.
- beginning of p. 3: the "parsing process would benefit from additional information as long as information can be produced QUICKLY" (your text): again, in what time? (Side question: can you foresee WCDG being applied to 'real world problems' in the near future, or will it play the role of a more experimental framework?)
- p. 3, section 4, 2nd column: you say that the integration of the tagger results in a speed-up of 3; in single- or multi-tagger mode?
- p. 4, 1st column: you say that errors of the tagger are compensated through the combined evidence of other constraints; how can you control this?
- p. 4, section 5: I think the trend now rather is that a chunker not only brackets a structure but also assigns 'shallow' information to its substructure.
- p. 4/5, section 5: does the 0.0 in the constraint mean absolute certainty?
- p. 5, section 5: again, I would like to see how much the chunker can contribute to the runtime of the overall system (I guess, a lot).
- section 6: I am missing other frameworks which have also tried to integrate additional 'shallow' sources, e.g., LFG and HPSG (see, e.g., the ACL 2002 proceedings: Crysmann et al., Riezler et al.).
The paper describes the integration of tagger and chunker information into a dependency grammar with weighted constraints. The additional information prunes the search space, increasing the probability that the parser finds the optimal parse. What I am missing in the paper is a discussion of how the weights are chosen. Presumably, they could be learnt automatically from training data.

Further comments:

- The formal language used in the constraint examples is probably unknown to most readers.
- Last paragraph of chapter 2: introducing constraint weights turns the CSP into a ... problem, which is ... harder to solve. If all constraints are local (i.e. depend only on a node and its daughters), you could apply the Viterbi algorithm to select the best parse. [A sketch of this suggestion follows the review.]
- Chapter 3: How did you choose the test data? How do you deal with unknown words? Which result does the parser return when it is stopped because of a time-out? How do you choose the weights, and what are the effects of different weighting schemes?
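The reviewer's Viterbi remark is worth spelling out. Here is a minimal sketch of the idea, assuming the weighted constraints reduce to multiplicative unary and pairwise scores over a sequence of decisions; the `unary` and `pairwise` tables are hypothetical stand-ins, not WCDG's actual constraint format.

```python
# Viterbi sketch: if all constraint weights are local, dynamic
# programming finds the best-scoring analysis exactly. Scores are
# multiplicative weights in (0, 1]; 1.0 means "no constraint violated".

def viterbi(candidates, unary, pairwise):
    """candidates: one list of candidate labels per position."""
    # best maps each label at the current position to (score, path so far)
    best = {t: (unary[0].get(t, 1.0), [t]) for t in candidates[0]}
    for i in range(1, len(candidates)):
        new_best = {}
        for t in candidates[i]:
            # choose the best-scoring predecessor for label t
            p = max(best, key=lambda q: best[q][0] * pairwise.get((q, t), 1.0))
            score = best[p][0] * pairwise.get((p, t), 1.0) * unary[i].get(t, 1.0)
            new_best[t] = (score, best[p][1] + [t])
        best = new_best
    return max(best.values(), key=lambda sp: sp[0])
```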
Camera-ready instructions
NB. These instructions are for authors of papers accepted to the EACL-2003 conference. Authors of papers for workshops, please contact the corresponding organizers.
Use \usepackage{times} and \usepackage{latexsym} in the preamble.
Three copies of your final, camera-ready paper must reach us no later than February 15th, 2003. In order to guarantee that your paper is included in the final proceedings, we ask that you take care to ensure that the paper is RECEIVED by February 15th, 2003; the mailing address is given below.

The EACL proceedings will also be available on a CD-ROM. Therefore, in addition to the three hardcopies, you are also required to provide a PDF file of your paper. The authors are responsible for ensuring that suitable fonts are included, when necessary, in preparing the PDF file (see format instructions above). If you are unable to produce a PDF file, please let us know immediately (eacl03@limsi.fr).

The PDF file for the CD-ROM proceedings must also be submitted on or before February 15th, 2003. The submission of the PDF file can be done by (in order of preference):
The mailing address:

EACL-2003 Paper
c/o Patrick Paroubek
LIMSI - CNRS
Bâtiment 508
Université Paris XI
BP 133
91403 ORSAY Cedex
FRANCE

(You can use (33) (0)1 69 85 80 58 if the courier service you are using requires a phone number to be listed.)
Kilian's Feedback from the Conference
Subject: I'm back
Date: Today 08:40:20
From: "Kilian A. Foth"
I have returned from the 10th EACL (read the preface of the
proceedings for full details on the occasionally disputed numbering)
and brought with me the proceedings, which reside
here.
Our own contribution had a friendly hearing. Here is the Q&A session,
as well as I remember it:
QUESTIONER 1: I would like to know, how do you determine the
weights of your different constraints?
ME: Well, the short answer is, We make them up. The long
answer is, We tried optimizing the weights with genetic
algorithms, and that gives about the same performance, but
it takes very long to compute weights that way, so we
stopped doing that.
QUESTIONER 2: I wanted to ask exactly the same thing...
QUESTIONER 3: Have you considered statistical means instead?
ME: You mean, count them from a corpus? Well, so far we had
no big annotated reference corpus, so the question did not
arise. [I actually think we should try this, however.]
QUESTIONER 4: I was a bit surprised about the tagger
performance you cited. I think the authors of TnT report
about 97% correctness...
ME: That's true - we were surprised as well. Could be that
our corpus was too systematically different from the
training corpus - more unknown company names, etc. Anyway,
our goal was not having great tagging performance, but
showing how useful it is even if it isn't great.
SESSION CHAIR: I was intrigued by your saying that you did
genetic experiments. What kind of reference corpus did you
have for that, then?
ME: At the time we were using only 220 sentences from
Verbmobil, and for that we easily created the annotations
ourselves.
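For the record, the genetic-algorithm experiment mentioned in my first answer boils down to the following loop. This is only a sketch; `accuracy()` stands for the expensive fitness oracle (parse the 220 annotated Verbmobil sentences with the given weights and count correct attachments), and the mutation scheme shown is illustrative, not exactly what we ran.

```python
import random

# Sketch of evolving constraint weights. accuracy() is a hypothetical
# fitness oracle; evaluating it means a full parsing run over the
# reference corpus, which is what made this approach so slow.

def evolve(initial_weights, accuracy, generations=50, popsize=20):
    population = [dict(initial_weights) for _ in range(popsize)]
    for _ in range(generations):
        ranked = sorted(population, key=accuracy, reverse=True)
        parents = ranked[:popsize // 2]          # keep the fitter half
        population = list(parents)
        while len(population) < popsize:         # refill by mutation
            child = dict(random.choice(parents))
            key = random.choice(list(child))
            child[key] = min(1.0, max(0.0, child[key] + random.gauss(0, 0.05)))
            population.append(child)
    return max(population, key=accuracy)
```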
Annotations on other talks
Many of the other contributions were thought-provoking. I refer you
in particular to the contributions of Messieurs Schiehlen and Kepser.
Here are some annotations of talks I heard:
Curin, Cmejrek, Havelka: Czech-English Dependency Tree-based Machine Translation
This describes a fully automatic Czech-to-English text translation
system that achieves BLEU scores of up to 0.20 (a human translator
achieves ~0.55). Of particular interest is the use of both `analytic'
(surface-oriented) dependency trees and `tectogrammatical' structure,
which aims to abstract from surface phenomena and appears to be a very
fashionable device at the moment.
James R. Curran, Stephen Clark: Investigating GIS and Smoothing for Maximum Entropy Taggers
The authors apply ME models to POS tagging with both the standard PTB
tagset and `lexical types' (from Combinatory Categorial Grammar). The
problem with ME models is that the expected value required for
determining the model parameters is usually too expensive to
calculate, so that an approximation must be used. This contribution shows
that the popular Generalised Iterative Scaling algorithm often used
for that purpose can be made substantially simpler with no quality
loss. It also compares different methods of smoothing the object
function and recommends a Gaussian prior over the usual simple cutoff.
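The GIS update itself is tiny; all the cost sits in computing the model expectations. A sketch, assuming binary features and a hypothetical `model_expectation(j)` that sums p(y|x)*f_j(x,y) over the training events:

```python
import math

# One Generalised Iterative Scaling step (sketch). C is the constant
# total feature count per event that GIS requires (classically ensured
# by a correction feature); empirical[j] is the observed count of
# feature j in the training data.

def gis_step(weights, empirical, model_expectation, C):
    return {j: w + (1.0 / C) * math.log(empirical[j] / model_expectation(j))
            for j, w in weights.items()}
```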
Markus Dickinson, Detmar Meurers: Detecting Errors in Part-of-Speech Annotation
Even allegedly `gold-standard' POS annotations are known to have many
errors in them. Methods are proposed for detecting some classes of
errors.
More interesting is a reference to work by Kveton & Oliva (2002) who
are reported to have found 2661 errors in the much smaller NEGRA
corpus. We definitely should try to find out whether these
corrections are available.
James Henderson: Neural Network Probability Estimation for Broad Coverage Parsing
This describes yet another statistical parser of English with an
F-measure of 89% on the Penn Treebank. (The authors explicitly say
that part of their achievement is to `add to the diversity of
available broad coverage parsing methods'.) A statistical left-corner
parser is used, with the innovation that the derivation history is not
represented by hand-crafted features; instead a neural network
automatically induces features from the unbounded parse history.
Kehagias, Fragkou, Petridis: Linear Text Segmentation using a Dynamic Programming Algorithm
A simple sentence similarity measure and a dynamic programming
algorithm are used to segment text automatically, i.e. to compute both
the number and the extent of cohesive segments. The most difficult
part seems to be defining a good evaluation measure, since a `near
miss' should be regarded as better than a complete miss.
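The dynamic programme itself is the textbook one. A sketch, where `seg_cost(i, j)` is a hypothetical within-segment dissimilarity for sentences i..j-1; putting a per-segment penalty inside it is what lets the algorithm also choose the number of segments:

```python
# Segmentation by dynamic programming (sketch): choose boundaries that
# minimise the summed cost of all segments.

def segment(n, seg_cost):
    best = [0.0] + [float("inf")] * n    # best[j]: cheapest split of 0..j
    back = [0] * (n + 1)
    for j in range(1, n + 1):
        for i in range(j):
            cost = best[i] + seg_cost(i, j)
            if cost < best[j]:
                best[j], back[j] = cost, i
    cuts, j = [], n
    while j > 0:                         # recover boundaries right to left
        cuts.append(j)
        j = back[j]
    return sorted(cuts)
```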
Stephan Kepser: Finite Structure Query: A Tool for Querying Syntactically Annotated Corpora
We absolutely must have this tool. Please download it from
tcl.sfs.uni-tuebingen.de/fsq immediately. Basically, it is a Java
application that supports full first-order logic for querying tree
banks in the NEGRA export format -- exactly what we need. The
TIGERSearch language does not even allow you to ask, `How many verb
phrases don't have a subject?' This tool does. There is no integrated
tree viewer, though.
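To see why full first-order logic matters here: the verb-phrase question is just counting the instances of a formula with negation under an existential quantifier, roughly (predicate names illustrative, not FSQ's actual syntax):

$\bigl|\{\,x \mid \mathrm{VP}(x) \wedge \neg\exists y\,(\mathrm{dominates}(x,y) \wedge \mathrm{subject}(y))\,\}\bigr|$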
Manios, Nenadic, Spasic, Ananiadou: An Integrated Term-Based Corpus Query System
The authors claim several advances over previous automatic retrieval
systems.
Paola Merlo: Generalised PP-attachment Disambiguation Using Corpus-based Linguistic Diagnostics
The classical experiment in PP attachment poses a somewhat artificial
binary problem for the classifier: distinguishing between verb
attachment and noun attachment. This paper also models the much harder, but
important distinction between adjunct PPs and argument PPs (like our
`PP' vs. `OBJP'). A surprising number of linguistic indicators of
`argumenthood' can be directly detected in a large corpus, e.g.
optionality, repeatability, etc.
The direct four-way classification works better than
successive binary classification, which suggests that the current
formulation of the task is not adequate for useful application (my
interpretation). The four-way distinction is made with an accuracy of
74%.
Peng, Schuurmans, Keselj, Wang: Language Independent Authorship Attribution with Character Level N-Grams
Authorship attribution is a highly speculative and somewhat doubtful
branch of CL, since where it really would count, i.e. in court, it
usually deals with specimens where the author is actively working
against the experimenter. However, good successes have been achieved
`in the lab', i.e. on samples of different authors' normal writing.
Usually, particular linguistic style markers are extracted from a disputed
text and then used for some classification algorithm, but the choice of
features is somewhat of a black art. This paper proposes a simple
n-gram similarity model on the level of characters and shows that it
works well on undisputed texts.
(So far, all related work was on the level of words, but for
ideographic languages the tokenization of running text is already a
very difficult problem. Thus, when the authors claim to be
language-independent, they really mean `it also works for Chinese'.)
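The technique is simple enough to sketch in a few lines. The choices of n, profile size, and dissimilarity measure below are illustrative, not necessarily the paper's exact settings:

```python
from collections import Counter

# Character n-gram authorship sketch: build a frequency profile of the
# most common n-grams per author and attribute a disputed text to the
# author with the least dissimilar profile.

def profile(text, n=3, size=500):
    grams = Counter(text[i:i + n] for i in range(len(text) - n + 1))
    total = sum(grams.values())
    return {g: c / total for g, c in grams.most_common(size)}

def dissimilarity(p, q):
    # normalised squared differences over the union of both profiles
    return sum(((p.get(g, 0.0) - q.get(g, 0.0)) /
                ((p.get(g, 0.0) + q.get(g, 0.0)) / 2)) ** 2
               for g in set(p) | set(q))

def attribute(disputed, author_texts):
    dp = profile(disputed)
    return min(author_texts,
               key=lambda a: dissimilarity(dp, profile(author_texts[a])))
```

Note that nothing here is language-specific: the same code runs unchanged on Chinese text, which is the point.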
Gerald Penn: AVM Description Compilation using Types as Modes
This entire paper is about techniques for compiling efficient code
from descriptions in typed feature logic (e.g. for HPSG). There is an
ongoing struggle between proponents of readable or appropriate
formalism elements and detractors who say that they cannot possibly
justify the extra cost, and that you should never ever use, say,
disjunctive feature structures.
Much more informative than the written version was the talk itself,
and yet more informative was the exchange afterwards during question
time, which went about as follows:
LISTENER: You've talked about how to compile typed feature
structures into more efficient code, but isn't the typing
itself the source of much of the inefficiency you are
battling? Now, I work in lexical-functional grammar...
PENN: Yes, I know...
LISTENER: ...and there you simply don't have these problems.
PENN: That's true, but I consider the enormous advantage in
understandability to amply outweigh that disadvantage. If
you want to build a realistic grammar of a whole language
you absolutely need to make it as easy for the grammar
writer as possible...
LISTENER: But there are two HPSG grammars that aim to model
all of English, and one LFG.
PENN: Yes, I know, and it's kind of a thorn in my side that
such things can exist at all... but the real question is, Do
you want only me and you and Dan Flickinger to be able to
write a real-life grammar, or any well-trained computational
linguist?
Gerald Penn & Mohammed Haji-Abdolhosseini: Topological Parsing
Topological parsing is another very active item at the moment. Penn
himself says it best in his abstract: "...these grammars contain two
primitive notions of constituency that are used to preserve the
semantic or interpretational aspects of phrase structure while at the
same time providing a more efficient backbone for parsing based on
word-order and contiguity constraints."
The contrast is with normal CFG, where constituency and order must be
expressed simultaneously, which is not appropriate when the word order
is freer than in English. (One wonders how much damage has been done
to computational linguistics by the simple fact that Chomsky was
American.)
Here a particular formalization of tectogrammatical vs.
phenogrammatical structure is proposed, including a special parsing
algorithm that proceeds top-down as well as bottom-up.
Judita Preiss: Using Grammatical Relations to Compare Parsers
It is argued that comparing different parsers is difficult because
their output trees may look very different even for the same sentence.
Instead, it is measured how well they find particular grammatical
relations (subj, dobj, etc.) in the Suzanne corpus. Following Lin
(1995), all parser output is mapped into (GR/head/dependent) triples
that are exactly equivalent to labelled dependency edges, although
that term is not used. Surprisingly, Buchholz's shallow GR finder
does better than the state-of-the-art statistical parsers
(Carroll, Charniak, Collins etc.). When using this information for
anaphora resolution, the gap narrows quite a bit, apparently because
the error is more often with the resolution algorithm than with the
parsed input.
Karl-Michael Schneider: A Comparison of Event Models for Naive Bayes Anti-Spam E-Mail Filtering
This is nothing more than the testing of two Bayesian spam filters: a
multivariate event model and a multinomial event model. Actually, this
only means that you either distinguish the presence or absence of each
word in a message, or the exact number of occurrences, and the latter
method is very slightly more effective.
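The difference is easiest to see in code. A sketch of the two per-class log-scores, assuming the per-word probabilities `p_word` have already been estimated (with smoothing) for the class in question; both functions are illustrative, not the paper's implementation:

```python
import math
from collections import Counter

# Multivariate (Bernoulli) model: every vocabulary word is an event and
# contributes, whether it occurs in the message or not.
def multivariate_score(msg_words, vocab, p_word):
    present = set(msg_words)
    return sum(math.log(p_word[w]) if w in present else math.log(1.0 - p_word[w])
               for w in vocab)

# Multinomial model: each token occurrence is an event, so a word
# contributes once per occurrence and absent words contribute nothing.
def multinomial_score(msg_tokens, p_word):
    return sum(n * math.log(p_word[w])
               for w, n in Counter(msg_tokens).items() if w in p_word)
```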
Sabine Schulte im Walde: Experiments on the Choice of Features for Learning Verb Classes
This work tries to cluster German verbs into semantic groups
automatically by three different types of features: syntactic frames,
preposition preference, and selectional preference. GermaNet's top
semantic classes are used for the last method. Although some plausible
groupings can be achieved, the final result can only serve as the
basis of manual classification. It is hypothesized that the
idiosyncratic features of some verbs impose a limit on the usefulness
of any set of features for automatic classification.
Steedman, Sarkar, Osborne, Hwa, Clark, Hockenmaier, Ruhlen, Baker, Crim: Bootstrapping statistical parsers from small datasets
The idea of co-training is investigated, where two statistical parsers
are used to generate additional training material for each other. This
method achieves better results than self-training, where the parser's
own output structures are used as further training material. However,
the output from a rival parser is still an order of magnitude less
useful for training than a proper treebank, i.e. 100 `golden' trees
help training more than 1000 co-trained ones.
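Schematically, one co-training round works as follows; the parser interface (`parse`, `confidence`, `train_on`) is hypothetical, not the paper's actual machinery:

```python
# One co-training round (sketch): each parser labels raw sentences for
# the other, which then retrains on the parses the producer was most
# confident about.

def cotrain_round(parser_a, parser_b, raw_sentences, k=100):
    for src, dst in ((parser_a, parser_b), (parser_b, parser_a)):
        labelled = [(s, src.parse(s)) for s in raw_sentences]
        labelled.sort(key=lambda sp: src.confidence(sp[1]), reverse=True)
        dst.train_on(labelled[:k])
```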
Research notes
John Barnden, Sheila Glasbey, Mark Lee, Alan Wallington: Domain-transcending mappings in a system for metaphorical reasoning
Metaphor involves the mapping of a set of concepts from a source
domain to a target domain. Example: "We're driving in the fast lane on
the freeway of love." This common metaphor establishes the mappings
love->journey, lovers->travellers, relationship->vehicle. But what
does the excitement of being in the fast lane map to? The authors
argue that it maps to itself as a view-neutral mapping adjunct,
namely that of mental state. They discuss various of these VNMAs and
implement some of them in a system for defeasible metaphorical reasoning.
Luisa Bentivogli, Emanuele Pianta: Beyond Lexical Units: Enriching WordNets with Phrasets
WordNet can only represent lexical items (car), idiomatic expressions
(kick the bucket) and restricted collocations (criminal record) in its
synsets. Free combinations (fast car) are not stored because their
meanings are purely compositional.
But many of these combinations are very frequent and very helpful e.g.
for word-sense disambiguation (in `fast car', `car' can only mean
`automobile', not `railway carriage'). Also, lexical gaps (there is no
Italian word for `paperboy') are frequently filled with such recurrent
free phrases. Therefore the authors augment the WordNet formalism with
phrasets to store them.
Beate Dorow, Dominic Widdows: Discovering Corpus-Specific Word Senses
Conjunctions of nouns in free text are used to group words into
clusters with similar meanings. The employed algorithm is able to
detect that `mouse' is ambiguous between `rodent' and `input device'
and correctly assigns both meanings to the correct WordNet superclass.
Toshiaki Fujiki, Hidetsugu Nanba, Manabu Okumura: Automatic Acquisition of Script Knowledge from a Text Collection
Scripts are the descriptions of typical event sequences in a
particular domain (e.g., arrive--choose--receive--eat--pay--leave).
Here, script knowledge is extracted from newscasts by noting
successive sentences, coordinated phrases, or subclauses with the same
subject and object. The authors extract sequences such as
find--arrest--prosecute. Although the paper does not mention this,
the accuracy is really only about 50%.
Birte Lönneker, Primo Jakopin: Contents and evaluation of the first Slovenian-German online dictionary
This is simply the contents of the textbook that Birte learnt Slovenian
from, annotated and digitized. Altogether about 70% of the more
frequent tokens in typical newspaper text are covered.
James McCracken, Adam Kilgarriff: Oxford Dictionary of English - current developments
"In order for a non-formalised, natural-language dictionary like ODE
to become properly accessible to computational processing, the
dictionary content must be positioned within a formalism which
explicitly enumerates and classifies all the information that the
dictionary content itself merely assumes, implies, or refers to." The
Oxford dictionary is the one that professional writers use to get the
whole story, and then some. In preparation for making their definitions
available to selected CL projects, the editors are adding
supplementary data such as explicit morphology, collocations and domain
labels semi-automatically.
Horacio Saggion, Katerina Pastra, Yorick Wilks: NLP for Indexing and Retrieval of Captioned Photographs
The criminal profiler, hero of much current TV fare, is usually
represented as sitting in an office, patiently browsing through album
after album of evidence from crime scenes, searching for patterns that
will allow him to connect past and present cases, and unfortunately,
this picture is largely accurate. This work aims to ease his job by
parsing and analysing the handwritten photo captions, and adding
limited inference capability.
Michael Schiehlen: A Cascaded Finite-State Parser for German
This disturbing work describes a pure FST parser that achieves much the
same results on the NEGRA corpus as we do.
Zdenek Zabokrtsky, Otakar Smrz: Arabic Syntactic Trees: from Constituency to Dependency
The first large-scale dependency tree bank of Arabic is being created,
largely by converting existing phrase structure annotations. The
authors describe work in progress that achieves about 60% structural
correctness.
Related Topics: ChunkerExperiments, TreeChunkerReport, HeiseGrammar,
http://nats-www.informatik.uni-hamburg.de/intern/proceedings/2003/EACL
--
MichaelDaum - 01 Nov 2002