Preparatory assignments
Natural Language Processing
Shared tasks have evolved into a universal means to boost scientific progress in Computational Linguistics.
The idea goes back to the positive experience with early competitions in areas like Speech Recognition or
Information Retrieval, which started around 1990. Nowadays, a wealth of different tasks has been defined;
some of them were used only once, while others have reached the status of a benchmark test and continue to be used
as a generally accepted evaluation standard.
Familiarize yourself with a couple of past or current shared tasks. Select one of them which has not been chosen
by another participant of our course. Try to answer as many of the following questions as possible. These findings will form
the basis of an essay you are expected to submit later as a written exam for this course. Prepare a short talk
of approximately 20 minutes to inform the other course participants about the most relevant aspects of "your" task.
Hint: Choose a closed task that has already been used in a competition, because then you will be able to find
papers describing the different approaches adopted by the different authors as well as a comparative evaluation
indicating the most successful ones.
- What's the task?
- Which linguistic areas are addressed? (lexicon, morphology, syntax, semantics, phonetics, ...)
- What's the processing problem? (classification, regression, prediction, ...)
- What are possible applications of the expected findings?
- Why is this task interesting to you?
- Which data/resources can or should be used?
- Which data/resources are (publicly) available?
- Which languages are considered?
- Is one of the Ethiopian languages among them?
- Can the task be applied or adapted to one or several Ethiopian languages?
- What would be necessary to adapt the task? Is this feasible?
- How difficult would it be to participate?
- What kind of expertise will be necessary?
- How laborious would participation be (in terms of person-months)?
- Which kind of infrastructure would be necessary (hardware and software)?
Think about additional questions that could be asked about the task you selected.
Recently conducted shared tasks can be found e.g. on the following web pages. Most of them are carried out annually
and the pages contain links to the previous iterations. Search the web for more examples of shared tasks in NLP.
Automata and Grammars
- Install one of the available toolkits for working with Finite State Transducers (FST) on your computer.
- Recall basic notions of
- Find out the relationships between Finite state machines, regular expressions, regular grammars and regular languages
- Many cultures around the world structure their songs and instrumentals as a playful call-and-response dialogue, which is particularly characteristic of the music of Africa.
- Implement a finite state machine for valid sequences of calls (C) and responses (R). Display your solution as a state transition table, a state transition diagram, a regular expression and a regular grammar.
- Is the language accepted by your machine a finite one? Is your automaton deterministic?
- The blues pattern is a very influential traditional system of chord progression in popular music which comes in various flavours, e.g. as one of the many variants of an 8-bar, 12-bar or 16-bar blues pattern.
- Implement one of the possible chord sequences as a finite state machine, for instance my favorite: C C C C | F F C C | G G F F | C C C G | ...
- Is the language accepted by your automaton a finite one? Is your automaton deterministic?
- Implement an automaton that accepts all personal pronouns of your mother tongue, e.g. as given here for Amharic.
- Is the language accepted by your automaton a finite one? Is your automaton deterministic?
- Is the automaton sound, i.e. does it accept only valid personal pronouns of your language? Is it complete, i.e. does it accept all the personal pronouns of your language? If not, count the number of false positives and/or false negatives.
- How many states does your automaton have? Is there still room for minimization?
- Recall basic notions of Finite State Transducers (FST)
- Similarities and differences between FSMs and FSTs
- Moore vs. Mealy automata
- Operations on FSTs (in particular: union, concatenation, composition)
- Find an example of a character sequence which can be better modelled with a non-deterministic automaton than with a deterministic one. Do such cases occur in your language?
- Find an example of a character sequence which cannot be modelled with a Finite State Machine. Do such cases occur in your language?
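As a starting point for the call-and-response exercise above, the following is a minimal sketch in Python, not a reference solution. It assumes that a valid sequence consists of one or more calls, each immediately followed by exactly one response, i.e. the regular expression (CR)+; other definitions of validity are equally possible. The state names and the helper `accepts` are hypothetical and chosen for illustration only.

```python
# Sketch of a finite state machine for call-and-response sequences.
# ASSUMPTION: a valid sequence is one or more "CR" pairs, i.e. (CR)+.
# State names and the function name are illustrative, not prescribed.

START, CALLED, ANSWERED = "q0", "q1", "q2"

# State transition table: (current state, input symbol) -> next state.
# A missing entry means the input is rejected (implicit dead state).
TRANSITIONS = {
    (START, "C"): CALLED,
    (CALLED, "R"): ANSWERED,
    (ANSWERED, "C"): CALLED,
}

ACCEPTING = {ANSWERED}


def accepts(sequence: str) -> bool:
    """Return True if the machine accepts the given string of C/R symbols."""
    state = START
    for symbol in sequence:
        state = TRANSITIONS.get((state, symbol))
        if state is None:  # no transition defined: reject
            return False
    return state in ACCEPTING


if __name__ == "__main__":
    for candidate in ["CR", "CRCR", "", "C", "RC", "CRC", "CCR"]:
        print(repr(candidate), accepts(candidate))
```

Storing the transition table as a dictionary keeps the code close to the tabular form asked for in the exercise. The same pattern can be reused for the blues and pronoun automata by exchanging the table and the set of accepting states.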
Probability
- Recall the notion of discrete probability and discrete probability distribution. Estimate the probability distribution for the characters in a short text in your mother tongue. How many zero probabilities did you encounter?
- Recall the notion of conditional probability and conditional probability distribution. Estimate a small part of the conditional probability distribution for the characters in your text, given the immediately preceding character. Give some examples of zero probabilities. Distinguish between successive character pairs that are impossible in your language and others which are possible but do not occur in your text.
- Calculate from the above estimations the probability of the occurrence of a two-character sequence. Which assumptions do you have to make, in order to be able to do so?
- Calculate from the above estimations the probability of the occurrence of a three-character sequence. Which assumptions do you have to make, in order to be able to do so?
- Can you compute the probability of a word based solely on the above estimations?
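The estimation steps above can be sketched in Python as follows. This is only one possible approach: it uses plain relative frequencies (maximum-likelihood estimates without smoothing) and, for longer sequences, assumes that each character depends only on its immediate predecessor. The function names and the sample string "abracadabra" are hypothetical placeholders; replace the sample with a text in your own language.

```python
from collections import Counter


def unigram_probs(text):
    """Relative-frequency estimate P(c) = count(c) / N for each character c."""
    counts = Counter(text)
    total = sum(counts.values())
    return {c: n / total for c, n in counts.items()}


def bigram_probs(text):
    """Estimate P(b | a) = count(ab) / count(a as a left context)."""
    pair_counts = Counter(zip(text, text[1:]))
    context_counts = Counter(text[:-1])  # the last character has no successor
    return {(a, b): n / context_counts[a] for (a, b), n in pair_counts.items()}


def sequence_prob(seq, uni, bi):
    """P(c1..cn) approximated as P(c1) * product of P(ci | ci-1).

    ASSUMPTION: first-order Markov model; unseen events get probability 0.
    """
    if not seq:
        return 1.0
    p = uni.get(seq[0], 0.0)
    for a, b in zip(seq, seq[1:]):
        p *= bi.get((a, b), 0.0)
    return p


if __name__ == "__main__":
    sample = "abracadabra"  # placeholder text, not real language data
    uni = unigram_probs(sample)
    bi = bigram_probs(sample)
    print(uni)
    print(sequence_prob("ab", uni, bi))
    print(sequence_prob("abr", uni, bi))
    print(sequence_prob("ba", uni, bi))  # "ba" never occurs in the sample
```

Because no smoothing is applied, every character pair missing from the text receives probability zero, whether or not it is possible in the language. This is exactly the distinction the exercise asks you to examine.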
--
WolfgangMenzel - 17 Jul 2018