Q: Is lemmatization used in current NLP tasks (since it is a bit complex for morphologically complex languages like Amharic)? What if we use the word as it is and represent it in a word embedding? (MW)
A: Sometimes it is used, other times not. It would be a good question to find out whether it is better for a morphologically rich language
- to first lemmatize and then train an embedding on the lemmatization results or
- to train the embeddings on the raw words.
Of course, the expected result strongly depends on the quality of the lemmatizer. Since many relationships that can be found in embeddings (like singular-plural: "house-houses" or predicate-agent: "drive-driver") are usually expressed by means of morphological variants, lemmatization would probably result in a significant loss of information.
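To make the comparison concrete, here is a minimal sketch of the two alternatives. It assumes the gensim library, and the lemmatize() function is only a hypothetical placeholder for whatever lemmatizer is available for the language at hand:
<verbatim>
# Minimal sketch: train two embedding models, one on raw tokens, one on
# lemmatized tokens, and compare what happens to inflected forms.
from gensim.models import Word2Vec

def lemmatize(token):
    # placeholder: a real lemmatizer would go here
    return token.lower().rstrip('s')   # crude toy rule for English plurals

raw_corpus = [["the", "houses", "near", "the", "river"],
              ["a", "house", "by", "the", "sea"]]
lemma_corpus = [[lemmatize(t) for t in sent] for sent in raw_corpus]

raw_model = Word2Vec(raw_corpus, vector_size=50, min_count=1, epochs=50)
lemma_model = Word2Vec(lemma_corpus, vector_size=50, min_count=1, epochs=50)

# With lemmatization, "house" and "houses" collapse into a single vector,
# so the singular-plural relation is no longer visible in the embedding space.
</verbatim>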
Q: Is it possible to tokenize a text written in a script other than English, such as the Ethiopian Ge'ez script? (DM)
A: Yes, that should be possible.
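For scripts covered by Unicode (as the Ethiopic script is), Python's Unicode-aware regular expressions already go a long way. A minimal sketch, assuming whitespace and Ethiopic punctuation act as separators:
<verbatim>
# Minimal sketch: tokenizing Ethiopic text with a Unicode-aware regex.
import re

text = "ሰላም ለዓለም። እንኳን ደህና መጡ።"
# \w matches Ethiopic letters as well, since they are Unicode word characters;
# Ethiopic punctuation such as ። and ፣ is kept as separate tokens.
tokens = re.findall(r"\w+|[።፣፤]", text)
print(tokens)   # ['ሰላም', 'ለዓለም', '።', 'እንኳን', 'ደህና', 'መጡ', '።']
</verbatim>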
Q: The book mentions that there are exceptional cases that make tokenization difficult, like the words "New York", "Rock and roll" and so on. In English the word separator is usually whitespace, so is there any method to handle such cases? (YM)
A: I am not aware of any other possibility than to consult a list of the most frequent cases.
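A minimal sketch of that list-based approach, with a toy lexicon of multiword expressions as an assumption:
<verbatim>
# Minimal sketch: merging known multiword expressions after whitespace
# tokenization, using a small lookup list of the most frequent cases.
MWE_LIST = [("new", "york"), ("rock", "and", "roll")]   # assumed toy lexicon

def merge_mwes(tokens, mwe_list):
    tokens = [t.lower() for t in tokens]
    result, i = [], 0
    while i < len(tokens):
        for mwe in sorted(mwe_list, key=len, reverse=True):  # longest match first
            if tuple(tokens[i:i + len(mwe)]) == mwe:
                result.append("_".join(mwe))
                i += len(mwe)
                break
        else:
            result.append(tokens[i])
            i += 1
    return result

print(merge_mwes("They played rock and roll in New York".split(), MWE_LIST))
# -> ['they', 'played', 'rock_and_roll', 'in', 'new_york']
</verbatim>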
Q: In normalization, when we remove unnecessary characters, including special characters and quotation marks, some of these removals cause meaning changes. For example, "friend's" shows possession, but when we remove the apostrophe it changes to the plural form. How can we handle such cases? (YM)
A: I'd treat the possession marker as a suffix of the preceding stem. But note that the same encoding is also used for abbreviating pronoun-auxiliary pairs, as in the initial "I'd" or in "I'm", "he's" etc.
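A minimal sketch of such clitic splitting with a handful of regex rules (the rule set is a toy assumption; deciding whether 's marks possession or "is" would need further context, e.g. a POS tagger):
<verbatim>
# Minimal sketch: splitting off possessive markers and contracted
# pronoun-auxiliary pairs as separate tokens.
import re

CLITICS = re.compile(r"(?:'s|'d|'m|'re|'ve|'ll|n't)$", re.IGNORECASE)

def split_clitics(token):
    m = CLITICS.search(token)
    if m:
        return [token[:m.start()], token[m.start():]]
    return [token]

for w in ["friend's", "I'd", "I'm", "he's", "don't"]:
    print(w, "->", split_clitics(w))
# friend's -> ['friend', "'s"], I'd -> ['I', "'d"], don't -> ['do', "n't"], ...
</verbatim>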
Q: Why do we learn regular expressions, text normalization and edit distance for applications of NLP? (Tadesse)
A: Because they provide important fundamental concepts about the topics of Finite state automata, non-determinism and search. Moreover, regular expressions are a highly versatile tool to automate text-level editing for data preparation in general and text normalization in particular.
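A minimal sketch of regex-based normalization during data preparation (the concrete patterns and placeholders are toy assumptions):
<verbatim>
# Minimal sketch: a few regular-expression substitutions for text normalization.
import re

def normalize(text):
    text = text.lower()                             # case folding
    text = re.sub(r"https?://\S+", "<URL>", text)   # replace URLs by a placeholder
    text = re.sub(r"\d+([.,]\d+)?", "<NUM>", text)  # collapse numbers
    text = re.sub(r"\s+", " ", text).strip()        # squeeze whitespace
    return text

print(normalize("Visit https://example.org  --  it costs 12.50 Euro!"))
# -> 'visit <URL> -- it costs <NUM> euro!'
</verbatim>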
Q: What is Dynamic Programming? (LE)
Q: How can we measure the similarity (alignment) between two strings (coreferents) using minimum edit distance if the words in the strings may themselves differ by some edit distance? Do we use a nested minimum edit distance? (MW)
A: Yes.
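A minimal sketch of such a nested minimum edit distance: the outer dynamic-programming table aligns the two strings word by word, and the substitution cost of a word pair is the (length-normalized) character-level edit distance between the two words. The cost functions are assumptions made for illustration:
<verbatim>
# Minimal sketch: nested minimum edit distance via dynamic programming.
def edit_distance(a, b, sub_cost=lambda x, y: 0 if x == y else 1,
                  ins_del_cost=lambda x: 1):
    # classic DP table; works for sequences of characters or of words
    m, n = len(a), len(b)
    d = [[0.0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        d[i][0] = d[i - 1][0] + ins_del_cost(a[i - 1])
    for j in range(1, n + 1):
        d[0][j] = d[0][j - 1] + ins_del_cost(b[j - 1])
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            d[i][j] = min(d[i - 1][j] + ins_del_cost(a[i - 1]),            # deletion
                          d[i][j - 1] + ins_del_cost(b[j - 1]),            # insertion
                          d[i - 1][j - 1] + sub_cost(a[i - 1], b[j - 1]))  # substitution
    return d[m][n]

def word_sub_cost(w1, w2):
    # inner edit distance, normalized to [0, 1] so it is comparable to ins/del costs
    return edit_distance(w1, w2) / max(len(w1), len(w2), 1)

s1 = "the colour of the house".split()
s2 = "the color of that house".split()
print(edit_distance(s1, s2, sub_cost=word_sub_cost))
# -> about 0.67 (colour/color contributes 1/6, the/that contributes 2/4)
</verbatim>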
Q: Code switching within a single communication has become common in the world, especially English mixed with other languages. It has side effects on NLP tasks like word processing, etc. So, is there any NLP method to convert such speech into one form? (YM)
A: I have never heard about a principled solution to the problem of code switching, but I would look out for potentially useful approaches in the area of multilingual machine translation.
Q: How does a morphological parser handle morphologically complex languages, for example, a word that has an infix in it, subject-verb agreement, ...? <literal>ስብርባሪዎችሽን</literal> (your broken pieces) <literal> ስብር-ባሪ - ኦች-ሽ-ን፣ አሰባበር</literal> (YM)
A: Infixation can be dealt with by a proxy solution: concatenating the infix with the preceding and following morphs. Finite state machines are able to do so without any additional mechanisms. More interesting is the question of how to deal with the root-pattern morphology of Semitic languages. It has been shown that two-level morphology (which is based on finite state transducers) is able to cope with this problem.
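The root-pattern case needs the full power of two-level transducers, but the basic idea of a finite-state model of (concatenative) morphology can already be illustrated with a toy sketch; the morphs, classes and transitions below are simplified assumptions for English, not a model of Amharic:
<verbatim>
# Minimal toy sketch of a finite-state approach to concatenative morphology:
# states encode which morph class may come next, the lexicon lists the morphs.
LEXICON = {
    "stem":   ["house", "drive"],
    "plural": ["s"],
    "agent":  ["r"],
}
# transition table: state -> [(morph_class, next_state), ...]
TRANSITIONS = {
    "start":      [("stem", "after_stem")],
    "after_stem": [("plural", "end"), ("agent", "end")],
}
FINAL_STATES = {"after_stem", "end"}

def segment(word, state="start"):
    """Return one possible segmentation of word into morphs, or None."""
    if word == "" and state in FINAL_STATES:
        return []
    for morph_class, next_state in TRANSITIONS.get(state, []):
        for morph in LEXICON[morph_class]:
            if word.startswith(morph):
                rest = segment(word[len(morph):], next_state)
                if rest is not None:
                    return [(morph, morph_class)] + rest
    return None

print(segment("houses"))   # [('house', 'stem'), ('s', 'plural')]
print(segment("driver"))   # [('drive', 'stem'), ('r', 'agent')]
</verbatim>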
Q: Is fine-tuning possible for edit distance? It does not work as well for Amharic as for English. For example, “<literal>አበላ</literal>” and “<literal>አበላሸ</literal>” are Amharic words with completely different meanings, yet their edit distance is only 1, which does not reflect how far apart their meanings are. Is edit distance efficient for all languages? (YM)
A: There is no systematic relationship between the pronunciation/spelling and the meaning of a word, neither in Amharic nor in English or any other language of the world. Therefore, the edit distance does not tell us anything about the similarity of meanings (see e.g. the minimal pairs cut-car, winter-winder, home-dome, ...)
Q: What is byte-pair encoding and how does it work? (MW)
A: It is a method to break down the character sequences of the words in a training corpus into a set of possible substrings, guided by their frequency of occurrence. These substrings are then used in turn to tokenize test data in a greedy manner. This breaks unknown words down into pieces that are as large as possible, to be used as elementary units for any kind of follow-up NLP task, while the known words (already seen during training) are left unaffected. A very good explanation of what byte-pair encoding means in NLP is given in the book.
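A minimal sketch of BPE training on a small toy vocabulary (the end-of-word marker and the number of merges are assumptions; real implementations differ in detail):
<verbatim>
# Minimal sketch of byte-pair encoding training: repeatedly merge the most
# frequent adjacent symbol pair in the corpus vocabulary.
from collections import Counter

def get_pair_counts(vocab):
    pairs = Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(pair, vocab):
    a, b = pair
    return {word.replace(f"{a} {b}", f"{a}{b}"): freq for word, freq in vocab.items()}

# toy training corpus: word -> frequency, characters separated by spaces,
# with "_" as an end-of-word marker (conventions vary between implementations)
vocab = {"l o w _": 5, "l o w e r _": 2, "n e w e s t _": 6, "w i d e s t _": 3}
merges = []
for _ in range(10):                      # the number of merges is a hyperparameter
    pairs = get_pair_counts(vocab)
    best = max(pairs, key=pairs.get)
    vocab = merge_pair(best, vocab)
    merges.append(best)
print(merges[:5])   # e.g. [('e', 's'), ('es', 't'), ('est', '_'), ('l', 'o'), ('lo', 'w')]
</verbatim>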
--
WolfgangMenzel - 22 Feb 2023