Q: What are the benefits of using vector semantics for NLP? [AH]
A: Word embeddings are the ideal form of input to neural networks. In comparison to one-hot vectors, they capture important relationships (meaning similarity, meaning relatedness, similar syntactic behavior) which can be determined by numerical computations (e.g. cosine similarity). Moreover, vectors for complex linguistic constructions can be composed from their elements.
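A minimal sketch (plain NumPy, with toy vectors invented for illustration) of why dense embeddings carry more information than one-hot vectors: the cosine of any two distinct one-hot vectors is always 0, while dense vectors can express graded similarity.

```python
import numpy as np

def cosine(a, b):
    # cosine similarity: dot product of the two normalized vectors
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# one-hot vectors: every pair of distinct words looks equally unrelated
cat_1hot, dog_1hot = np.eye(5)[0], np.eye(5)[1]
print(cosine(cat_1hot, dog_1hot))          # 0.0

# dense (toy) embeddings: similarity is graded
cat  = np.array([0.7, 0.3, 0.1])
dog  = np.array([0.6, 0.4, 0.0])
bank = np.array([-0.2, 0.1, 0.9])
print(cosine(cat, dog))                    # close to 1
print(cosine(cat, bank))                   # much smaller
```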
Q: Since we have dynamic or contextualized embeddings, do we really need static embeddings like word2vec these days? [DH]
A: Contextualized embeddings are more expensive to train. Since static embeddings are cheaper to train, they can make use of larger amounts of training data.
Q: How to embed phrasal verbs in Word2vec? [MW]
A: In principle you could treat phrasal verbs as single multi-word expressions. But then you need to be able to detect them with a high degree of certainty. To simplify the procedure, the components of a phrasal verb are usually considered separate words, and capturing their relationship is left to the learning procedure for the embeddings.
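A small sketch of the first option, treating detected phrasal verbs as single multi-word tokens before training word2vec. The list of phrasal verbs and the example sentence are hypothetical; in practice the detection step is the hard part (tools such as gensim's Phrases collocation detector are often used for it).

```python
# Hypothetical preprocessing sketch: merge known phrasal verbs into single
# tokens so that word2vec learns one vector per expression.
PHRASAL_VERBS = {("give", "up"), ("look", "after"), ("turn", "down")}  # invented list

def merge_phrasal_verbs(tokens):
    merged, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) in PHRASAL_VERBS:
            merged.append(tokens[i] + "_" + tokens[i + 1])   # e.g. "give_up"
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged

print(merge_phrasal_verbs("she did not give up easily".split()))
# ['she', 'did', 'not', 'give_up', 'easily']
```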
Q: How is word sense disambiguation handled in embeddings? [MW]
A: Embeddings like word2vec, FastText, or GloVe are ignorant of the ambiguity of words. In contrast, vector representations like BERT, or others based on graph clustering (Chris Biemann (2006): Chinese Whispers - an Efficient Graph Clustering Algorithm and its Application to Natural Language Processing Problems. Proceedings of TextGraphs: the First Workshop on Graph Based Methods for Natural Language Processing, pp. 73-80), are able to distinguish between different readings.
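A sketch of the contextualized case (it assumes the HuggingFace transformers and torch packages and the publicly available bert-base-uncased model; sentences are invented): the vectors produced for the ambiguous word "bank" differ depending on its reading.

```python
import torch
from transformers import AutoModel, AutoTokenizer

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

def vector_of(sentence, word):
    # return the contextual vector of `word` (assumed to be a single word piece)
    enc = tok(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**enc).last_hidden_state[0]          # (tokens, 768)
    tokens = tok.convert_ids_to_tokens(enc["input_ids"][0].tolist())
    return hidden[tokens.index(word)]

money  = vector_of("she deposited the money at the bank", "bank")
money2 = vector_of("the bank raised its interest rates", "bank")
river  = vector_of("they had a picnic on the bank of the river", "bank")

cos = torch.nn.functional.cosine_similarity
print(cos(money, money2, dim=0))   # same reading: relatively high
print(cos(money, river, dim=0))    # different readings: lower
```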
Q: In word similarity, examples like cat and dog are given. How can we say these words are similar? Is it because of their properties? For example, man and stone share at least one or two properties like mass and color. [YM]
A: Measuring the semantic similarity between two concepts (not words!) by means of sets of semantic features and computing the overlap of these sets is an old idea in computational linguistics. As far as I remember, it is highly subjective, because its results depend very much on the individual choice of possible semantic features and on the availability of large dictionaries which assign these features to the concepts.
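A toy sketch of this feature-overlap idea. The feature sets below are invented for illustration; a real inventory would come from a large, hand-built dictionary, which is exactly where the subjectivity enters.

```python
# invented semantic feature sets for a few concepts
FEATURES = {
    "cat":   {"animate", "has_fur", "has_mass", "domestic"},
    "dog":   {"animate", "has_fur", "has_mass", "domestic"},
    "stone": {"inanimate", "has_mass", "has_color"},
    "man":   {"animate", "human", "has_mass", "has_color"},
}

def jaccard(a, b):
    # overlap of the two feature sets, normalized by the size of their union
    return len(FEATURES[a] & FEATURES[b]) / len(FEATURES[a] | FEATURES[b])

print(jaccard("cat", "dog"))     # 1.0 -> maximally similar under these features
print(jaccard("man", "stone"))   # 0.4 -> small but non-zero (shared: has_mass, has_color)
```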
Q: Does PMI (pointwise mutual information) have any similarity to n-gram models? [YM]
A: No. n-gram models capture the conditional probability of a word given the preceding words in a text, while PMI measures how much more often two words occur together in some window (document, paragraph, sentence, n-gram) than would be expected if they were independent. While the former is a directed dependency (the current word is conditioned on the appearance of the words in the left context), the latter is a symmetric measure.
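A short sketch of the difference with toy counts (all numbers invented): the bigram model yields a directed conditional probability, PMI compares the observed co-occurrence probability with what independence would predict.

```python
import math

N = 10_000                              # number of word positions / windows (toy value)
count = {"new": 300, "york": 120}       # unigram counts
count_pair = {("new", "york"): 80}      # co-occurrences within a window

def pmi(w1, w2):
    # PMI treats the pair symmetrically: observed joint probability
    # divided by the product of the marginals
    p1, p2 = count[w1] / N, count[w2] / N
    p12 = count_pair[(w1, w2)] / N
    return math.log2(p12 / (p1 * p2))

def bigram_cond_prob(w1, w2):
    # n-gram view: probability of w2 given that w1 is the previous word (directed)
    return count_pair[(w1, w2)] / count[w1]

print(pmi("new", "york"))
print(bigram_cond_prob("new", "york"))
```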
Q: Skip-gram predicts the surrounding words using the center word. Is it effective when we compare it with n-gram models? [YM]
A: If you would like to compare them, use the complementary case to skip-gram, namely the (continuous) bag-of-words architecture, and choose an asymmetric window with the focus word in the rightmost position. But you need to consider that word2vec is able to deal with much larger window sizes than probabilistic n-gram models.
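A small sketch of that setup: CBOW-style training pairs with an asymmetric window and the focus word in the rightmost position, which makes the prediction task directly comparable to an n-gram model predicting the next word from its left context. The sentence and window size are arbitrary.

```python
def cbow_pairs_asymmetric(tokens, window=3):
    # context = the `window` words to the left, target = the focus word
    # in the rightmost position (analogous to an n-gram model's history)
    pairs = []
    for i in range(window, len(tokens)):
        pairs.append((tokens[i - window:i], tokens[i]))
    return pairs

sentence = "the cat sat on the mat".split()
for context, target in cbow_pairs_asymmetric(sentence, window=3):
    print(context, "->", target)
# ['the', 'cat', 'sat'] -> on
# ['cat', 'sat', 'on'] -> the
# ['sat', 'on', 'the'] -> mat
```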
Q: How to rate the similarity of two words? [DM]
A: By humans? Or by machine? For humans: ask many people to rate the similarity of a large number of word pairs. This turns out to produce surprisingly consistent and stable results. For machines: use the cosine of two vector representations or any other mathematical procedure with a similar behaviour.
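A sketch of the "by machine" case (it assumes the gensim package and its downloadable "glove-wiki-gigaword-50" vectors; any pretrained static embedding would do).

```python
import gensim.downloader as api

# load small pretrained GloVe vectors (downloaded on first use)
wv = api.load("glove-wiki-gigaword-50")

print(wv.similarity("cat", "dog"))     # cosine similarity, fairly high
print(wv.similarity("cat", "stone"))   # noticeably lower
```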
Q: What is the purpose of word embeddings? [DM]
A: See the answer to the first question above.
Q: What are the best techniques that can be used to generate word embeddings? [DM]
A: There is no universal answer to this question. It always depends on the application the embeddings are used for. Which technique is "best" (and in which sense?) needs to be determined experimentally.
Q: Is there any challenge with word embeddings? [DM]
A: Many! Finding enough training data, determining the optimal architecture of the network and the optimal settings for the other meta-parameters like the proportion of negative samples, making training sensitive to ambiguity, optimizing the training procedure, ...
Q: What is the negative sampling technique in word embeddings? (LE)
A: By definition, any training instance found in a natural language text will be a positive one. For principled reasons we are not able to observe any negative samples "out in the wild". Therefore, we need another source for them: we generate them randomly! Since this might occasionally produce samples that are actually positive, we would in principle need to check whether these random samples can be found somewhere in an actually occurring text and filter them out. But that would only produce weak evidence, since we can never be sure that we really tried hard enough to find them (there will always be yet another text out there that would need to be checked). For these reasons we do not even attempt to find the randomly generated samples in real text, but simply use them, even at the risk that some of them are actually positive. We compensate for this risk by choosing more negative samples than positive ones.
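A sketch of how word2vec-style training usually draws these negative samples: randomly from the unigram distribution raised to the power 0.75 (the exponent used in the original word2vec work), without checking whether the sampled pair actually occurs in any text. The vocabulary and counts below are invented.

```python
import numpy as np

rng = np.random.default_rng(0)

# toy vocabulary with unigram counts (invented numbers)
vocab  = ["the", "cat", "sat", "on", "mat", "dog", "bank"]
counts = np.array([500, 40, 30, 200, 20, 35, 25], dtype=float)

# raising the counts to 0.75 dampens the dominance of very frequent words
probs = counts ** 0.75
probs /= probs.sum()

def negative_samples(center, k=5):
    # k random "noise" words paired with the center word; we do NOT check
    # whether these pairs happen to occur in some real text
    return [(center, vocab[i]) for i in rng.choice(len(vocab), size=k, p=probs)]

print(negative_samples("cat", k=5))
```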
Q: Why do we need negative samples at all? (WM)
A: Training the embeddings is based on moving the weight vectors for the words occurring in the same window closer together, and increasing their distance from those found in the negative samples. If there are no negative samples, the training procedure has a trivial solution: move all the weight vectors to the very same position in the feature space. Then their distance is minimal, but they are useless for any practical purpose.
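A compact NumPy sketch of one skip-gram-with-negative-sampling update step, to make the "pull together / push apart" intuition concrete; all vectors and the learning rate are toy values. Without the loop over negatives, every update would only increase dot products, and the trivial solution of identical vectors described above becomes optimal.

```python
import numpy as np

rng = np.random.default_rng(1)
dim, lr = 8, 0.1

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sgns_step(v_center, u_context, u_negatives):
    # positive pair: pull center and context vectors together
    g = 1.0 - sigmoid(v_center @ u_context)
    v_grad = g * u_context
    u_context += lr * g * v_center
    # negative pairs: push center and noise vectors apart
    for u_neg in u_negatives:
        g_neg = -sigmoid(v_center @ u_neg)
        v_grad += g_neg * u_neg
        u_neg += lr * g_neg * v_center
    v_center += lr * v_grad

v_cat  = rng.normal(size=dim)
u_dog  = rng.normal(size=dim)
u_negs = [rng.normal(size=dim) for _ in range(5)]

before = v_cat @ u_dog
sgns_step(v_cat, u_dog, u_negs)
print(before, "->", v_cat @ u_dog)   # dot product with the positive context typically grows
```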
Q: What does a low-dimensional vector space mean? (LE)
A: Usually a vector with 200 to 1000 dimensions is called low-dimensional (in contrast to vectors with 100,000 dimensions).
Q: What is BERT and how does it work? (TK)
A: That question will be addressed in ch. 10.
--
WolfgangMenzel - 22 Feb 2023