Q: How do we systematically determine the learning rate to avoid overshooting the optimum and converging too slowly in RNN training? (MW)
A: If the learning rate is too large, the training procedure is likely to "overshoot" the optimum. It then has to correct itself and, as a consequence, alternates several times between values that are too large and too small. This slows down the training. If the learning rate is too small, the procedure has to take many small steps to reach the optimum. This slows down the training as well. The best choice of the learning rate lies somewhere between these two extremes. It can be found experimentally by systematically varying the learning rate and monitoring the training time until convergence.
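The experiment described above can be sketched on a toy problem. The following is only an illustration under simplifying assumptions: it minimizes the quadratic f(x) = x**2 with plain gradient descent instead of training an RNN, and the function name `steps_to_converge`, the tolerance, and the list of candidate rates are hypothetical choices made for this example.

```python
def steps_to_converge(lr, x0=1.0, tol=1e-6, max_steps=10_000):
    """Run gradient descent on f(x) = x**2 and count the steps until |x| < tol.

    Returns None if the run diverges or does not converge within max_steps.
    """
    x = x0
    for step in range(1, max_steps + 1):
        x -= lr * 2.0 * x  # the gradient of x**2 is 2x
        if abs(x) < tol:
            return step
        if abs(x) > 1e12:  # the iterates are growing: the rate is too large
            return None
    return None


# Hypothetical grid of learning rates, from very small to too large.
candidate_rates = [0.001, 0.01, 0.1, 0.5, 0.9, 0.99, 1.1]
results = {lr: steps_to_converge(lr) for lr in candidate_rates}

# Keep only the runs that converged and pick the fastest one.
converged = {lr: n for lr, n in results.items() if n is not None}
best_lr = min(converged, key=converged.get)

for lr, n in results.items():
    print(f"lr={lr}: {'diverged' if n is None else str(n) + ' steps'}")
print("fastest:", best_lr)
```

On this toy function, rates below 0.5 approach the optimum from one side in small steps, rates between 0.5 and 1.0 jump back and forth across it, and rates above 1.0 diverge. For a real network the same sweep would compare the training time (or the validation loss after a fixed budget) per learning rate.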
Q: Is the capacity of an LSTM memory cell limited? How much information can it store? [YM]
A: Of course it is limited, because it is only a vector of fixed length. However, it is difficult (if not impossible) to quantify this capacity. Dropping one or several dimensions from a high-dimensional (numerical) feature space amounts to projecting the lost dimensions onto the remaining ones. (Imagine the projection of a three-dimensional space onto one of its two-dimensional planes.) Most likely this will lead to the loss of some distinctions, because some points in the higher-dimensional feature space will be mapped to the same or very similar points in the lower-dimensional one. Whether these distinctions are relevant depends on the application. Their impact can only be observed indirectly by measuring the corresponding loss in output quality. Because the feature space is defined by real numbers, you cannot count the information that is stored or transported in the memory channel. Perhaps you could imagine that the general picture of the problem at hand becomes more blurred or fuzzy.
Actually, the feature space is not created by reducing its number of dimensions. Instead, the number of dimensions is predefined, and the training procedure is forced to allocate the weights (and the representations that result from them) in a way that is constrained by the available number of dimensions. But the effect is the same: the smaller the number of available dimensions, the fewer distinctions the model will be able to make, or the more coarse-grained these distinctions will become (i.e. fewer exceptions to a rule can be learned).
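The projection picture used in this answer can be made concrete with a very small example. This is only a sketch with made-up points: the coordinates and the helper `project_to_xy` are hypothetical and do not come from any trained model.

```python
# Three distinct points in a three-dimensional feature space (made-up values).
points_3d = [
    (1.0, 2.0, 0.0),
    (1.0, 2.0, 5.0),  # differs from the first point only in the third dimension
    (3.0, 1.0, 2.0),
]


def project_to_xy(point):
    """Drop the third dimension, i.e. project onto the x-y plane."""
    return point[:2]


projected = [project_to_xy(p) for p in points_3d]

distinct_before = len(set(points_3d))  # number of distinguishable points in 3D
distinct_after = len(set(projected))   # number still distinguishable in 2D

print(distinct_before, distinct_after)
```

The first two points collapse onto the same projected point, so a distinction that existed in three dimensions is no longer available in two. Whether that matters depends on whether the application needed it.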
Q: What is the forget gate in an LSTM, how does it work, and why is it used? [LE]
A: The forget gate in an LSTM is one of the gates that control which information is forwarded to subsequent units of the model. In contrast to the add gate, it determines which information can be deleted (better: neutralized) from the context vector. To do so, it creates a mask that is multiplied component-wise with the current content of the context vector.
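The masking step can be sketched in a few lines. This is a minimal illustration, not a full LSTM: in a real model the gate's pre-activations are computed from the previous hidden state and the current input using learned weights, whereas here they are hypothetical hand-picked numbers, and the function name `apply_forget_gate` is an assumption made for this example.

```python
import math


def sigmoid(z):
    """Logistic function; squashes any real number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def apply_forget_gate(pre_activations, context):
    """Turn the pre-activations into a mask and apply it component-wise."""
    mask = [sigmoid(z) for z in pre_activations]
    return [m * c for m, c in zip(mask, context)]


context = [2.0, -1.0, 0.5]            # current content of the context vector
pre_activations = [10.0, -10.0, 0.0]  # hypothetical values, not learned ones

new_context = apply_forget_gate(pre_activations, context)
print(new_context)
```

A mask value close to 1 keeps a component almost unchanged (the first one), a value close to 0 neutralizes it (the second one), and a value in between scales it down (the third one is halved).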
Q: In RNNs, or in deep learning in general, most of what the models do is implicit: we don't know how they store data, how they select features, and so on. What is the reason for this? Why is not everything public? [YM]
A: Actually, all information is publicly available. You can inspect it, move it to other applications, or modify it. Unfortunately, because all the representations are distributed ones, it is only in exceptional cases possible for humans to assign a meaning to their individual components. The dynamically determined attention weights of an encoder-decoder model are one of the very few examples I am aware of. They can be used to visualize the impact that different input tokens had on the translation of certain output tokens.
Inspecting the outcome of a training procedure (i.e. the weight matrices) is similarly difficult. In addition to the lack of ways to interpret the data, their sheer amount is so large that any kind of manual inspection is not feasible.
Developing better models whose results are interpretable, or which are even able to explain their behavior, is currently a very active research topic.
--
WolfgangMenzel - 09 Mar 2023