Conservative Training

It is well known that in connectionist learning, acquiring new information during adaptation may cause partial or total oblivion of the previously learned information [12,1]. This effect must be taken into account when adapting an ANN with a limited amount of data, i.e. when the probability of the absence of samples for some acoustic-phonetic units is high. The problem is more severe in the ANN modeling framework than in classical Gaussian Mixture HMMs. The reason is that an ANN uses discriminative training to estimate the posterior probability of each acoustic-phonetic unit. The minimization of the output error is performed by means of the Back-Propagation algorithm, which penalizes the units with no observations in the adaptation set by setting the target value of their output units to zero for every adaptation frame. This target assignment policy reduces the ANN's capability of correctly classifying the corresponding acoustic-phonetic units. In contrast, Gaussian Mixture models with little or no observations remain un-adapted, or share some adaptation transformations of their parameters with other similar acoustic models, thus maintaining the knowledge acquired before adaptation.

To mitigate this oblivion problem, it has been proposed [3] to include in the adaptation set examples of the missing classes taken from the training set. The disadvantage of this approach is that a substantial amount of the training set must be stored in order to have enough examples of the missing classes for each adaptation task. In [1], it has been proposed to approximate the real patterns with pseudo-patterns rather than using the training set. A pseudo-pattern is a pair consisting of a random input activation and the corresponding network output. These pseudo-patterns are included in the set of new patterns to be learned to prevent catastrophic forgetting of the original patterns. Both solutions run into problems when applied to the adaptation of large ANNs. There are no criteria for selecting adaptation samples from the training data, which are often not available when adaptation is performed. Moreover, the selected data should share the characteristics that make the adaptation environment different from the training one, but the elements of such a difference are often unknown. Furthermore, it is unclear how effective pseudo-patterns can be generated when the dimensionality of the input features is high.
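The pseudo-pattern idea of [1] can be sketched as follows. This is a minimal illustration, not the formulation of [1]: the stand-in "original network" (a random projection plus softmax) and all function names are assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for the original trained network: a fixed random
# projection over 4 inputs followed by a softmax over 3 classes.
W = rng.normal(size=(4, 3))

def original_net(x):
    z = W.T @ x
    e = np.exp(z - z.max())  # numerically stable softmax
    return e / e.sum()

def make_pseudo_patterns(net, n_patterns, input_dim):
    """Generate (random input, network output) pairs.

    Rehearsing these pairs alongside the new adaptation patterns
    approximates rehearsing the original training patterns, which
    counteracts catastrophic forgetting.
    """
    inputs = rng.uniform(-1.0, 1.0, size=(n_patterns, input_dim))
    targets = np.stack([net(x) for x in inputs])
    return inputs, targets

X, T = make_pseudo_patterns(original_net, n_patterns=5, input_dim=4)
```

As the text notes, the weakness of this scheme is that random inputs become poor probes of the learned mapping when the input dimensionality is high.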

A solution, called Conservative Training (CT), is now proposed to mitigate the forgetting problem. Since the Back-Propagation technique used for MLP training is discriminative, the units for which no observations are available in the adaptation set will have zero as a target value for all the adaptation samples. Thus, during adaptation, the weights of the MLP will be biased to favor the output activations of the units with samples in the adaptation set and to weaken the other units, whose posterior probabilities are progressively driven toward zero. Conservative Training does not set the targets of the missing units to zero; instead, it uses the outputs computed by the original network as target values. Regularization, as proposed in [10], is another solution to the forgetting problem. Regularization has theoretical justifications and affects all the ANN outputs by constraining the variations of the network weights. Unfortunately, regularization does not directly address the problem of classes that do not appear in the adaptation set. We tested the regularization approach in a preliminary set of experiments, obtaining only minor improvements. Furthermore, we found it difficult to tune a single regularization parameter that could perform the adaptation while avoiding catastrophic forgetting. Conservative Training, on the contrary, explicitly takes all the output units into account by providing target values that are estimated by the original ANN model using the samples of the units available in the adaptation set.
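For comparison, the weight-constraining regularization discussed above can be sketched as follows. This is a common quadratic form shown only for illustration; it is an assumption of this sketch, not necessarily the exact formulation of [10].

```python
import numpy as np

def regularized_adaptation_loss(task_error, weights, orig_weights, lam):
    """Adaptation loss with a quadratic penalty tying the network
    weights to their pre-adaptation values.

    task_error   : error on the adaptation data (e.g. cross-entropy)
    weights      : current network weights (flattened)
    orig_weights : weights before adaptation
    lam          : the single regularization parameter that, as noted
                   in the text, is difficult to tune
    """
    penalty = lam * np.sum((weights - orig_weights) ** 2)
    return task_error + penalty

loss = regularized_adaptation_loss(
    task_error=1.0,
    weights=np.array([1.0, 1.0, 1.0]),
    orig_weights=np.zeros(3),
    lam=0.1,
)
# loss = 1.0 + 0.1 * 3 = 1.3
```

Note that the penalty acts uniformly on all weights; nothing in it distinguishes the output units of classes that are absent from the adaptation set, which is the gap Conservative Training addresses.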

Let $F_p$ be the set of phonetic units included in the adaptation set ($p$ indicates presence), and let $F_m$ be the set of the missing units. In Conservative Training the target values are assigned as follows:


$\displaystyle T(f_i \in F_m \mid O_t) = OUTPUT\_ORIGINAL\_NN(f_i \mid O_t)$ (6)
$\displaystyle T(f_i \in F_p \mid O_t \cap correct(f_i \mid O_t)) = 1.0 - \sum_{j \in F_m} OUTPUT\_ORIGINAL\_NN(f_j \mid O_t)$ (7)
$\displaystyle T(f_i \in F_p \mid O_t \cap \,!correct(f_i \mid O_t)) = 0.0$ (8)

where $T(f_i \in F_p \mid O_t)$ is the target value associated with the input pattern $O_t$ for a unit $f_i$ that is present in the adaptation set, and $T(f_i \in F_m \mid O_t)$ is the target value associated with the input pattern $O_t$ for a unit that is not present in the adaptation set. $OUTPUT\_ORIGINAL\_NN(f_i \mid O_t)$ is the output of the original network (before adaptation) for the phonetic unit $f_i$ given the input pattern $O_t$, and $correct(f_i \mid O_t)$ is a predicate that is true if the phonetic unit $f_i$ is the correct class for the input pattern $O_t$. Thus, a phonetic unit that is missing in the adaptation set will keep the value that it would have had with the original un-adapted network, rather than obtaining a zero target value for each input pattern. This policy, like many other target assignment policies, is not optimal. Nevertheless, it has the advantage of being applicable in practice to large and very large vocabulary ASR systems, using information from the adaptation environment and avoiding the destruction of the class boundaries of the missing classes.

It is worth noting that in badly mismatched training and adaptation conditions, for example in some environmental adaptation tasks, acoustically mismatched adaptation samples may produce unpredictable activations in the original network that provides the targets. This is a real problem for all adaptation approaches: if the adaptation data are scarce and have largely different characteristics (SNR, channel, speaker age, etc.), other normalization techniques have to be used to transform the input patterns into a domain similar to the original acoustic space. Although different strategies of target assignment can be devised, the experiments reported in the next sections have been performed using only this approach. Possible variations, within the same framework, include a fuzzy definition of the missing classes and the interpolation of the original network outputs with the standard 0/1 targets.
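The target assignment above can be sketched for a single frame as follows. This is an illustrative implementation under assumed data structures (a vector of original-network outputs, a boolean mask marking the units of $F_p$, and the index of the correct unit); the function name is invented for the example.

```python
import numpy as np

def conservative_targets(orig_outputs, correct_class, present_mask):
    """Build the Conservative Training target vector for one frame O_t.

    orig_outputs  : outputs of the original (un-adapted) network for O_t
    correct_class : index of the correct phonetic unit for O_t (in F_p)
    present_mask  : True for units in F_p, False for units in F_m
    """
    targets = np.zeros_like(orig_outputs)
    missing = ~present_mask
    # Units missing from the adaptation set keep the original
    # network's output instead of a zero target
    targets[missing] = orig_outputs[missing]
    # The correct present unit receives the remaining probability
    # mass, so that the targets still sum to one
    targets[correct_class] = 1.0 - orig_outputs[missing].sum()
    # Incorrect present units keep the zero target set at initialization
    return targets

orig = np.array([0.6, 0.2, 0.1, 0.1])           # original network outputs
present = np.array([True, True, False, False])  # units 2 and 3 are in F_m
t = conservative_targets(orig, correct_class=0, present_mask=present)
# t is [0.8, 0.0, 0.1, 0.1]: unit 0 gets 1.0 minus the mass kept by F_m
```

With standard 0/1 targets the same frame would yield [1.0, 0.0, 0.0, 0.0], repeatedly pushing the outputs of units 2 and 3 toward zero; the conservative targets preserve their pre-adaptation activations instead.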

Stefano Scanzio 2007-10-24