Speaker Adaptation

The literature on speaker, environment, and application adaptation is rich in techniques for refining Automatic Speech Recognition (ASR) systems by adapting the acoustic features and the parameters of stochastic models [5,13,8,16,9]. More recently, particular attention has been paid to discriminative training techniques and their application to acoustic feature transformation [7,11].

Since discriminative methods are also used to train acoustic-phonetic Artificial Neural Network (ANN) models, it is worth exploring methods for adapting their features and model parameters. Several solutions to this problem have been proposed, and some of these techniques for adapting neural networks are compared in [2,14]. A classical approach consists in adding a linear transformation network, the Linear Input Network (LIN), that acts as a pre-processor to the main network. Alternatively, it is possible to simply adapt all the weights of the original network. A tied-posterior approach is proposed in [17] to combine Hidden Markov Models (HMMs) with ANN adaptation strategies: the weights of a hybrid ANN/HMM system are adapted by optimising the training-set cross entropy, and only a subset of the hidden units is involved. The adaptation data are propagated through the original ANN, and the nodes that exhibit the highest variances are selected, since hidden nodes with a high variance transfer a larger amount of information to the output layer. Then, only the weights of the links leaving the selected nodes are adapted.
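
As an illustration of the variance-based selection, the following Python sketch (using PyTorch) propagates the adaptation data through a frozen network, keeps the k hidden nodes with the highest activation variance, and updates only the weights of the links leaving those nodes; the layer sizes, node count and training details are assumptions made for the example, not the configuration used in [17].

    import torch
    import torch.nn as nn

    # `hidden` and `output` are assumed to hold the weights of the original trained ANN.
    hidden = nn.Sequential(nn.Linear(39, 500), nn.Sigmoid())   # features -> hidden units
    output = nn.Linear(500, 1000)                              # hidden units -> output classes

    def select_high_variance_units(adapt_x, k=100):
        with torch.no_grad():
            h = hidden(adapt_x)                      # (n_frames, 500)
        return torch.topk(h.var(dim=0), k).indices   # nodes carrying most information

    def adapt_selected_weights(adapt_x, adapt_y, selected, epochs=5, lr=1e-2):
        mask = torch.zeros_like(output.weight)
        mask[:, selected] = 1.0                      # columns index the incoming hidden nodes
        opt = torch.optim.SGD(output.parameters(), lr=lr)
        loss_fn = nn.CrossEntropyLoss()
        h = hidden(adapt_x).detach()                 # hidden layer is kept fixed
        for _ in range(epochs):
            opt.zero_grad()
            loss_fn(output(h), adapt_y).backward()
            output.weight.grad *= mask               # update only links leaving selected nodes
            output.bias.grad.zero_()
            opt.step()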

Recent adaptation techniques have been proposed with the useful properties of not requiring the storage of the previously used training data and of being effective even with a small amount of adaptation data. Methods based on speaker space adaptation [13] and eigenvoices [8] are of this type, and can be applied both to Gaussian mixture HMMs and to the ANN inputs, as proposed in [4]. The parameters of the transformations are considered the components of a vector in a parameter adaptation space, and the principal components of this space define a speaker space. Rapid adaptation consists in finding the coordinates of a specific speaker's point in the speaker space.
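
The speaker-space idea can be sketched as follows: the adaptation parameters of each training speaker form one vector, the principal components of these vectors define the speaker space, and a new speaker is described by a few coordinates in that space. In the sketch below the coordinates are obtained by a simple projection of a rough parameter estimate; this replaces, for illustration only, the maximum-likelihood estimation normally used in eigenvoice methods, and all names and dimensions are assumptions.

    import numpy as np

    def build_speaker_space(speaker_params, n_components=10):
        """speaker_params: (n_speakers, n_params) matrix, one transform per speaker."""
        mean = speaker_params.mean(axis=0)
        _, _, vt = np.linalg.svd(speaker_params - mean, full_matrices=False)
        return mean, vt[:n_components]                    # mean and principal directions

    def rapid_adaptation(mean, components, rough_estimate):
        coords = components @ (rough_estimate - mean)     # coordinates of the new speaker
        return mean + components.T @ coords               # transform reconstructed in the space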

Another approach is the regularised adaptation proposed in [10], where the original weights of the network, trained with unadapted data, constitute the a priori knowledge used to control the degree of adaptation and to avoid overfitting on the adaptation data.
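
A minimal sketch of this kind of regularised adaptation is given below, assuming an L2 penalty that pulls the adapted weights towards the original ones, with the penalty weight beta controlling the degree of adaptation; the loss, optimiser and hyper-parameters are illustrative and not necessarily those of [10].

    import copy
    import torch
    import torch.nn as nn

    def regularised_adaptation(model, adapt_x, adapt_y, beta=0.1, epochs=5, lr=1e-3):
        original = copy.deepcopy(model)                  # unadapted weights act as the prior
        for p in original.parameters():
            p.requires_grad_(False)
        opt = torch.optim.SGD(model.parameters(), lr=lr)
        ce = nn.CrossEntropyLoss()
        for _ in range(epochs):
            opt.zero_grad()
            loss = ce(model(adapt_x), adapt_y)
            for p, p0 in zip(model.parameters(), original.parameters()):
                loss = loss + beta * (p - p0).pow(2).sum()   # stay close to the prior
            loss.backward()
            opt.step()
        return model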

This paper explores a new possibility that consists in adapting ANN models with transformations of an entire set of internal model features. The values of these features are collected at the output of a hidden layer, whose number of outputs is usually of the order of a few hundred. These features are supposed to represent an internal structure of the input pattern. As with input feature transformation, a linear network can be used to transform the hidden layer features. In both cases, the parameters of the adaptation network can be estimated with error Back-Propagation while keeping the parameters of the original ANN unchanged.
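
A minimal sketch of such a Linear Hidden Network (LHN) is given below: a linear layer, initialised to the identity, is inserted at the output of a hidden layer and is the only component updated during adaptation, while the original ANN weights stay frozen; class and layer names are illustrative.

    import torch
    import torch.nn as nn

    class LHNAdaptedANN(nn.Module):
        """Original network split into `front` (up to a hidden layer) and `back`;
        a linear layer of size hidden_size is inserted between them."""
        def __init__(self, front, hidden_size, back):
            super().__init__()
            self.front, self.back = front, back
            self.lhn = nn.Linear(hidden_size, hidden_size)
            nn.init.eye_(self.lhn.weight)                # start from the identity transform
            nn.init.zeros_(self.lhn.bias)
            for p in list(front.parameters()) + list(back.parameters()):
                p.requires_grad_(False)                  # original ANN weights stay frozen

        def forward(self, x):
            return self.back(self.lhn(self.front(x)))

    # During adaptation only the LHN parameters are passed to the optimiser, e.g.
    # opt = torch.optim.SGD(model.lhn.parameters(), lr=1e-3)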

A problem, however, occurs in distributed connectionist learning when a network trained with a large set of patterns has to be adapted to classify input patterns that differ in some respects from those used originally to train it. So-called ``catastrophic forgetting'' [12] arises when a network is adapted with new data that do not adequately represent the knowledge included in the original training data. The resulting performance degradation is particularly severe when the adaptation data do not contain examples for a subset of the output classes.

A review of several approaches that have been proposed to solve this problem is presented in [12]. One of them uses a set of pseudo-patterns, i.e. random patterns, associated with the output values produced by the connectionist network before adaptation. These pseudo-patterns are added to the set of new patterns to be learned [1]. The aim is to keep stable the classification boundaries of classes that have few or no samples in the new set of patterns, which effectively mitigates the catastrophic forgetting of the knowledge provided by the originally learned patterns. Tests of this solution have been reported with small networks and low-dimensional artificial input patterns. Unfortunately, it does not scale well, because it is difficult to generate effective pseudo-patterns when the dimensionality of the input features is high.
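
A minimal sketch of the pseudo-pattern idea, assuming random inputs labelled with the soft outputs of the unadapted network and a cross entropy computed against those soft targets, is given below; the function names are illustrative.

    import torch

    def make_pseudo_patterns(original_model, n_patterns, input_dim):
        x = torch.rand(n_patterns, input_dim)                  # random pseudo-inputs
        with torch.no_grad():
            soft_targets = torch.softmax(original_model(x), dim=1)
        return x, soft_targets                                 # rehearse the old boundaries

    def soft_cross_entropy(logits, soft_targets):
        # Cross entropy against the soft targets produced by the unadapted network.
        return -(soft_targets * torch.log_softmax(logits, dim=1)).sum(dim=1).mean()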

For this reason, it has been proposed [3] to include in the adaptation set examples of the missing classes taken from the training set. However, the addition of a small subset of training examples related to the missing classes could redefine the class boundaries according to the distribution of these small subsets, which would differ from that of the complete training set. Moreover, this approach has a major practical drawback: training set samples must be stored for the adaptation step, and their number should be large enough to preserve the class boundaries well. Finally, since the task-independent network could be adapted to several applications, different sets of training patterns would be necessary to compensate for the classes missing in the different adaptation sets.

This paper proposes a solution to this problem by introducing Conservative Training, a variation of the standard method of assigning the target values that compensates for the lack of adaptation samples in some classes. The key idea of Conservative Training is that the probability of classes for which no adaptation samples are available should be replaced by the best available estimate of their real values, and the only way to obtain this estimate is through the model provided by the original network.
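
The idea can be sketched as follows, assuming, as one possible choice for the example, that the targets of the classes absent from the adaptation set are copied from the posteriors of the unadapted network, the other non-correct classes receive zero, and the correct class receives the remaining probability mass; the precise target assignment used in this work is given in Section 3.

    import torch

    def conservative_targets(original_model, x, y, missing_classes):
        """x: (n, d) adaptation frames, y: (n,) true class indices,
        missing_classes: 1-D LongTensor of classes with no adaptation samples."""
        with torch.no_grad():
            posteriors = torch.softmax(original_model(x), dim=1)
        targets = torch.zeros_like(posteriors)
        targets[:, missing_classes] = posteriors[:, missing_classes]  # keep the old estimates
        remaining = 1.0 - targets.sum(dim=1)                          # mass left for the true class
        targets[torch.arange(x.size(0)), y] = remaining
        return targets             # to be used with a cross entropy against soft targets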

Experimental results on the adaptation test for the Wall Street Journal task [15], using the proposed approaches, compare favourably with published results on the same task [17,15].

The paper is organised as follows: Section 2 gives a short overview of the acoustic-phonetic ANN models used by the ASR system and presents the Linear Hidden Networks, which transform the features at the output of hidden layers. Section 3 illustrates the problem of catastrophic forgetting in connectionist learning and proposes our Conservative Training approach as a possible solution. Section 4 illustrates the benefits of Conservative Training on an artificial 16-class classification task. Section 5 reports the experiments performed on several databases, with the aim of clarifying the behaviour of the new adaptation techniques with respect to the classical LIN approach. Finally, conclusions and future developments are presented in the last section.


