Vocal Tract Length Perturbation (VTLP) improves speech recognition
Navdeep Jaitly (email@example.com), University of Toronto, 10 King's College Rd., Toronto, ON M5S 3G4, Canada
Geoffrey E. Hinton (firstname.lastname@example.org), University of Toronto, 10 King's College Rd., Toronto, ON M5S 3G4, Canada
Augmenting datasets by transforming inputs in a way that does not change the label is a crucial ingredient of state-of-the-art methods for object recognition using neural networks. However, this approach has (to our knowledge) not been exploited successfully in speech recognition (with or without neural networks). In this paper we lay the foundation for this approach, and show one way of augmenting speech datasets by transforming spectrograms, using a random linear warping along the frequency dimension. In practice this can be achieved by using the warping techniques employed for vocal tract length normalization (VTLN), with the difference that a warp factor is generated randomly each time during training, rather than fitting a single warp factor to each training and test speaker (or utterance). At test time, a prediction is made by averaging the predictions over multiple warp factors. When this technique is applied to TIMIT using Deep Neural Networks (DNNs) of different depths, the Phone Error Rate (PER) improved by an average of 0.65% on the test set. For a Convolutional Neural Network (CNN) with a convolutional layer at the bottom, a gain of 1.0% was observed. These improvements were achieved without increasing the number of training epochs, and suggest that data transformations should be an important component of training neural networks for speech, especially for data-limited projects.
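The core operation described above, a random piecewise-linear warp along the frequency axis of a spectrogram, can be sketched as follows. This is an illustrative NumPy implementation, not the authors' code: the warp is linear (scaling by a factor `alpha`) up to a boundary frequency and then interpolated so that the highest bin maps to itself; the boundary fraction `f_hi = 0.9` and the sampling range for `alpha` are assumptions chosen here for illustration.

```python
import numpy as np

def vtlp_warp(spec, alpha, f_hi=0.9):
    """Apply a VTLP-style piecewise-linear frequency warp to a spectrogram.

    spec  : array of shape (time, freq_bins)
    alpha : warp factor, e.g. drawn uniformly from [0.9, 1.1] per utterance
    f_hi  : fraction of the top frequency below which the warp is purely
            linear (0.9 is an assumed choice for this sketch)
    """
    n_bins = spec.shape[1]
    f = np.arange(n_bins, dtype=np.float64)   # original bin positions
    f_max = float(n_bins - 1)                 # highest frequency bin
    boundary = f_hi * f_max * min(alpha, 1.0) / alpha
    # Below the boundary, frequencies scale by alpha; above it, the warp
    # interpolates linearly so that f_max maps exactly to f_max.
    warped = np.where(
        f <= boundary,
        alpha * f,
        f_max - (f_max - alpha * boundary) * (f_max - f) / (f_max - boundary),
    )
    # Resample each time frame at the (monotonically increasing) warped
    # frequencies to produce the augmented spectrogram.
    out = np.empty_like(spec)
    for t in range(spec.shape[0]):
        out[t] = np.interp(f, warped, spec[t])
    return out

# Training: draw a fresh warp factor per utterance, each epoch.
rng = np.random.default_rng(0)
spec = rng.random((100, 40))                      # toy (time, freq) spectrogram
augmented = vtlp_warp(spec, alpha=rng.uniform(0.9, 1.1))
```

For the test-time procedure the abstract describes, one would run the model on several warped copies of the input (e.g. `alpha` in {0.95, 1.0, 1.05}) and average the resulting predictions. Note that with `alpha = 1.0` the warp reduces to the identity, which is a useful sanity check.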