Maximizing Mutual Information for Tacotron, Peng Liu, Xixin Wu, Shiyin Kang, Guangzhi Li, Dan Su, Dong Yu
https://arxiv.org/abs/1909.01145

This publication investigates local information preference as a source of errors in the Tacotron model.

Autoregressive (AR) models predict their next output given their previous outputs, described by a conditional probability P(x_i|x_<i). If each prediction depends only on a single previous output, the model has the Markov property. AR models are commonly trained with teacher forcing, meaning that the previous values x_<i in each autoregressive step are taken from the training data (ground truth) instead of the model's own predictions. When both ground truth and predictions are used, this is commonly called scheduled sampling or curriculum learning. A conditional autoregressive (CAR) model furthermore bases each prediction on an additional input – P(x_i|x_<i, t) – in the case of Tacotron some sort of linguistic specification t, e.g. characters of an orthographic transcription or phones from a phonetic transcription.
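To make the distinction concrete, here is a minimal sketch (not Tacotron code) of teacher-forced versus free-running decoding in a conditional autoregressive model; `model_step`, the zero <GO> frame, and the 50/50 mixing are hypothetical stand-ins for illustration only.

```python
import numpy as np

def model_step(prev_frame, t_i):
    """Hypothetical one-step CAR predictor for P(x_i | x_<i, t):
    a stand-in that simply mixes the previous frame with the conditioning input."""
    return 0.5 * prev_frame + 0.5 * t_i

def teacher_forced_rollout(ground_truth, t):
    """Training with teacher forcing: each step sees the ground-truth previous frame."""
    go_frame = np.zeros_like(ground_truth[0])
    prev_frames = [go_frame] + list(ground_truth[:-1])
    return np.stack([model_step(p, t_i) for p, t_i in zip(prev_frames, t)])

def free_running_rollout(t, frame_dim):
    """Inference: each step sees the model's own previous prediction."""
    preds, prev = [], np.zeros(frame_dim)
    for t_i in t:
        prev = model_step(prev, t_i)
        preds.append(prev)
    return np.stack(preds)
```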

The hypothesis of this paper is that a significant part of the errors produced by Tacotron is caused by local information preference, which intuitively means that a prediction is based mostly on the autoregressive part – i.e. the previous outputs x_<i – and not on the additional input t. For Tacotron this means relying on the previous Mel spectrogram frame while largely ignoring the linguistic input.

The authors furthermore hypothesize that this is why a reduction factor > 1 works well to make training more robust. The reduction factor determines how many frames are produced per decoder step. If the model has to produce multiple frames given only one previous frame, it is forced to make use of the additional input t to still satisfy the loss function (see the sketch below). They also argue that this might be why the 0.5 prenet dropout and the rather large frame shift in Tacotron are vital for good results.
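A rough sketch of how a reduction factor r > 1 changes the decoder loop: each step emits r frames but only the last frame of the previous group is fed back, so the remaining frames have to be explained by the conditioning input t. The function names and shapes are illustrative assumptions, not Tacotron's actual implementation.

```python
import numpy as np

def decode_with_reduction(decoder_step, t, n_frames, frame_dim, r=2):
    """decoder_step(prev_frame, t) is assumed to return r frames of shape (r, frame_dim).

    With r > 1, the single fed-back frame cannot explain all r outputs on its own,
    which pushes the model to actually use the conditioning input t."""
    outputs = []
    prev = np.zeros(frame_dim)            # <GO> frame
    for _ in range(0, n_frames, r):
        group = decoder_step(prev, t)     # r frames from one decoder step
        outputs.append(group)
        prev = group[-1]                  # only the last frame is fed back
    return np.concatenate(outputs)[:n_frames]
```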

The proposed methods aim to reward the model for making use of the additional input t. Dropout Frame Rate (DFR) randomly sets the previous frame x_{i-1} to the global mean, so that the model cannot rely solely on it for future predictions. The other option is Maximizing Mutual Information (MMI) between the linguistic input and the predicted output features, to ensure the model learns a meaningful mapping instead of an arbitrary representation that merely satisfies the loss function.
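The DFR idea from the paragraph above can be sketched as follows; the interface and the rate of 0.2 are placeholder assumptions, not necessarily the values used in the paper.

```python
import numpy as np

def apply_dropout_frame_rate(prev_frame, global_mean_frame, dfr=0.2, rng=None):
    """With probability `dfr`, replace the fed-back frame by the global mean frame
    (computed over the training data), so the model cannot rely solely on x_{i-1}."""
    rng = rng or np.random.default_rng()
    return global_mean_frame if rng.random() < dfr else prev_frame
```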

The results show that both DFR and MMI decrease the number of errors produced by the model and also reach a good alignment at a much earlier training step than without these additions. Furthermore, MMI closes the gap between training and validation error often seen when training Tacotron.