Maximizing Mutual Information for Tacotron, Peng Liu, Xixin Wu, Shiyin Kang, Guangzhi Li, Dan Su, Dong Yu
This publication investigates local information preference as a source of errors in the Tacotron model.
Autoregressive (AR) models predict their next output given their previous outputs, i.e. they model the conditional probability P(x_i|x_<i). If each prediction depends only on a single previous output, the model has the Markov property. AR models are commonly trained using teacher forcing, which means that the previous values x_<i in each autoregressive step are taken from the training data (ground truth) instead of the model's own predictions. When a mix of ground truth and predictions is used, this is commonly called scheduled sampling or curriculum learning. A conditional autoregressive (CAR) model furthermore bases each prediction on additional input – P(x_i|x_<i, t) – in the case of Tacotron some sort of linguistic specification t, e.g. characters of an orthographic transcription or phones from a phonetic transcription.
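The difference between free-running and teacher-forced decoding can be sketched in a few lines of Python. This is a toy illustration, not Tacotron code: `step_fn`, `ar_generate` and all values are made up for this sketch.

```python
def ar_generate(step_fn, x0, t, n_steps, teacher=None):
    """Run a (conditional) autoregressive loop.

    step_fn(prev, cond) -> next output. If `teacher` is given, the
    ground-truth outputs are fed back as inputs (teacher forcing)
    instead of the model's own predictions.
    """
    outputs, prev = [], x0
    for i in range(n_steps):
        y = step_fn(prev, t)
        outputs.append(y)
        # teacher forcing: feed ground truth instead of own prediction
        prev = teacher[i] if teacher is not None else y
    return outputs

# toy "model": next value is the mean of previous output and condition t
step = lambda prev, cond: 0.5 * (prev + cond)

free = ar_generate(step, x0=0.0, t=1.0, n_steps=3)                       # free-running
forced = ar_generate(step, x0=0.0, t=1.0, n_steps=3, teacher=[0.2, 0.4, 0.6])
```

In the free-running case each error would compound through the feedback loop; with teacher forcing every step starts from the ground-truth value, which is exactly what makes training stable but also what allows the model to lean on x_<i alone.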
The hypothesis of this paper is that a significant share of the errors produced by Tacotron is caused by local information preference, which intuitively means that a prediction is based mostly on the autoregressive part – i.e. on the previous predictions x_<i – and not on the additional input t. For Tacotron this means relying on the previous Mel spectral frame while largely ignoring the linguistic input.
The authors furthermore hypothesize that this is the reason why a reduction factor > 1 works well to make training more robust. The reduction factor determines how many frames are produced per inference step. If the model has to produce multiple frames given only one previous frame, it is forced to make use of the additional input t to still satisfy the loss function. They also argue that this might be why the 0.5 prenet dropout and the rather large frame shift in Tacotron are vital for good results.
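The bookkeeping behind the reduction factor can be sketched like this (a toy illustration with integers standing in for spectral frames; `group_frames` is a hypothetical helper, not Tacotron's actual decoder):

```python
def group_frames(frames, r):
    """With reduction factor r the decoder emits r frames per step,
    and only the last frame of each group is fed back autoregressively.
    The other r-1 frames must be inferred from the conditioning input t."""
    assert len(frames) % r == 0, "sketch assumes length divisible by r"
    steps = [frames[i:i + r] for i in range(0, len(frames), r)]
    feedback = [group[-1] for group in steps]
    return steps, feedback

steps, feedback = group_frames(list(range(6)), r=2)
# with r=2, three decoder steps cover six frames,
# but only every second frame is available as autoregressive input
```

So with r=2 the model sees half as much of its own output history per generated frame, which is the intuition for why larger r pushes it towards the linguistic input.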
The proposed methods aim to reward the model for making use of the additional input t. Dropout Frame Rate (DFR) randomly sets the previous input x_{i-1} to the global mean, so that the model cannot rely solely on it to make future predictions. The other option is Maximizing Mutual Information (MMI) between the linguistic input and the predicted output features, to ensure the model learns a meaningful mapping instead of some subjectively meaningless representation that merely satisfies the loss function.
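A minimal sketch of the DFR idea follows; the dropout probability, frame representation and function name are assumptions for illustration, not values from the paper:

```python
import random

def dropout_frame(prev_frame, global_mean, p=0.2):
    """Dropout Frame Rate (DFR) sketch: with probability p, replace the
    previous frame with the global mean frame, so the model cannot rely
    on its own previous output alone. p=0.2 is an assumed value."""
    if random.random() < p:
        return list(global_mean)
    return list(prev_frame)

random.seed(0)  # for reproducibility of this sketch
mean_frame = [0.0, 0.0]
frames = [dropout_frame([1.0, 2.0], mean_frame, p=0.5) for _ in range(1000)]
dropped = sum(f == mean_frame for f in frames)  # roughly half are dropped
```

In a real training loop this would be applied to the decoder's autoregressive input at every step, analogous to how the prenet dropout perturbs the fed-back frame.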
The results show that both DFR and MMI decrease the number of errors produced by the model and reach a good alignment at a much earlier training step. Furthermore, MMI closes the gap between training and validation error often seen when training Tacotron.
I’ve created a short exercise on Markov models for my students; here’s the story:
“You are a valued employee of MegaCorp Inc.
Locked in your basement office, you’ve been gathering information about your coworkers Adalbert, Beatrice and Celis. They work in shifts on a single machine, and you’ve been monitoring and logging their daily routines. Unfortunately they found out, disabled their webcams, removed your Trojan and cut off your access to switches and routers. Even though you are blind now, it’s still vital for you to know who is currently working at any given time, because only Adalbert isn’t vigilant enough to notice when you sneak up to the ground floor to snatch some donuts and cake from their office fridge.
Except… you’ve still got the old logs, and you can see the LEDs of their switch blinking. Unfortunately you don’t have physical access, but perhaps you can still figure out whose shift it is?”
So what’s the task?
- We model this problem using Markov Models (obviously)
- Our Markov states are the activities of a given co-worker
- Each activity produces different LED blinking patterns – so each state/activity emits an observation
- Three iterations:
- Task 1: Markov chain, only modeling states (activities) over time
- Task 2: Markov chain with observations, modeling states (activities) and observations (blinking LEDs) over time, given we have access to both
- Task 3: Hidden Markov model, modeling states (activities) and observations (blinking LEDs) over time, given we have access only to the observations
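Task 3 boils down to classic Viterbi decoding: given only the observations, find the most likely state sequence. Here is a minimal sketch; the states, observations and all probabilities below are toy values, not the exercise's actual data:

```python
def viterbi(obs, states, start_p, trans_p, emit_p):
    """Most likely hidden state sequence for an observation sequence."""
    # V[t][s]: probability of the best path ending in state s at time t
    V = [{s: start_p[s] * emit_p[s][obs[0]] for s in states}]
    back = [{}]
    for o in obs[1:]:
        V.append({}); back.append({})
        for s in states:
            prob, prev = max((V[-2][p] * trans_p[p][s] * emit_p[s][o], p)
                             for p in states)
            V[-1][s], back[-1][s] = prob, prev
    # backtrack from the best final state
    best = max(V[-1], key=V[-1].get)
    path = [best]
    for i in range(len(obs) - 1, 0, -1):
        best = back[i][best]
        path.insert(0, best)
    return path

# toy example: two activities, two LED patterns
states = ["work", "game"]
start = {"work": 0.6, "game": 0.4}
trans = {"work": {"work": 0.7, "game": 0.3},
         "game": {"work": 0.4, "game": 0.6}}
emit = {"work": {"low": 0.8, "high": 0.2},
        "game": {"low": 0.1, "high": 0.9}}
path = viterbi(["low", "high", "high"], states, start, trans, emit)
```

For longer sequences one would work in log probabilities to avoid underflow, but for an exercise-sized example the plain product is fine.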
Let’s look at the data:
- You own logs of daily activities for each of the coworkers A, B and C
- The logs are CSVs in the format: date|sequence of activities
- Activities: working (0), pause (1), online gaming (2), browsing the internet (3), streaming TV series (4)
- Furthermore you’ve been (independently) gathering data on LED blinking patterns for different activities in the format: activity|pattern
- Pattern: none (0), very low frequency (1), low frequency (2), medium frequency (3), high frequency (4), very high frequency (5)
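Estimating the transition probabilities for Task 1 from such logs can be sketched like this; the log lines below are made up in the stated `date|sequence` format, the real data lives in the notebook:

```python
from collections import Counter, defaultdict

# hypothetical log lines in the "date|sequence of activities" format
log = ["2019-01-07|0 0 1 0 3",
       "2019-01-08|0 2 2 1 0"]

counts = defaultdict(Counter)
for line in log:
    _, seq = line.split("|")
    acts = seq.split()
    # count every observed transition (current activity -> next activity)
    for a, b in zip(acts, acts[1:]):
        counts[a][b] += 1

# maximum-likelihood transition probabilities P(next | current)
trans = {a: {b: n / sum(c.values()) for b, n in c.items()}
         for a, c in counts.items()}
```

The same counting idea, applied to the activity|pattern data, yields the emission probabilities needed for Tasks 2 and 3.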
You can get the Python/Jupyter notebook here:
And check out how the correct plots should look here:
You can also try it on Google colab:
I’m a notorious parallel reader, jumping between books all the time.
Here I give a couple of opinions on the books currently in my pipeline:
- Deep Learning, Ian Goodfellow and Yoshua Bengio and Aaron Courville
- Fluent Python, Luciano Ramalho
- Seven Languages in Seven Weeks, Bruce Tate
- Clojure for the Brave and True, Daniel Higginbotham
- A Tour of C++, Bjarne Stroustrup
- The Rust Programming Language, Steve Klabnik and Carol Nichols
- (Hyperion, Dan Simmons)
(Dear international readers, this content appeared first in the German “IT Freelancer Magazin” and is unfortunately only available in German)
It was early 2012 when I applied for a doctorate in the field of speech synthesis. Honestly, at the time I couldn't picture much behind that term, but the posting mentioned machine learning and artificial intelligence. And it sounded somehow… different. Today, more than five years later, I have a pretty good overview of the field, yet in conversations with other people I keep running into the same lack of understanding I once showed myself. This article aims to outline, in broad strokes and in generally understandable terms, the tasks and problems of modern speech synthesis.