ISTD PhD Oral Defense Seminar presented by Perry Lam – Sparsity in Text-to-Speech
Abstract
Neural networks are known to be over-parametrized, and sparse models have been shown to perform as well as dense models over a range of image and language processing tasks. However, while compact representations and model compression methods have been applied to speech tasks, sparsification techniques have rarely been used on text-to-speech (TTS) models. We seek to characterize the impact of selected sparsification techniques on the performance and model complexity of TTS models. Complementary to prior research, we find that pruning before or during training can achieve performance similar to pruning after training while allowing faster training, and that removing entire neurons degrades performance much more than removing individual parameters.
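The contrast between removing individual parameters and removing entire neurons can be sketched as follows. This is only an illustrative sketch, not the method used in the thesis: the helper names `prune_weights` and `prune_neurons`, the magnitude and L1-norm criteria, and the toy weight matrix are all assumptions made for the example.

```python
def prune_weights(matrix, sparsity):
    """Unstructured pruning (assumed criterion: smallest absolute value).

    Zeros individual weights anywhere in the matrix. Ties at the
    threshold may zero slightly more than the requested fraction.
    """
    flat = sorted(abs(w) for row in matrix for w in row)
    k = int(len(flat) * sparsity)
    if k == 0:
        return [row[:] for row in matrix]
    threshold = flat[k - 1]
    return [[0.0 if abs(w) <= threshold else w for w in row] for row in matrix]


def prune_neurons(matrix, sparsity):
    """Structured pruning (assumed criterion: smallest row L1 norm).

    Treats each row as one neuron and zeros whole rows.
    """
    norms = [sum(abs(w) for w in row) for row in matrix]
    k = int(len(matrix) * sparsity)
    drop = set(sorted(range(len(matrix)), key=lambda i: norms[i])[:k])
    return [[0.0] * len(row) if i in drop else row[:]
            for i, row in enumerate(matrix)]


def count_zeros(matrix):
    return sum(1 for row in matrix for w in row if w == 0.0)


# Toy 4x4 layer: each row is the incoming weights of one neuron.
weights = [
    [0.9, -0.1, 0.4, 0.05],
    [0.2, -0.8, 0.03, 0.6],
    [0.01, 0.3, -0.7, 0.5],
    [-0.02, 0.04, 0.06, -0.08],
]

unstructured = prune_weights(weights, 0.5)
structured = prune_neurons(weights, 0.5)
print(count_zeros(unstructured), count_zeros(structured))
```

Both calls remove the same number of weights, but the structured variant also discards large weights (such as 0.9 in the first row) because they belong to a low-norm neuron, which is one intuition for why it may hurt performance more.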
Further investigation suggests that the faster training occurs in models with multiple decoders, akin to the Mixture-of-Experts architecture of recent large language models. In general, however, maintaining unstructured sparsity costs extra training time and does not reduce model size. Therefore, we propose training with decaying sparsity, i.e., a high initial sparsity to accelerate training first, followed by a progressive reduction in sparsity to obtain better eventual performance. Our experiments on TTS show that this approach obtains better losses in the first few training epochs, and that decaying-sparsity models outperform constant-sparsity models and edge out dense models, with negligible difference in training time.
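As a rough illustration of decaying sparsity, the schedule below starts at a high sparsity and reduces it over training. The abstract does not state the shape of the decay, so the linear form, the function name `decaying_sparsity`, and the default values 0.9 and 0.0 are all assumptions made for this sketch.

```python
def decaying_sparsity(epoch, total_epochs, initial=0.9, final=0.0):
    """Hypothetical linear decay from `initial` to `final` sparsity.

    `epoch` is zero-based; values outside the training range are clamped.
    """
    if total_epochs <= 1:
        return final
    clamped = min(max(epoch, 0), total_epochs - 1)
    progress = clamped / (total_epochs - 1)
    return initial + (final - initial) * progress


# Target sparsity per epoch for a 5-epoch run (illustrative numbers only).
schedule = [decaying_sparsity(e, 5) for e in range(5)]
for epoch, sparsity in enumerate(schedule):
    print(f"epoch {epoch}: target sparsity {sparsity:.3f}")
```

In a training loop, the value returned for each epoch would set how many weights are masked during that epoch, so the model trains as a sparse network early on and approaches a dense one by the end.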
Nonetheless, we find that adjusting learning rates can have the same effect as decaying sparsity. Therefore, as our final contribution, we propose a text-to-prosody generator that can reuse any TTS model with explicit duration, pitch and energy predictions, with no additional training. We apply this technique to zero-shot prosody editing and language transfer. With just one TTS model trained only on a single-speaker English dataset, we enable it to generate speech in Mandarin, German, Spanish and Hungarian, with half the character error rate of the next-best zero-shot TTS model.
Speaker’s Profile
Perry Lam is a PhD candidate at the ISTD pillar of the Singapore University of Technology and Design (SUTD), where he is advised by Prof. Dorien Herremans and Prof. Berrak Sisman. He received his B.Eng. degree from SUTD in 2015 and is interested in speech processing and psycholinguistics. He is most often spotted in the SUTD badminton club, where he is a 12th-year team member.