Comparison of resynthesized speech between phonetic tokens and acoustic tokens

[Code] [Model]

Ryota Komatsu

Institute of Science Tokyo

Speech resynthesis samples from the LibriTTS-R test set

Original | Acoustic token | Phonetic token (sampling #1) | Phonetic token (sampling #2)

License

The LibriTTS-R dataset is made available by Google LLC under the CC BY 4.0 license.

References

  1. Y. Koizumi, H. Zen, S. Karita, Y. Ding, K. Yatabe, N. Morioka, M. Bacchiani, Y. Zhang, W. Han, and A. Bapna, "LibriTTS-R: A restored multi-speaker text-to-speech corpus," in Proc. Interspeech, 2023, pp. 5496–5500.
  2. M. Le, A. Vyas, B. Shi, B. Karrer, L. Sari, R. Moritz, M. Williamson, V. Manohar, Y. Adi, J. Mahadeokar, and W.-N. Hsu, "Voicebox: Text-guided multilingual universal speech generation at scale," in Proc. Thirty-seventh Conference on Neural Information Processing Systems, vol. 36, 2023, pp. 14005–14034.
  3. S.-g. Lee, W. Ping, B. Ginsburg, B. Catanzaro, and S. Yoon, "BigVGAN: A universal neural vocoder with large-scale training," in Proc. International Conference on Learning Representations, 2023.
  4. T. A. Nguyen, W.-N. Hsu, A. D’Avirro, B. Shi, I. Gat, M. Fazel-Zarandi, T. Remez, J. Copet, G. Synnaeve, M. Hassid, F. Kreuk, Y. Adi, and E. Dupoux, "Expresso: A benchmark and analysis of discrete expressive speech resynthesis," in Proc. Interspeech, 2023, pp. 4823–4827.
  5. R. Kumar, P. Seetharaman, A. Luebs, I. Kumar, and K. Kumar, "High-fidelity audio compression with improved RVQGAN," in Proc. Thirty-seventh Conference on Neural Information Processing Systems, vol. 36, 2023, pp. 27980–27993.