Residual Speaker Representation for One-Shot Voice Conversion

Abstract

Voice conversion has advanced significantly in recent years and now achieves high-quality results. However, two critical challenges remain. First, current voice conversion methods have limited robustness when encountering unseen speakers. Second, they offer limited control over the timbre representation. To address these challenges, this paper presents a novel approach, called the residual speaker module (RSM), that leverages tokens of multi-layer residual approximations to enhance robustness when dealing with unseen speakers. The multi-layer approximations facilitate the separation of information from the timbre, enabling effective control over timbre in voice conversion. The proposed method outperforms the baselines in both subjective and objective evaluations and shows increased robustness.

Framework of the voice conversion and speaker representation control

Voice Conversion

ID Description
B01 FreeVC, with a pretrained speaker encoder.
B02 FreeVC-s, with a jointly trained speaker encoder.
B03 FreeVC with the speaker encoder replaced by GST.
P01 (Ours) FreeVC with the speaker encoder replaced by a 4-layer RSM.
P02 (Ours) Ablation of P01: RSM without residual connections.
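
The difference between P01 and P02 can be illustrated with a small sketch. This is not the paper's implementation, only one plausible reading of "tokens of multi-layer residual approximations": each layer holds a bank of tokens, approximates its input as an attention-weighted mix of those tokens, and (with residual connections) passes the unexplained remainder to the next layer. The names `approximate`, `residual_speaker_module`, and `token_banks`, the dimensions, the random tokens, and the choice to sum the per-layer outputs are all assumptions made for illustration.

```python
import math
import random

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def approximate(query, tokens):
    """Approximate `query` as an attention-weighted mix of `tokens`."""
    scale = math.sqrt(len(query))
    scores = [sum(q * t for q, t in zip(query, tok)) / scale for tok in tokens]
    weights = softmax(scores)
    return [sum(w * tok[d] for w, tok in zip(weights, tokens))
            for d in range(len(query))]

def residual_speaker_module(ref, token_banks, residual=True):
    """Return (per-layer approximations, their sum) for a reference embedding.

    residual=True  : each layer models what earlier layers failed to capture
                     (hypothetical stand-in for P01).
    residual=False : every layer sees the same input
                     (a guess at what the P02 ablation removes).
    """
    target = list(ref)
    layers = []
    for tokens in token_banks:
        approx = approximate(target, tokens)
        layers.append(approx)
        if residual:
            target = [t - a for t, a in zip(target, approx)]
    speaker = [sum(vals) for vals in zip(*layers)]
    return layers, speaker

# Toy setup: 4 layers, 10 random tokens per layer, 8-dimensional embeddings.
random.seed(0)
DIM, N_TOKENS, N_LAYERS = 8, 10, 4
token_banks = [[[random.gauss(0, 1) for _ in range(DIM)]
                for _ in range(N_TOKENS)]
               for _ in range(N_LAYERS)]
ref = [random.gauss(0, 1) for _ in range(DIM)]

layers, speaker = residual_speaker_module(ref, token_banks)
```

In a trained system the tokens would be learned parameters and `ref` would come from an encoder over the reference utterance; here both are random placeholders, so only the data flow is meaningful.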

Audio samples (two sets), each with: Source, Target, P01 (Ours), P02 (Ours), B01, B02, B03

Voice Control

Mel-spectrogram of synthesized speech after replacing speaker representations extracted by RSM layer by layer
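
Layer-by-layer replacement can be pictured as follows. This is a hedged sketch, assuming the final speaker representation is the sum of the per-layer RSM outputs; the function `replace_layers` and the toy vectors are hypothetical and are not taken from the paper.

```python
def replace_layers(source_layers, reference_layers, k):
    """Mix two speakers: the first k layers come from the reference
    speaker, the remaining layers from the source speaker.

    Assumes (for illustration only) that the speaker representation
    is the element-wise sum of the per-layer vectors.
    """
    if len(source_layers) != len(reference_layers):
        raise ValueError("both speakers need the same number of layers")
    if not 0 <= k <= len(source_layers):
        raise ValueError("k must be between 0 and the number of layers")
    mixed = reference_layers[:k] + source_layers[k:]
    return [sum(vals) for vals in zip(*mixed)]

# Toy 4-layer, 2-dimensional representations for two speakers.
source_layers = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5], [0.0, 0.0]]
reference_layers = [[-1.0, 0.0], [0.0, -1.0], [0.5, -0.5], [1.0, 1.0]]

# Sweep k from 0 (pure source) to 4 (pure reference).
sweep = [replace_layers(source_layers, reference_layers, k) for k in range(5)]
```

Feeding each vector in `sweep` to the synthesizer would give one spectrogram per replacement step, which is the kind of progression the figure shows.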