Abstract. Direct speech-to-speech translation (S2ST) with discrete self-supervised representations has achieved remarkable accuracy, but it cannot preserve the speaker timbre of the source speech during translation. Meanwhile, the scarcity of high-quality speaker-parallel data makes it difficult to learn style transfer between source and target speech. We propose an S2ST framework with an acoustic language model based on discrete units from a self-supervised model and a neural codec for style transfer. The acoustic language model leverages self-supervised in-context learning and acquires the ability for style transfer without relying on any speaker-parallel data, thereby overcoming the issue of data scarcity. By using extensive training data, our model achieves zero-shot cross-lingual style transfer on previously unseen source languages. Experiments show that our model generates translated speech with high fidelity and style similarity. Audio samples are available at http://stylelm.github.io/ .
Our method comprises three consecutive stages, utilizing two distinct types of discrete units: 1) speech-to-semantic-unit translation stage S1, which converts source audio into semantic units of the target speech; 2) acoustic unit modeling stage S2, which generates target acoustic units conditioned on the semantic output from the preceding stage and on the acoustic units of the source speech as a style prompt; 3) unit-to-wave generation stage S3, which produces target audio whose style is consistent with the source.
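The data flow between the three stages can be sketched as follows. This is only an illustrative outline under stated assumptions, not our implementation: the class `S2STPipeline`, its field names, and the four callables are hypothetical stand-ins for the trained models, and units are simplified to flat lists of integers.

```python
from dataclasses import dataclass
from typing import Callable, List

Units = List[int]    # a sequence of discrete unit ids (simplified to one stream)
Wave = List[float]   # a mono waveform as raw samples


@dataclass
class S2STPipeline:
    """Hypothetical wrapper; each callable stands in for a trained model."""
    extract_acoustic_units: Callable[[Wave], Units]   # codec encoder applied to source audio
    s1_translate: Callable[[Wave], Units]             # S1: source audio -> target semantic units
    s2_acoustic_lm: Callable[[Units, Units], Units]   # S2: (semantic units, style prompt) -> acoustic units
    s3_unit_to_wave: Callable[[Units], Wave]          # S3: acoustic units -> target audio

    def translate(self, source_wave: Wave) -> Wave:
        # The style prompt is taken from the source speech itself.
        style_prompt = self.extract_acoustic_units(source_wave)
        # S1 decides what is said in the target language.
        semantic_units = self.s1_translate(source_wave)
        # S2 decides how it sounds, conditioned on content and style prompt.
        target_acoustic = self.s2_acoustic_lm(semantic_units, style_prompt)
        # S3 renders the acoustic units as a waveform.
        return self.s3_unit_to_wave(target_acoustic)
```

Keeping the stages behind plain callables makes the key property visible: S1 never sees the style prompt, and S2 is the only place where content and style are combined.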
During training of S2, we extract semantic and acoustic units from the training data. We divide each training sample into two parts, using the acoustic units from one part as the prompt and those from the other as prediction targets, and train the model with a cross-entropy loss to generate the corresponding acoustic units from the semantic units and the prompt. This in-context learning approach enables the model to grasp the correspondence in acoustic characteristics between the two parts, thus acquiring the ability for style transfer. Because this training is self-supervised, it needs no speaker-parallel data and can be scaled to massive training data. During inference, we use the semantic units from the previous stage and the acoustic units of the source speech as the style prompt to realize cross-lingual style transfer.
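The prompt/target split and the loss described above can be illustrated with a small sketch. It rests on simplifying assumptions that are not taken from our system: semantic and acoustic units are frame-aligned single streams of integers, the first part of the utterance serves as the prompt, and the helper names `build_example` and `cross_entropy` are hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def build_example(
    semantic: Sequence[int], acoustic: Sequence[int], split: int
) -> Tuple[List[int], List[int], List[int]]:
    """Split one utterance into (prompt acoustic, target semantic, target acoustic).

    Assumes one acoustic unit per semantic unit (a simplification).
    """
    if len(semantic) != len(acoustic):
        raise ValueError("this sketch assumes frame-aligned unit sequences")
    if not 0 < split < len(acoustic):
        raise ValueError("split must leave both parts non-empty")
    prompt_acoustic = list(acoustic[:split])   # style prompt from the first part
    target_semantic = list(semantic[split:])   # content of the second part
    target_acoustic = list(acoustic[split:])   # prediction targets from the second part
    return prompt_acoustic, target_semantic, target_acoustic


def cross_entropy(logits: Sequence[Sequence[float]], targets: Sequence[int]) -> float:
    """Mean cross-entropy over target positions, using a stable log-sum-exp."""
    total = 0.0
    for row, target in zip(logits, targets):
        peak = max(row)
        log_z = peak + math.log(sum(math.exp(v - peak) for v in row))
        total += log_z - row[target]
    return total / len(targets)
```

Both parts come from the same utterance, so the speaker and style are shared by construction; this is why no speaker-parallel data is required. In this sketch the loss would be evaluated only on the target acoustic units, with the prompt and semantic units acting as conditioning.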
In this section, we provide results on Es-En translation.
| | Source | CVSS-T target | S2UT | S2UT + ms-vocoder | Ours |
| --- | --- | --- | --- | --- | --- |
| Ref Text | Dependiendo del tipo de material se conserva en la Biblioteca o el Archivo. | depending on the type of material it is kept in the library or in the archive | | | |
| ASR Result | | | depending on the type of material it is preserved in the library or the archive | depending on the type of material it is preserved in the library or the archive | depending on the type of material it is preserved in the library or the archive |
| Ref Text | Los ocho números de la revista fueron publicados por Fantasy Publishing Company, Inc. | the eight volumes of the magazine were published by fantasy publishing company inc | | | |
| ASR Result | | | the eight numbers of the magazines were published by fentasy publishing company incorporated | the eight numbers of the magazines were published by fentasy publishing company incorporated | eight numbers of the magazines were published by fantasy publishing company incorporated |
| Ref Text | Cantacuceno casó a su hija Helena con el joven emperador para sellar el acuerdo. | kantakouzenos married his daughter helena with the young emperor in order to seal the deal | | | |
| ASR Result | | | canta cosena marry his daughter helena with the young emperor to save the agreement | canto coseno marry his daughter elena with the young emperor to save the agreement | cante poseno marrid his daughter helena with the young emperor to save the agreement |
| Ref Text | Lideró una rebelión contra Pierre Nord Alexis y lo sucedió como presidente. | he led a rebellion against pierre nord alexis and succeeded him as president | | | |
| ASR Result | | | he lead a rebellion against pierre nod alexis and succeeded as president | he lead a rebellion against pierre not alexis and succeeded as president | he lead a rebellion against biernad alexis and succeeded as president |
In this section, we provide results on Fr-En translation.
| | Source | CVSS-T target | S2UT | S2UT + ms-vocoder | Ours |
| --- | --- | --- | --- | --- | --- |
| Ref Text | Il est situé au sud-est de l'île, à quelques kilomètres de Joao Barrosa. | it is located on the southeast part of the island several kilometers from joao barrosa | | | |
| ASR Result | | | it is located in the southeast of the island wich a few kilometers of gelboros | it is located in the southeast of the island with a few kilometers of jelbros | it is located in the south east of the island with a few kilometers of jal borrows |
| Ref Text | Chladni est le fondateur de l'acoustique moderne. | chladni is the founder of modern acoustics | | | |
| ASR Result | | | cadne is the founder of modern acoustics | cadmy is the founder of modern aquistics | cadney is the founder of modern acoustics |
| Ref Text | Depuis sa sortie, les expériences avec cet album sont très positives pour le groupe. | since its release the experiences with this album has been very positive for the band | | | |
| ASR Result | | | since its release the experiences with this album are very positive for the band | since its release the experiences with this album are very positive for the band | since this release the experiences with this album are very positive for the band |
| Ref Text | Il s’agit donc d’une séquence continue de diminution du déficit nominal. | it is therefore a continuous sequence of decreasing the nominal deficit | | | |
| ASR Result | | | this is a continuous sequence of decrease with a nominal deficit | this is a continuous sequence of decreased with a nominal deficit | this is a continuous sequence of decreased with the nominal deficit |