Speech-to-Speech Translation with Discrete-Unit-Based Style Transfer

Abstract. Direct speech-to-speech translation (S2ST) with discrete self-supervised representations has achieved remarkable accuracy, but is unable to preserve the speaker timbre of the source speech. Meanwhile, the scarcity of high-quality speaker-parallel data poses a challenge for learning style transfer during translation. We design an S2ST pipeline with style-transfer capability on the basis of discrete self-supervised speech representations and codec units. The acoustic language model we introduce for style transfer leverages self-supervised in-context learning, acquiring style transfer ability without relying on any speaker-parallel data, thereby overcoming data scarcity. By using extensive training data, our model achieves zero-shot cross-lingual style transfer on previously unseen source languages. Experiments show that our model generates translated speeches with high fidelity and speaker similarity. Audio samples are available at http://stylelm.github.io/ .

Model Overview

Our method comprises three consecutive stages, utilizing two distinct types of discrete units: 1) speech-to-semantic-unit translation stage S₁, which converts source audio into semantic units of the translated speech; 2) acoustic unit modeling stage S₂, generating target acoustic units conditioned on the semantic output from the preceding stage and the acoustic units of the source speech as style prompt; 3) unit-to-wave generation stage S₃, producing translated speech that maintains consistent style with the source.

The training procedure of S₂ adopts a self-supervised training paradigm, where the first three seconds of each audio sample is truncated as prompt, and the acoustic language model is trained to predict the acoustic units of the remaining part conditioned on its semantic units and the prompt acoustic units with cross-entropy loss. This in-context learning approach enables the model to grasp the correspondence in acoustic characteristics between the two parts and acquire style transfer ability. During inference, we use semantic tokens from the previous stage and acoustic units of source speech as the style prompt to realize cross-lingual style transfer.

Translation on Es-En

Translation on Fr-En

Translation on Es-En

In this section, we provide results on Es-En translation. (You may need to scroll right to see full results.)

	Source	CVSS-T target	S2UT	Ours	S2UT + PPG-VC	S2UT + NANSY	S2UT + YourTTS
Audio
Ref Text	Dependiendo del tipo de material se conserva en la Biblioteca o el Archivo.	depending on the type of material it is kept in the library or in the archive
Asr Result			depending on the type of material it is preserved in the library or the archive	depending on the type of material it is preserved in the library or the archive	depending on the type of material it is preserved in the library or the archive	depending on the type of material it is preserved in the library or the archive	depending on the type of material it is preserved in the library or the archive
Audio
Ref Text	Los ocho números de la revista fueron publicados por Fantasy Publishing Company, Inc.	the eight volumes of the magazine were published by fantasy publishing company inc
Asr Result			the eight numbers of the magazines were published by fentasy publishing company incorporated	the eight numbers of the magazines were published by fantasy publishing company incorporated	the eight numbers of the magazines were published by fantasy publishing company incorporated	the eight numbers of the magazines were published by fantasy publishing company incorporated	the eight numbers of the magazines were published by fantasy publishing company incorporated
Audio
Ref Text	Cantacuceno casó a su hija Helena con el joven emperador para sellar el acuerdo.	kantakouzenos married his daughter helena with the young emperor in order to seal the deal
Asr Result			canta cosena marry his daughter helena with the young emperor to save the agreement	cante poseno marrid his daughter helena with the young emperor to save the agreement	canto caina marry his daughter elena with the young emperor to save the agreement	cantacosena married his daughter helena with the young emperor to save the agreement	cantacusena married his daughter helena with the young emperor to say the agreement
Audio
Ref Text	Lideró una rebelión contra Pierre Nord Alexis y lo sucedió como presidente.	he led a rebellion against pierre nord alexis and succeeded him as president
Asr Result			he lead a rebellion against pierre nod alexis and succeeded as president	he lead a rebellion against biernad alexis and succeeded as president	he laid a rebellion against pierre nod alexis and succeeded as president	he laid a rebellion against pierr nod alexis and succeeded as president	he laid a rebellion against piernod alexis and succeeded as president

Translation on Fr-En

In this section, we provide results on Fr-En translation. (You may need to scroll right to see full results.)

	Source	CVSS-T target	S2UT	Ours	S2UT + PPG-VC	S2UT + NANSY	S2UT + YourTTS
Audio
Ref Text	Il est situé au sud-est de l'île, à quelques kilomètres de Joao Barrosa.	it is located on the southeast part of the island several kilometers from joao barrosa
Asr Result			it is located in the southeast of the island wich a few kilometers of gelboros	it is located in the south east of the island with a few kilometers of jal borrows	it is located in the southeast of the island which a few kilometers of gelbros	it is located in the southeast of the island with a few kilometers of gelborougs	it is located in the southeast of the island with a few kilometers of jobros
Audio
Ref Text	Chladni est le fondateur de l'acoustique moderne.	chladni is the founder of modern acoustics
Asr Result			cadne is the founder of modern acoustics	cadney is the founder of modern acoustics	cadne is the founder of modern equistics	cadme is the founder of modern acoustics	cadne is the founder of modern acoustics
Audio
Ref Text	Depuis sa sortie, les expériences avec cet album sont très positives pour le groupe.	since its release the experiences with this album has been very positive for the band
Asr Result			since its release the experiences with this album are very positive for the band	since this release the experiences with this album are very positive for the band	since its release the experiences with this elblem are very positive for the band	since its release the experiences with this album are very positive for the band	since its release the experiences with this album are very causative for the band
Audio
Ref Text	Il s’agit donc d’une séquence continue de diminution du déficit nominal.	it is therefore a continuous sequence of decreasing the nominal deficit
Asr Result			this is a continuous sequence of decrease with a nominal deficit	this is a continuous sequence of decreased with the nominal deficit	this is a continuous sequence of decrease with the nominal deficit	this is a continuous sequence of decreased with the nominal deficit	this is a continuous sequence of decrease to the nominal deficit

Speech-to-Speech Translation with Discrete-Unit-Based Style Transfer

Model Overview

Table of Contents

Translation on Es-En

Translation on Fr-En