Speech-to-Speech Translation with Discrete-Unit-Based Style Transfer

Abstract. Direct speech-to-speech translation (S2ST) with discrete self-supervised representations has achieved remarkable accuracy, but it cannot preserve the speaker timbre of the source speech. Meanwhile, the scarcity of high-quality speaker-parallel data makes it difficult to learn style transfer during translation. We propose an S2ST framework with style-transfer capability, built on discrete self-supervised speech representations and codec units. The acoustic language model we introduce for style transfer leverages self-supervised in-context learning and acquires style-transfer ability without relying on any speaker-parallel data, thereby overcoming the data-scarcity issue. By using extensive training data, our model achieves zero-shot cross-lingual style transfer on previously unseen source languages. Experiments show that our model generates translated speech with high fidelity and style similarity. Audio samples are available at http://stylelm.github.io/

Model Overview



Our method comprises three consecutive stages and uses two distinct types of discrete units: 1) the speech-to-semantic-unit translation stage S1, which converts source audio into semantic units of the translated speech; 2) the acoustic unit modeling stage S2, which generates target acoustic units conditioned on the semantic output of the preceding stage, with the acoustic units of the source speech serving as the style prompt; 3) the unit-to-wave generation stage S3, which produces translated speech whose style is consistent with the source. We provide details about these two types of units and the three stages in the following subsections.
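
The data flow between the three stages can be sketched as follows. This is only an illustrative outline under stated assumptions: the class name `StyleS2STPipeline`, its field names, and the toy lambda components are hypothetical stand-ins, not the models used in this work. The point is the wiring: S1 sees only the source audio, S2 sees the S1 output plus the source acoustic units as the style prompt, and S3 sees only the S2 output.

```python
from dataclasses import dataclass
from typing import Callable, List

Units = List[int]    # a sequence of discrete unit indices
Wave = List[float]   # a mono waveform as raw samples


@dataclass
class StyleS2STPipeline:
    """Chains the three stages; every component is a hypothetical stand-in."""
    translate_to_semantic: Callable[[Wave], Units]   # S1: source audio -> target semantic units
    encode_acoustic: Callable[[Wave], Units]         # codec encoder producing the style prompt
    acoustic_lm: Callable[[Units, Units], Units]     # S2: (semantic, prompt) -> target acoustic units
    units_to_wave: Callable[[Units], Wave]           # S3: acoustic units -> waveform

    def __call__(self, source_wave: Wave) -> Wave:
        semantic = self.translate_to_semantic(source_wave)
        style_prompt = self.encode_acoustic(source_wave)
        target_acoustic = self.acoustic_lm(semantic, style_prompt)
        return self.units_to_wave(target_acoustic)


# Toy stand-ins so the sketch runs end to end; they carry no real modelling.
pipeline = StyleS2STPipeline(
    translate_to_semantic=lambda wave: [len(wave), len(wave) + 1],
    encode_acoustic=lambda wave: [int(x) for x in wave],
    acoustic_lm=lambda semantic, prompt: prompt[:1] + semantic,
    units_to_wave=lambda units: [float(u) for u in units],
)
translated = pipeline([0.0, 1.0, 2.0, 3.0])
```

Keeping the stages behind plain callables mirrors the modular design described above: each stage can be swapped without touching the others.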

S2 is trained in a self-supervised manner: the first three seconds of each audio sample are cut off to serve as the prompt, and the acoustic language model is trained with a cross-entropy loss to predict the acoustic units of the remaining part, conditioned on that part's semantic units and on the prompt's acoustic units. This in-context learning approach enables the model to grasp the correspondence in acoustic characteristics between the two parts and thereby acquire style-transfer ability. During inference, we condition on the semantic units from the previous stage and use the acoustic units of the source speech as the style prompt to realize cross-lingual style transfer.
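
As a rough illustration of how one training example for S2 might be assembled, the sketch below splits a single utterance into a three-second prompt and a remainder. Several things here are assumptions made for the example only: the names `S2Example` and `make_s2_example`, the frame rates passed in as arguments, and the simplification of acoustic units to a single flat sequence (codec units are often organised in several codebooks). The `cross_entropy` helper shows the loss computed over predicted distributions for the target acoustic units.

```python
import math
from typing import List, NamedTuple, Sequence


class S2Example(NamedTuple):
    prompt_acoustic: List[int]   # acoustic units of the first few seconds (style prompt)
    semantic: List[int]          # semantic units of the remaining part (content condition)
    target_acoustic: List[int]   # acoustic units of the remaining part (prediction target)


def make_s2_example(
    semantic_units: Sequence[int],
    acoustic_units: Sequence[int],
    semantic_rate_hz: float,     # assumed frames per second of the semantic units
    acoustic_rate_hz: float,     # assumed frames per second of the acoustic units
    prompt_seconds: float = 3.0,
) -> S2Example:
    """Split one utterance into (prompt, condition, target) for in-context training."""
    n_sem = int(prompt_seconds * semantic_rate_hz)
    n_ac = int(prompt_seconds * acoustic_rate_hz)
    if len(semantic_units) <= n_sem or len(acoustic_units) <= n_ac:
        raise ValueError("utterance is not longer than the prompt")
    return S2Example(
        prompt_acoustic=list(acoustic_units[:n_ac]),
        semantic=list(semantic_units[n_sem:]),
        target_acoustic=list(acoustic_units[n_ac:]),
    )


def cross_entropy(logits: Sequence[Sequence[float]], targets: Sequence[int]) -> float:
    """Mean negative log-likelihood of the target units under softmax(logits)."""
    total = 0.0
    for row, target in zip(logits, targets):
        peak = max(row)  # subtract the max for numerical stability
        log_z = peak + math.log(sum(math.exp(v - peak) for v in row))
        total += log_z - row[target]
    return total / len(targets)


# Toy 5-second utterance at made-up rates of 2 Hz (semantic) and 4 Hz (acoustic).
example = make_s2_example(list(range(10)), list(range(20)), 2.0, 4.0)
uniform_loss = cross_entropy([[0.0, 0.0], [0.0, 0.0]], [0, 1])
```

Because the prompt and the target come from the same utterance, no speaker-parallel data is needed; at inference the prompt is simply replaced by the acoustic units of the source speech.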

Table of Contents

  • Translation on Es-En
  • Translation on Fr-En

    Translation on Es-En

    In this section, we provide results on Es-En translation.

    Columns: Source | CVSS-T target | S2UT | Ours | S2UT + PPG-VC | S2UT + NANSY | S2UT + YourTTS

    Sample 1
      Ref Text
        Source: Dependiendo del tipo de material se conserva en la Biblioteca o el Archivo.
        CVSS-T target: depending on the type of material it is kept in the library or in the archive
      ASR Result
        S2UT: depending on the type of material it is preserved in the library or the archive
        Ours: depending on the type of material it is preserved in the library or the archive
        S2UT + PPG-VC: depending on the type of material it is preserved in the library or the archive
        S2UT + NANSY: depending on the type of material it is preserved in the library or the archive
        S2UT + YourTTS: depending on the type of material it is preserved in the library or the archive

    Sample 2
      Ref Text
        Source: Los ocho números de la revista fueron publicados por Fantasy Publishing Company, Inc.
        CVSS-T target: the eight volumes of the magazine were published by fantasy publishing company inc
      ASR Result
        S2UT: the eight numbers of the magazines were published by fentasy publishing company incorporated
        Ours: the eight numbers of the magazines were published by fantasy publishing company incorporated
        S2UT + PPG-VC: the eight numbers of the magazines were published by fantasy publishing company incorporated
        S2UT + NANSY: the eight numbers of the magazines were published by fantasy publishing company incorporated
        S2UT + YourTTS: the eight numbers of the magazines were published by fantasy publishing company incorporated

    Sample 3
      Ref Text
        Source: Cantacuceno casó a su hija Helena con el joven emperador para sellar el acuerdo.
        CVSS-T target: kantakouzenos married his daughter helena with the young emperor in order to seal the deal
      ASR Result
        S2UT: canta cosena marry his daughter helena with the young emperor to save the agreement
        Ours: cante poseno marrid his daughter helena with the young emperor to save the agreement
        S2UT + PPG-VC: canto caina marry his daughter elena with the young emperor to save the agreement
        S2UT + NANSY: cantacosena married his daughter helena with the young emperor to save the agreement
        S2UT + YourTTS: cantacusena married his daughter helena with the young emperor to say the agreement

    Sample 4
      Ref Text
        Source: Lideró una rebelión contra Pierre Nord Alexis y lo sucedió como presidente.
        CVSS-T target: he led a rebellion against pierre nord alexis and succeeded him as president
      ASR Result
        S2UT: he lead a rebellion against pierre nod alexis and succeeded as president
        Ours: he lead a rebellion against biernad alexis and succeeded as president
        S2UT + PPG-VC: he laid a rebellion against pierre nod alexis and succeeded as president
        S2UT + NANSY: he laid a rebellion against pierr nod alexis and succeeded as president
        S2UT + YourTTS: he laid a rebellion against piernod alexis and succeeded as president

    Translation on Fr-En

    In this section, we provide results on Fr-En translation.

    Columns: Source | CVSS-T target | S2UT | Ours | S2UT + PPG-VC | S2UT + NANSY | S2UT + YourTTS

    Sample 1
      Ref Text
        Source: Il est situé au sud-est de l'île, à quelques kilomètres de Joao Barrosa.
        CVSS-T target: it is located on the southeast part of the island several kilometers from joao barrosa
      ASR Result
        S2UT: it is located in the southeast of the island wich a few kilometers of gelboros
        Ours: it is located in the south east of the island with a few kilometers of jal borrows
        S2UT + PPG-VC: it is located in the southeast of the island which a few kilometers of gelbros
        S2UT + NANSY: it is located in the southeast of the island with a few kilometers of gelborougs
        S2UT + YourTTS: it is located in the southeast of the island with a few kilometers of jobros

    Sample 2
      Ref Text
        Source: Chladni est le fondateur de l'acoustique moderne.
        CVSS-T target: chladni is the founder of modern acoustics
      ASR Result
        S2UT: cadne is the founder of modern acoustics
        Ours: cadney is the founder of modern acoustics
        S2UT + PPG-VC: cadne is the founder of modern equistics
        S2UT + NANSY: cadme is the founder of modern acoustics
        S2UT + YourTTS: cadne is the founder of modern acoustics

    Sample 3
      Ref Text
        Source: Depuis sa sortie, les expériences avec cet album sont très positives pour le groupe.
        CVSS-T target: since its release the experiences with this album has been very positive for the band
      ASR Result
        S2UT: since its release the experiences with this album are very positive for the band
        Ours: since this release the experiences with this album are very positive for the band
        S2UT + PPG-VC: since its release the experiences with this elblem are very positive for the band
        S2UT + NANSY: since its release the experiences with this album are very positive for the band
        S2UT + YourTTS: since its release the experiences with this album are very causative for the band

    Sample 4
      Ref Text
        Source: Il s’agit donc d’une séquence continue de diminution du déficit nominal.
        CVSS-T target: it is therefore a continuous sequence of decreasing the nominal deficit
      ASR Result
        S2UT: this is a continuous sequence of decrease with a nominal deficit
        Ours: this is a continuous sequence of decreased with the nominal deficit
        S2UT + PPG-VC: this is a continuous sequence of decrease with the nominal deficit
        S2UT + NANSY: this is a continuous sequence of decreased with the nominal deficit
        S2UT + YourTTS: this is a continuous sequence of decrease to the nominal deficit
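
Readers who want to compare an ASR transcript with the target reference text numerically could use a word-level edit distance. The sketch below is an assumption for illustration only: this page does not state which metric was used, and the function name `word_error_rate` is hypothetical. It counts the minimum number of word substitutions, insertions and deletions needed to turn the reference into the transcript, divided by the reference length.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the distance between the reference prefix seen so far and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        cur = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cur.append(min(
                prev[j] + 1,                          # deletion of a reference word
                cur[j - 1] + 1,                       # insertion of a hypothesis word
                prev[j - 1] + (ref_word != hyp_word), # substitution or match
            ))
        prev = cur
    return prev[-1] / len(ref)


# One Fr-En sample from this page: CVSS-T target text vs. one ASR transcript.
wer = word_error_rate(
    "chladni is the founder of modern acoustics",
    "cadne is the founder of modern acoustics",
)
```

Here a single misrecognised proper noun accounts for the whole error, which is typical of the samples on this page: most differences between systems fall on names rather than on the translated content.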