Voice Cloning with Chatterbox

Introduction

Chatterbox is a free, open-source text-to-speech (TTS) engine that runs locally on your own computer and turns any text into lifelike audio. It was created by Resemble AI, who claim that Chatterbox even outperforms ElevenLabs' TTS models. ^[1]

Example

A Beginning is the time for taking the most delicate care that the balances are correct. This every sister of the Bene Gesserit knows. To begin your study of the life of Muad’Dib, then, take care that you first place him in his time: born in the 57th year of the Padishah Emperor, Shaddam IV.

Source: first paragraph of Frank Herbert's Dune

Lively female voice:

Deep male voice:

What can it do?

Zero-shot voice cloning: capture anyone’s voice from just a few seconds of reference audio.

Exaggeration control: dial the expressiveness up or down for a more animated or subtle delivery.

Pacing adjustment: speed up or slow down the speech to fit the desired rhythm.

Free-text synthesis: type any words you like and the cloned voice will speak them aloud.

Install

Installing Chatterbox

Info: If your hardware isn't powerful enough, you can run Chatterbox for free on Google Colab.
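For readers who go the Colab route (or any plain Python environment without ComfyUI), the following is a minimal sketch of calling Chatterbox directly. It assumes the upstream `chatterbox-tts` package and `torchaudio` are installed, and that the `ChatterboxTTS.from_pretrained` and `generate` calls behave as in the project's README. The helper names `build_generate_kwargs` and `synthesize` are hypothetical and exist only for this illustration.

```python
from pathlib import Path


def build_generate_kwargs(reference_wav=None, exaggeration=0.5, cfg_weight=0.5):
    """Collect keyword arguments for the model's generate() call."""
    kwargs = {"exaggeration": exaggeration, "cfg_weight": cfg_weight}
    if reference_wav is not None:
        # Path to a short clip of the voice to clone.
        kwargs["audio_prompt_path"] = str(Path(reference_wav))
    return kwargs


def synthesize(text, out_path, device="cuda", **generate_kwargs):
    """Generate speech for `text` and save it as a WAV file.

    The heavy imports are deferred so that this file can be loaded
    even when the Chatterbox dependencies are not installed yet.
    """
    import torchaudio
    from chatterbox.tts import ChatterboxTTS  # assumed upstream import path

    model = ChatterboxTTS.from_pretrained(device=device)
    wav = model.generate(text, **generate_kwargs)
    torchaudio.save(str(out_path), wav, model.sr)
    return Path(out_path)
```

A call might look like `synthesize("Hello there.", "out.wav", **build_generate_kwargs("voice.wav"))`. On a machine without a GPU, passing `device="cpu"` should work but will be slower.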

Installing ComfyUI

Visit the official ComfyUI website and install it. ComfyUI is a graphical user interface designed for Stable Diffusion workflows. It uses a node-based approach that lets you build a workflow visually. Although Chatterbox isn't a Stable Diffusion model, ComfyUI can still run it through a custom node.

Installing the Chatterbox custom node

Go to the Custom Nodes Manager, search for ComfyUI_Fill-ChatterBox, and let it install. When it's done, restart ComfyUI.

Setting up the nodes

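Open the Node Library (shortcut: n), click the Chatterbox folder, then FL Chatterbox TTS, and place that node on your canvas. Then go to the audio folder and add the LoadAudio and PreviewAudio nodes. Connect them into this basic set-up: LoadAudio feeds the reference audio into FL Chatterbox TTS, and the audio output of FL Chatterbox TTS goes into PreviewAudio.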
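The three-node set-up can also be written down as data, which may help if you want to keep it under version control. The sketch below builds a dictionary in ComfyUI's API workflow format, where a link is written as `[source_node_id, output_index]`. `LoadAudio` and `PreviewAudio` are core ComfyUI nodes. The class name `FL_ChatterboxTTS` and its input names `text` and `audio_prompt` are assumptions about the custom node, so compare them with a workflow exported from your own installation before relying on them.

```python
import json


def build_workflow(reference_filename, text):
    """Return a three-node graph: LoadAudio -> Chatterbox TTS -> PreviewAudio."""
    return {
        "1": {
            "class_type": "LoadAudio",
            "inputs": {"audio": reference_filename},
        },
        "2": {
            # Assumed class and input names for the Chatterbox custom node.
            "class_type": "FL_ChatterboxTTS",
            "inputs": {
                "text": text,
                "audio_prompt": ["1", 0],  # output 0 of node "1"
            },
        },
        "3": {
            "class_type": "PreviewAudio",
            "inputs": {"audio": ["2", 0]},  # output 0 of node "2"
        },
    }


if __name__ == "__main__":
    print(json.dumps(build_workflow("voice.wav", "Hello there."), indent=2))
```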

Generating audio

Go to the LoadAudio node and upload an audio sample of a voice you like. Enter the sentence you want spoken in the text field of the FL Chatterbox TTS node, then run the workflow. After a few minutes you can listen to the result in the PreviewAudio node.
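Since cloning works from a few seconds of reference audio, it can be handy to check a clip's length before uploading it. The sketch below uses only Python's standard `wave` module, so it handles uncompressed PCM WAV files only. The 3-second minimum is an assumed, illustrative threshold and not a documented Chatterbox requirement. Both function names are hypothetical.

```python
import wave


def wav_duration_seconds(path):
    """Return the duration of a PCM WAV file in seconds."""
    with wave.open(str(path), "rb") as wav_file:
        return wav_file.getnframes() / wav_file.getframerate()


def is_usable_reference(path, min_seconds=3.0):
    """Check that the clip is at least `min_seconds` long (assumed threshold)."""
    return wav_duration_seconds(path) >= min_seconds
```

For example, `is_usable_reference("voice.wav")` returns `False` for a one-second clip, which suggests recording a longer sample first.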

Info: I recommend setting keep_model_loaded to true if your hardware is powerful enough. This reduces generation time when you run the model again.

Parameters

exaggeration: how expressive or dramatic the output will be. Don't set this too high.
temperature: controls how much the output is allowed to differ from the original voice.
cfg_weight: controls the pacing of the output.
use_cpu: runs the model on the CPU instead of the GPU. Enable this if you have weaker hardware.
keep_model_loaded: keeps the model loaded in VRAM. Enable this if you have a powerful GPU.

Recommended settings

General use

The default settings (exaggeration=0.5, cfg_weight=0.5) work well for most prompts.^[2]
If the reference speaker has a fast speaking style, lowering cfg_weight to around 0.3 can improve pacing.

Expressive or Dramatic Speech

Try lower cfg_weight values (e.g. ~0.3) and increase exaggeration to around 0.7 or higher.
Higher exaggeration tends to speed up speech; reducing cfg_weight helps compensate with slower, more deliberate pacing.
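The recommendations above can be summarized as a small lookup helper. This is only a sketch: the function name `suggest_settings` and its two flags are hypothetical, and the values are the starting points quoted above, not fixed rules.

```python
DEFAULTS = {"exaggeration": 0.5, "cfg_weight": 0.5}


def suggest_settings(fast_speaker=False, expressive=False):
    """Return starting values for exaggeration and cfg_weight."""
    settings = dict(DEFAULTS)
    if fast_speaker or expressive:
        # A lower cfg_weight slows the pacing down.
        settings["cfg_weight"] = 0.3
    if expressive:
        # Higher exaggeration gives a more dramatic delivery.
        settings["exaggeration"] = 0.7
    return settings
```

Calling `suggest_settings(expressive=True)` gives the dramatic-speech starting point. From there, adjust exaggeration upward by ear.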