
Voice Cloning with Chatterbox

Learn how to clone voices with Chatterbox, a leading open-source voice cloning model from Resemble AI.

By Shento Hendriks

Introduction

Chatterbox is a free, open-source text-to-speech (TTS) engine that runs locally on your own computer and turns any text into lifelike audio. It was created by Resemble AI, which claims that Chatterbox even beats ElevenLabs' TTS models. [1]

Example

A Beginning is the time for taking the most delicate care that the balances are correct. This every sister of the Bene Gesserit knows. To begin your study of the life of Muad’Dib, then, take care that you first place him in his time: born in the 57th year of the Padishah Emperor, Shaddam IV.

First paragraph of Frank Herbert's Dune

Lively female voice:

Deep male voice:

What can it do?

  • Zero-shot voice cloning: capture anyone’s voice from just a few seconds of reference audio.
  • Exaggeration control: dial the expressiveness up or down for a more animated or subtle delivery.
  • Pacing adjustment: speed up or slow down the speech to fit the desired rhythm.
  • Free-text synthesis: type any words you like and the cloned voice will speak them aloud.

Install

Installing Chatterbox

Info: If your hardware is not powerful enough, you can run Chatterbox for free on Google Colab.

1. Installing ComfyUI

Visit the official ComfyUI website and install it. ComfyUI is a graphical user interface built for Stable Diffusion workflows; its node-based approach lets you construct a workflow visually. Although Chatterbox isn't a Stable Diffusion model, ComfyUI can still run it through custom nodes.

Interface of ComfyUI

2. Installing Chatterbox

Custom Nodes Manager button

Go to the Custom Nodes Manager, search for ComfyUI_Fill-ChatterBox, and install it. When the installation is done, restart ComfyUI.

3. Setting up the nodes

Node Library Icon

Open the Node Library (press n), click the Chatterbox folder, then FL Chatterbox TTS, and place that node on your canvas. Then go to the audio folder and add the LoadAudio and PreviewAudio nodes. Copy the basic setup shown in the image below.

Interface of ComfyUI

4. Generating audio

Go to the LoadAudio node and upload an audio sample of a voice you like. Enter the sentence you want it to say in the FL Chatterbox TTS node and run the workflow. After a few minutes you can listen to the result in the PreviewAudio node.
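If you would rather use a script than the node graph, the same three steps (load a reference voice, enter text, listen to the result) can be sketched in Python. This is a minimal sketch, assuming Resemble AI's chatterbox-tts package and torchaudio are installed. The clone_voice helper, the file names, and the default device are my own illustrative choices and are not part of the ComfyUI nodes.

```python
from pathlib import Path


def clone_voice(text: str, reference_wav: str,
                out_path: str = "cloned.wav", device: str = "cuda") -> Path:
    """Speak `text` in the voice heard in `reference_wav` and save a WAV file.

    Hypothetical helper for illustration; it wraps the chatterbox-tts package.
    """
    # Cheap input checks first, so mistakes surface before the model loads.
    if not text.strip():
        raise ValueError("text must not be empty")
    if not Path(reference_wav).is_file():
        raise FileNotFoundError(reference_wav)

    # Imported lazily because these packages are large and slow to load.
    import torchaudio
    from chatterbox.tts import ChatterboxTTS

    # The first call downloads the model weights; use device="cpu" without a GPU.
    model = ChatterboxTTS.from_pretrained(device=device)

    # The reference clip plays the role of the LoadAudio node.
    wav = model.generate(text, audio_prompt_path=reference_wav)

    # Saving the file replaces the PreviewAudio node.
    torchaudio.save(out_path, wav, model.sr)
    return Path(out_path)
```

Calling clone_voice("Hello there.", "my_voice_sample.wav") would write cloned.wav next to your script, assuming my_voice_sample.wav exists.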

Info: I recommend setting keep_model_loaded to true if your hardware is beefy enough. This reduces generation time when you run the model again.

Parameters

  • exaggeration: how expressive or dramatic the output will be. Don't set this too high.
  • temperature: controls how much the output is allowed to differ from the original voice.
  • cfg_weight: controls the pacing of the output.
  • use_cpu: runs the model on the CPU. Enable this if you have weaker hardware.
  • keep_model_loaded: keeps the model loaded in VRAM. Enable this if you have a powerful GPU.

General use

  • The default settings (exaggeration=0.5, cfg_weight=0.5) work well for most prompts.[2]
  • If the reference speaker has a fast speaking style, lowering cfg_weight to around 0.3 can improve pacing.

Expressive or Dramatic Speech

  • Try lower cfg_weight values (e.g. ~0.3) and increase exaggeration to around 0.7 or higher.
  • Higher exaggeration tends to speed up speech; reducing cfg_weight helps compensate with slower, more deliberate pacing.
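The tips above can be collected into a small preset table so you don't have to remember the numbers. This is a minimal sketch: the preset names and the settings_for helper are hypothetical, and only the values come from the tips above (0.7 is the suggested starting point for expressive speech, and you can go higher).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ChatterboxSettings:
    """The two values to tune; the defaults match the recommended 0.5 / 0.5."""
    exaggeration: float = 0.5
    cfg_weight: float = 0.5


# Hypothetical preset names; the numbers are taken from the tips above.
PRESETS = {
    "default": ChatterboxSettings(),
    # Fast-talking reference speaker: a lower cfg_weight improves pacing.
    "fast_speaker": ChatterboxSettings(cfg_weight=0.3),
    # Dramatic delivery: raise exaggeration and lower cfg_weight to compensate
    # for the faster speech that higher exaggeration tends to cause.
    "expressive": ChatterboxSettings(exaggeration=0.7, cfg_weight=0.3),
}


def settings_for(style: str) -> ChatterboxSettings:
    """Return the preset for `style`, or raise ValueError for unknown names."""
    try:
        return PRESETS[style]
    except KeyError:
        known = ", ".join(sorted(PRESETS))
        raise ValueError(f"unknown style {style!r}; choose one of: {known}") from None


if __name__ == "__main__":
    for name in PRESETS:
        print(name, settings_for(name))
```

Type the chosen values into the exaggeration and cfg_weight fields of the FL Chatterbox TTS node.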