Introduction
Chatterbox is a free, open-source text-to-speech (TTS) engine that runs locally on your own computer and turns any text into lifelike audio. It was created by Resemble AI, which claims that Chatterbox even outperforms ElevenLabs' TTS models. [1]
Example
A Beginning is the time for taking the most delicate care that the balances are correct. This every sister of the Bene Gesserit knows. To begin your study of the life of Muad’Dib, then, take care that you first place him in his time: born in the 57th year of the Padishah Emperor, Shaddam IV.
— First paragraph of Frank Herbert's Dune
Lively female voice:
Deep male voice:
What can it do?
Install
Installing Chatterbox
info
If you do not have powerful enough hardware, then you can run it for free using Google Colab.
Visit the official ComfyUI website and install it. ComfyUI is a graphical user interface designed for Stable Diffusion workflows; its node-based approach lets you build a workflow visually. Although Chatterbox isn't a Stable Diffusion model, ComfyUI can still run it.




Go to the Custom Nodes Manager, search for ComfyUI_Fill-ChatterBox, and let it install. When it's done, restart ComfyUI.


Open the Node Library using (n), click on the Chatterbox folder, then FL Chatterbox TTS, and place that node on your canvas. Then go to the audio folder and add the LoadAudio and PreviewAudio nodes. Copy this basic set-up.


Go to the LoadAudio node and upload an audio sample of a voice you like. Enter the sentence you want it to say. After a few minutes you can listen to the result at the PreviewAudio node.
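The same steps (load a voice sample, enter text, listen to the result) can also be run from a script instead of the node graph. The following is a minimal sketch, assuming the chatterbox-tts Python package from Resemble AI and torchaudio are installed. The helper names build_generate_kwargs and synthesize, and the file names voice_sample.wav and output.wav, are placeholders made up for this illustration.

```python
from pathlib import Path


def build_generate_kwargs(voice_sample=None, exaggeration=0.5, cfg_weight=0.5):
    """Collect the keyword arguments handed to ChatterboxTTS.generate()."""
    kwargs = {"exaggeration": exaggeration, "cfg_weight": cfg_weight}
    if voice_sample is not None:
        # Reference clip whose voice should be imitated (the LoadAudio step).
        kwargs["audio_prompt_path"] = str(Path(voice_sample))
    return kwargs


def synthesize(text, out_path="output.wav", device="cuda", **settings):
    """Generate speech for `text` and write it to `out_path`."""
    # Imported lazily so the helper above works without the heavy dependencies.
    import torchaudio
    from chatterbox.tts import ChatterboxTTS

    # Assumption: a CUDA GPU is available; pass device="cpu" otherwise.
    model = ChatterboxTTS.from_pretrained(device=device)
    wav = model.generate(text, **build_generate_kwargs(**settings))
    torchaudio.save(out_path, wav, model.sr)  # the PreviewAudio step
    return out_path


# Example call (downloads the model weights on first run):
# synthesize("A beginning is the time for taking the most delicate care.",
#            voice_sample="voice_sample.wav")
```

The imports sit inside synthesize so that the settings helper can be reused or checked on a machine that does not have the model installed.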
info
I recommend setting keep_model_loaded to true if your hardware is powerful enough. This reduces generation times when you run the model again.
Parameters
- exaggeration: how expressive or dramatic the output will be. Don't set this too high.
- temperature: controls how much the output is allowed to differ from the original voice.
- cfg_weight: controls the pacing of the output.
- use_cpu: enable this if you've got weaker hardware.
- keep_model_loaded: keeps the model loaded in your VRAM. Enable this if you have a powerful GPU.
Recommended settings
General use
- The default settings (exaggeration=0.5, cfg_weight=0.5) work well for most prompts. [2]
- If the reference speaker has a fast speaking style, lowering cfg_weight to around 0.3 can improve pacing.
Expressive or Dramatic Speech
- Try lower cfg_weight values (e.g. ~0.3) and increase exaggeration to around 0.7 or higher.
- Higher exaggeration tends to speed up speech; reducing cfg_weight helps compensate with slower, more deliberate pacing.
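The recommendations above can be collected into a small helper. This is only a sketch: the function name recommended_settings and the style labels "general" and "expressive" are made up for illustration, and the numbers are the starting points quoted above, not tuned values.

```python
def recommended_settings(style="general", fast_speaker=False):
    """Return starting values for exaggeration and cfg_weight."""
    if style == "general":
        settings = {"exaggeration": 0.5, "cfg_weight": 0.5}
        if fast_speaker:
            # A fast-talking reference voice: lower cfg_weight for better pacing.
            settings["cfg_weight"] = 0.3
    elif style == "expressive":
        # More drama, with a lower cfg_weight to offset the faster speech.
        settings = {"exaggeration": 0.7, "cfg_weight": 0.3}
    else:
        raise ValueError(f"unknown style: {style!r}")
    return settings


print(recommended_settings())
print(recommended_settings(fast_speaker=True))
print(recommended_settings("expressive"))
```

Treat the returned values as a first guess and adjust them by ear for each reference voice.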