Introduction
Chatterbox is a free, open-source text-to-speech (TTS) engine that runs locally on your own computer and turns any text into lifelike audio. It was created by Resemble AI, which claims that Chatterbox even outperforms ElevenLabs' TTS models. [1]
Example
A Beginning is the time for taking the most delicate care that the balances are correct. This every sister of the Bene Gesserit knows. To begin your study of the life of Muad’Dib, then, take care that you first place him in his time: born in the 57th year of the Padishah Emperor, Shaddam IV.
First paragraph of Frank Herbert's Dune
Lively female voice:
Deep male voice:
What can it do?
Install
Installing Chatterbox
info
If your hardware is not powerful enough, you can run Chatterbox for free on Google Colab.
Visit the official ComfyUI website and install ComfyUI. ComfyUI is a graphical user interface designed for Stable Diffusion workflows. It uses a node-based approach that lets you build a workflow visually. Although Chatterbox isn't a Stable Diffusion model, ComfyUI can still run it.




Go to the Custom Nodes Manager, search for ComfyUI_Fill-ChatterBox, and install it. When the installation is done, restart ComfyUI.


Open the Node Library (press n), click the Chatterbox folder, then FL Chatterbox TTS, and place that node on your canvas. Then go to the audio folder and add the LoadAudio and PreviewAudio nodes. Copy this basic set-up.


In the LoadAudio node, upload an audio sample of a voice you like. Enter the sentence you want it to say. After a few minutes, you can listen to the result in the PreviewAudio node.
info
I recommend setting keep_model_loaded to true if your hardware is powerful enough. This reduces generation time when you run the model again.
Parameters
- exaggeration: how expressive or dramatic the output will be. Don't set this too high.
- temperature: controls how much the output is allowed to differ from the original voice.
- cfg_weight: controls the pacing of the output.
- use_cpu: enable this if you have weaker hardware.
- keep_model_loaded: keeps the model loaded in your VRAM. Enable this if you have a powerful GPU.
Recommended settings
General use
- The default settings (exaggeration=0.5, cfg_weight=0.5) work well for most prompts. [2]
- If the reference speaker has a fast speaking style, lowering cfg_weight to around 0.3 can improve pacing.
Expressive or Dramatic Speech
- Try lower cfg_weight values (e.g. ~0.3) and increase exaggeration to around 0.7 or higher.
- Higher exaggeration tends to speed up speech; reducing cfg_weight helps compensate with slower, more deliberate pacing.
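The recommendations above can be captured as named presets. The sketch below is only an illustration: the preset names, VoiceSettings, settings_for, and synthesize are hypothetical helpers, and synthesize assumes you have the chatterbox-tts Python package and torchaudio installed instead of (or alongside) the ComfyUI node.

```python
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class VoiceSettings:
    exaggeration: float = 0.5  # default recommended for most prompts
    cfg_weight: float = 0.5    # default recommended for most prompts


# Preset names are illustrative; the values come from the recommendations above.
PRESETS = {
    "general": VoiceSettings(),
    "fast_speaker": VoiceSettings(cfg_weight=0.3),
    "expressive": VoiceSettings(exaggeration=0.7, cfg_weight=0.3),
}


def settings_for(style: str) -> dict:
    """Return the settings of a named preset as keyword arguments."""
    if style not in PRESETS:
        raise ValueError(f"unknown style {style!r}; choose from {sorted(PRESETS)}")
    return asdict(PRESETS[style])


def synthesize(text: str, voice_sample_path: str, style: str = "general",
               out_path: str = "output.wav", device: str = "cuda") -> str:
    """Generate speech with a preset and save it as a WAV file.

    Assumption: chatterbox-tts and torchaudio are installed. They are
    imported here so the presets can be used without them.
    """
    import torchaudio
    from chatterbox.tts import ChatterboxTTS

    model = ChatterboxTTS.from_pretrained(device=device)
    wav = model.generate(text, audio_prompt_path=voice_sample_path,
                         **settings_for(style))
    torchaudio.save(out_path, wav, model.sr)
    return out_path
```

For example, settings_for("expressive") gives the lower cfg_weight and higher exaggeration suggested for dramatic speech. In ComfyUI, you would enter the same values directly on the FL Chatterbox TTS node.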