Show HN: Three new Kitten TTS models – smallest less than 25MB

534 points - last Thursday at 3:56 PM


Kitten TTS (https://github.com/KittenML/KittenTTS) is an open-source series of tiny and expressive text-to-speech models for on-device applications. We had a thread last year here: https://news.ycombinator.com/item?id=44807868.

Today we're releasing three new models with 80M, 40M and 14M parameters.

The largest model (80M) has the highest quality. The 14M variant reaches a new SOTA in expressivity among similar-sized models, despite being <25MB in size. This release is a major upgrade from the previous one and supports English text-to-speech applications in eight voices: four male and four female.

Here's a short demo: https://www.youtube.com/watch?v=ge3u5qblqZA.

Most models are quantized to int8 + fp16, and they run on ONNX Runtime. Our models are designed to run anywhere, e.g. Raspberry Pi, low-end smartphones, wearables, browsers, etc. No GPU required! This release aims to bridge the gap between on-device and cloud models for TTS applications. A multilingual model release is coming soon.
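
The int8 quantization mentioned here can be illustrated with a minimal symmetric per-tensor scheme: store weights as small integers plus one float scale, and multiply back at inference time. This is a sketch for intuition only, not Kitten TTS's actual quantizer:

```python
def quantize_int8(weights):
    """Symmetric per-tensor int8 quantization: map floats into [-127, 127]."""
    scale = max(abs(w) for w in weights) / 127.0
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    """Recover approximate float weights from the int8 values."""
    return [qi * scale for qi in q]

weights = [0.5, -1.27, 0.003, 1.0]
q, scale = quantize_int8(weights)
approx = dequantize(q, scale)
# Each recovered weight is within one quantization step (= scale) of the
# original, while storage drops from 4 bytes to 1 byte per weight.
```

The size win is the point: 8-bit integers are a quarter of fp32, which is how a model with tens of millions of parameters fits in tens of megabytes.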

On-device AI is bottlenecked by one thing: a lack of tiny models that actually perform. Our goal is to open-source more models to run production-ready voice agents and apps entirely on-device.

We would love your feedback!

Comments

dawdler-purge yesterday at 12:04 AM
I created a CLI wrapper for Kitten TTS: https://github.com/newptcai/purr

BTW, it seems that kitten (the Python package) has the following chain of dependencies: kittentts → misaki[en] → spacy-curated-transformers

So if you install it directly via uv, it will pull torch and NVIDIA CUDA packages (several GB), which are not needed to run kitten.
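
One possible workaround, sketched here on the assumption that the CUDA wheels come in only transitively via torch: pre-install a CPU-only torch build so the resolver never selects the CUDA variant. The index URL is PyTorch's official CPU wheel index; the `kittentts` package name is the one shown elsewhere in this thread.

```shell
# Pre-install a CPU-only torch so the resolver won't pull multi-GB CUDA wheels.
uv pip install torch --index-url https://download.pytorch.org/whl/cpu

# Then install kittentts; its torch requirement is already satisfied.
uv pip install kittentts
```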

kevin42 last Thursday at 4:40 PM
What I love about OpenClaw is that I was able to send it a message on Discord with just this github URL and it started sending me voice messages using it within a few minutes. It also gave me a bunch of different benchmarks and sample audio.

I'm impressed with the quality given the size. I don't love the voices, but it's not bad. Running on an intel 9700 CPU, it's about 1.5x realtime using the 80M model. It wasn't any faster running on a 3080 GPU though.

g58892881 yesterday at 4:31 PM
I created a demo running in the browser, on your device: https://next-voice.vercel.app
__fst__ last Thursday at 10:29 PM
Was playing around a bit and for its size it's very impressive. It just has issues pronouncing numbers. I tried to let it generate "Startup finished in 135 ms."

I didn't expect it to pronounce 'ms' correctly, but the number sounded just like noise. Eventually I got an acceptable result for the string "Startup finished in one hundred and thirty five seconds."
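
A common workaround is to expand digits to words before handing text to the model. A minimal sketch of that preprocessing step (the helpers below are hypothetical; real pipelines typically use a library such as num2words, covering only 0-999 here for illustration):

```python
import re

ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven",
        "eight", "nine", "ten", "eleven", "twelve", "thirteen", "fourteen",
        "fifteen", "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy",
        "eighty", "ninety"]

def spell_number(n: int) -> str:
    """Spell out an integer in the range 0-999."""
    if n < 20:
        return ONES[n]
    if n < 100:
        return TENS[n // 10] + ("-" + ONES[n % 10] if n % 10 else "")
    hundreds = ONES[n // 100] + " hundred"
    return hundreds + (" and " + spell_number(n % 100) if n % 100 else "")

def normalize_numbers(text: str) -> str:
    """Replace standalone 1-3 digit numbers with their spelled-out form."""
    return re.sub(r"\b\d{1,3}\b", lambda m: spell_number(int(m.group())), text)

print(normalize_numbers("Startup finished in 135 ms."))
# -> Startup finished in one hundred and thirty-five ms.
```

Small TTS models in particular tend to have thin coverage of digit strings in training, so doing this normalization in text space is usually more reliable than hoping the model verbalizes numbers itself.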

daneel_w last Thursday at 8:12 PM
A very clear improvement from the first set of models you released some time ago. I'm really impressed. Thanks for sharing it all.
geokon yesterday at 6:56 AM
Very cool :) Look forward to trying it out

Maybe a dumb and slightly tangential question (I don't mean this as a criticism!), but why not release a command-line executable?

Even the API looks like what you'd see in a manpage.

I get that it wouldn't be too much work for a user to actually make something like that; I'm just curious what the thought process is.

ks2048 last Thursday at 4:46 PM
You should put up examples comparing the 4 models you released: the same text spoken by each.
_hzw last Thursday at 11:49 PM
I'd love to see a monolingual Japanese model sometime in the future. Qwen3-TTS works for Japanese in general, but from time to time it mixes some Mandarin in, making it unusable.
jacquesm yesterday at 2:36 AM
Good on-device TTS is an amazing accessibility tool. Thank you for building this. Way too many of the devices that use it rely on online services; this is much preferred.
nsnzjznzbx last Thursday at 8:47 PM
They sound like cartoon voices... but I really like them. I could listen to a book with those.
PunchyHamster last Thursday at 10:37 PM
I ran the install instructions and it took 7.1GB of deps, tf you mean "tiny"?
bobokaytop yesterday at 8:11 AM
The size/quality tradeoff here is interesting. 25MB for a TTS model that's usable is a real achievement, but the practical bottleneck for most edge deployments isn't model size -- it's the inference latency on low-power hardware and the audio streaming architecture around it. Curious how this performs on something like a Raspberry Pi 4 for real-time synthesis. The voice quality tradeoff at that size usually shows up most in prosody and sentence-final intonation rather than phoneme accuracy.
altruios last Thursday at 4:39 PM
One of the core features I look for is expressive control.

Either in the form of the api via pitch/speed/volume controls, for more deterministic controls.

Or in expressive tags such as [coughs], [urgently], or [laughs in melodic ascending and descending arpeggiated gibberish babbles].

the 25MB model is amazingly good for being 25MB. How does it handle expressive tags?

anilgulecha yesterday at 5:35 PM
To the folks and the Kitten team: I'm working on TTS as a problem statement (for an application), and wondering what the best model is at the latency/cost of inference. I'm currently settling for Gemini TTS, which allows for a lot of expressiveness, but at ~150ms a word it starts to hurt when the content is a few sentences.

My current best approach is wrapping around gemini-flash native and having the model speak the text I send it, which gets me end-to-end latency under a second.

Are there other models at this or better pricing I could be looking at?

ks2048 last Thursday at 4:52 PM
There are a number of recent, good-quality, small TTS models.

If the author doesn't describe some detail about the data, training, a novel architecture, etc., I can only assume they took another one, did a little fine-tuning, and repackaged it as a new product.

jamamp last Thursday at 11:56 PM
The GitHub readme doesn't list this: what data trained this? Was it done with the voices of the creators, or was it trained on data scraped from the internet or other archives?
boutell last Thursday at 8:12 PM
Great stuff. Is your team interested in the STT problem?
arcanemachiner last Thursday at 11:18 PM
Fingers crossed for a normal-sounding voice this time around. The cute Kitten voices are nice, but I want something I can take seriously when I'm listening to an audiobook.
armcat last Thursday at 6:42 PM
This is awesome, well done. Been doing a lot of work with voice assistants; if you can replicate Qwen3-TTS-style voice cloning in this small form factor, you will be absolute legends!
pumanoir last Thursday at 6:39 PM
The example.py file says "it will run blazing fast on any GPU. But this example will run on CPU."

I couldn't locate how to run it on a GPU anywhere in the repo.

swaminarayan yesterday at 2:38 AM
How did you make a very small AI model (14M) sound more natural and expressive than even bigger models?
magicalhippo last Thursday at 5:31 PM
A lot of good small TTS models in recent times. Most seem to struggle hard on prosody though.

Kokoro TTS, for example, has a very good Norwegian voice, but the rhythm and emphasis are often so out of whack that the generated speech is almost incomprehensible.

Haven't had time to check this model out yet, how does it fare here? What's needed to improve the models in this area now that the voice part is more or less solved?

stbtrax yesterday at 1:38 AM
Did they train this on @lauriewired's voice? The demo video sounds exactly like her at 0:18
devinprater last Thursday at 5:38 PM
A lot of these models struggle with small text strings, like "next button" that screen readers are going to speak a lot.
fwsgonzo last Thursday at 4:51 PM
How much work would it be to use the C++ ONNX run-time with this instead of Python? Is it a Claudeable amount of work?

The iOS version is Swift-based.

vezycash last Thursday at 6:52 PM
Would an Android app of this be able to replace the built in tts?
spyder yesterday at 3:00 PM
Nice, but it's weird that neither "language" nor "English" is mentioned on the GitHub page; only from the "Release multilingual TTS" roadmap item could I guess that it's probably English-only for now.
agnishom yesterday at 2:44 AM
I thought they were going to make kitten sounds instead of speech
ilaksh last Thursday at 4:35 PM
Thanks for open sourcing this.

Is there any way to do a custom voice as a DIY, or do we need to go through you? If so, would you consider making a pricing page for purchasing a license/alternative voice? All but one of the voices are unusable in a business context.

baibai008989 yesterday at 11:06 AM
the dependency chain issue is a real barrier for edge deployment. i've been running tts models on a raspberry pi for a home automation project and anything that pulls torch + cuda makes the whole thing a non-starter. 25MB is genuinely exciting for that use case.

curious about the latency characteristics though. 1.5x realtime on a 9700 is fine for batch processing but for interactive use you need first-chunk latency under 200ms or the conversation feels broken. does anyone know if it supports streaming output or is it full-utterance only?

the phoneme-based approach should help with pronunciation consistency too. the models i've tried that work on raw text tend to mispronounce technical terms unpredictably — same word pronounced differently across runs.
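
The "Nx realtime" figures quoted in this thread are just the ratio of audio duration to synthesis time, and first-chunk latency is what decides whether interactive use feels responsive. A quick sketch of both relationships (the numbers are illustrative, not measurements of Kitten TTS):

```python
def realtime_factor(audio_seconds: float, synthesis_seconds: float) -> float:
    """RTF > 1 means audio is generated faster than it plays back."""
    return audio_seconds / synthesis_seconds

def first_chunk_latency(chunk_seconds: float, rtf: float) -> float:
    """Time until the first chunk is ready, assuming uniform synthesis speed."""
    return chunk_seconds / rtf

# Illustrative: 1.5x realtime, as reported upthread for the 80M model on CPU.
rtf = realtime_factor(audio_seconds=6.0, synthesis_seconds=4.0)   # 1.5

# With streaming in 0.25 s chunks, the first audio arrives in ~0.167 s.
streamed = first_chunk_latency(chunk_seconds=0.25, rtf=rtf)

# With full-utterance synthesis, the "first chunk" is the whole 6 s clip,
# so the user waits 4.0 s before hearing anything.
full = first_chunk_latency(chunk_seconds=6.0, rtf=rtf)
```

The takeaway: at a fixed RTF, streaming output is what gets you under a ~200ms conversational budget; raw throughput alone does not.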

tim-projects yesterday at 7:01 AM
Only American voices? For some reason I'm only interested in Irish, British or Welsh accents. American is a no
amelius last Thursday at 9:40 PM
How long until I can buy this as a chip for my Arduino projects?
Stevvo yesterday at 7:43 AM
Found they struggle with numbers. Like, give them a random four digit number in a sentence and it fumbles.
pabs3 yesterday at 5:48 AM
Is this open-source or open-weights ML?
DavidTompkins last Thursday at 5:48 PM
This would be great as a js package - 25mb is small enough that I think it'd be worth it (in-browser tts is still pretty bad and varies by browser)
great_psy last Thursday at 4:33 PM
Thanks for working on this!

Is there any way to get those running on iPhone ? I would love to have the ability for it to read articles to me like a podcast.

sroussey yesterday at 4:53 AM
It is based on ONNX, so can I use it with transformers.js in the browser?
sschueller last Thursday at 6:19 PM
I'm still looking for the "perfect" setup in order to clone my voice and use it locally to send voice replies in Telegram via openclaw. Does anyone have such a setup?

I want to be my own personal assistant...

EDIT: I can provide it a RTX 3080ti.

schopra909 last Thursday at 6:33 PM
Really cool to see innovation in terms of quality of tiny models. Great work!
gabrielcsapo last Thursday at 7:06 PM
are there plans to output text alignment?
rsmtjohn yesterday at 7:33 AM
The <25MB figure is what stands out. Been wanting to add TTS to a few Next.js projects for offline/edge scenarios but model sizes have always made it impractical to ship.

At 25MB you can actually bundle it with the app. Going to test whether this works in a Vercel Edge Function context -- if latency is acceptable there it opens up a lot of use cases that currently require a round-trip to a hosted API.

erkoo yesterday at 9:41 AM
How noticeable is the difference in quality between the 4M model and the 80M model?
janice1999 last Thursday at 6:41 PM
What's the actual install size for a working example? Like similar "tiny" projects, do these models actually require installing 1GB+ of dependencies?
wiradikusuma last Thursday at 5:31 PM
I'm thinking of giving "voice" to my virtual pets (think Pokemon, but less than a dozen). The pets are made-up animals based on real ones, like Mouseier from Mouse (something like that). Is this possible?

TLDR: generate a human-like voice based on an animal sound. Anyway, maybe it doesn't make sense.

Tacite last Thursday at 4:36 PM
Is it English only?
whitepaper27 last Thursday at 6:59 PM
This is great. Demo looks awesome.
deathanatos yesterday at 1:37 AM
So, one thing I noticed, and this could easily be user error, is that if I set the text & voice in the example to:

  text ="""
  Hello world. This is Kitten TTS.
  Look, it's working!
  """

  voice = 'Luna'
On macOS, I get "Kitten TTS", but on Linux, I get "Kit… TTS". Both OSes generate the same phonemes of,

  Phonemes: ðɪs ɪz kˈɪʔn ̩ tˌiːtˌiːˈɛs ,
which makes me really confused as to where it's going off the rails on Linux, since from there it should just be invoking the model.

edit: it really helps to use the same model facepalm. It's the 80M model, and it happens on both OSes. Wildly, the nano gets it better? I'm going to join the Discord lol.

pabs3 yesterday at 5:46 AM
What's the training data for this?
exe34 last Thursday at 8:36 PM
sounds amazing! does it stream? or is it so fast you don't need to?
moralestapia last Thursday at 8:52 PM
Wow, what an amazing feat. Congratulations!
tredre3 last Thursday at 11:33 PM
This is something I've been looking for (the <50MB models in particular). Unfortunately my feedback is as follows:

      Downloading https://github.com/KittenML/KittenTTS/releases/download/0.8.1/kittentts-0.8.1-py3-none-any.whl (22 kB)
    Collecting num2words (from kittentts==0.8.1)
      Using cached num2words-0.5.14-py3-none-any.whl.metadata (13 kB)
    Collecting spacy (from kittentts==0.8.1)
      Using cached spacy-3.8.11-cp314-cp314-win_amd64.whl.metadata (28 kB)
    Collecting espeakng_loader (from kittentts==0.8.1)
      Using cached espeakng_loader-0.2.4-py3-none-win_amd64.whl.metadata (1.3 kB)
    INFO: pip is looking at multiple versions of kittentts to determine which version is compatible with other requirements. This could take a while.
    ERROR: Ignored the following versions that require a different python version: 0.7.10 Requires-Python >=3.8,<3.13; 0.7.11 Requires-Python >=3.8,<3.13; 0.7.12 Requires-Python >=3.8,<3.13; 0.7.13 Requires-Python >=3.8,<3.13; 0.7.14 Requires-Python >=3.8,<3.13; 0.7.15 Requires-Python >=3.8,<3.13; 0.7.16 Requires-Python >=3.8,<3.13; 0.7.17 Requires-Python >=3.8,<3.13; 0.7.5 Requires-Python >=3.8,<3.13; 0.7.6 Requires-Python >=3.8,<3.13; 0.7.7 Requires-Python >=3.8,<3.13; 0.7.8 Requires-Python >=3.8,<3.13; 0.7.9 Requires-Python >=3.8,<3.13; 0.8.0 Requires-Python >=3.8,<3.13; 0.8.1 Requires-Python >=3.8,<3.13; 0.8.2 Requires-Python >=3.8,<3.13; 0.8.3 Requires-Python >=3.8,<3.13; 0.8.4 Requires-Python >=3.8,<3.13; 0.9.0 Requires-Python >=3.8,<3.13; 0.9.2 Requires-Python >=3.8,<3.13; 0.9.3 Requires-Python >=3.8,<3.13; 0.9.4 Requires-Python >=3.8,<3.13; 3.8.3 Requires-Python >=3.9,<3.13; 3.8.5 Requires-Python >=3.9,<3.13; 3.8.6 Requires-Python >=3.9,<3.13; 3.8.7 Requires-Python >=3.9,<3.14; 3.8.8 Requires-Python >=3.9,<3.14; 3.8.9 Requires-Python >=3.9,<3.14
    ERROR: Could not find a version that satisfies the requirement misaki>=0.9.4 (from kittentts) (from versions: 0.1.0, 0.3.0, 0.3.5, 0.3.9, 0.4.0, 0.4.4, 0.4.5, 0.4.6, 0.4.7, 0.4.8, 0.4.9, 0.5.0, 0.5.1, 0.5.2, 0.5.3, 0.5.4, 0.5.5, 0.5.6, 0.5.7, 0.5.8, 0.5.9, 0.6.0, 0.6.1, 0.6.2, 0.6.3, 0.6.4, 0.6.5, 0.6.6, 0.6.7, 0.7.0, 0.7.1, 0.7.2, 0.7.3, 0.7.4)
    ERROR: No matching distribution found for misaki>=0.9.4

I realize that I can run multiple versions of Python on my system and use venv to manage them (or whatever equivalent is now trendy), but as I near retirement age all these deep dependency nets required by modern software really depress me. Have you ever tried to build a node app that hasn't been updated in 18 months? It can't be done. Old man yelling at cloud I guess shrugs.
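
For what it's worth, the resolver log above shows cp314 wheels being considered while every kittentts release declares Requires-Python >=3.8,<3.13, so the failing interpreter is 3.14. A sketch of a fix using uv, assuming Python 3.12 is acceptable for the project:

```shell
# Create a virtualenv on an interpreter the package supports (<3.13),
# then install into it.
uv venv --python 3.12
uv pip install kittentts
```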
Remi_Etien last Thursday at 5:49 PM
25MB is impressive. What's the tradeoff vs the 80M model — is it mainly voice quality or does it also affect pronunciation accuracy on less common words?