Show HN: Three new Kitten TTS models – smallest less than 25MB
534 points - last Thursday at 3:56 PM
Kitten TTS (https://github.com/KittenML/KittenTTS) is an open-source series of tiny and expressive text-to-speech models for on-device applications. We had a thread last year here: https://news.ycombinator.com/item?id=44807868.
Today we're releasing three new models with 80M, 40M and 14M parameters.
The largest model (80M) has the highest quality. The 14M variant reaches new SOTA in expressivity among similarly sized models, despite being <25MB in size. This release is a major upgrade from the previous one and supports English text-to-speech applications in eight voices: four male and four female.
Here's a short demo: https://www.youtube.com/watch?v=ge3u5qblqZA.
Most models are quantized to int8 + fp16, and they use ONNX for runtime. Our models are designed to run anywhere, e.g. Raspberry Pi, low-end smartphones, wearables, browsers, etc. No GPU required! This release aims to bridge the gap between on-device and cloud models for TTS applications. A multilingual model release is coming soon.
On-device AI is bottlenecked by one thing: a lack of tiny models that actually perform. Our goal is to open-source more models to run production-ready voice agents and apps entirely on-device.
We would love your feedback!
Comments
BTW, it seems that kitten (the Python package) has the following chain of dependencies: kittentts → misaki[en] → spacy-curated-transformers
So if you install it directly via uv, it will pull torch and NVIDIA CUDA packages (several GB), which are not needed to run kitten.
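A quick way to see whether an environment actually picked up the heavy packages described above is to scan the installed distributions. This is a minimal sketch using only the standard library; the prefix list is an assumption for illustration, not an official inventory of what the dependency chain pulls in.

```python
from importlib import metadata

# Name prefixes treated as "heavy GPU stack" for this check (an assumption).
# Note that "torch" also matches related packages such as torchaudio.
HEAVY_PREFIXES = ("torch", "nvidia-")


def find_heavy_packages(names):
    """Return the sorted, de-duplicated names that start with a heavy prefix."""
    # Normalize case and underscores so "NVIDIA_foo" and "nvidia-foo" compare equal.
    normalized = {n.lower().replace("_", "-") for n in names if n}
    return sorted(n for n in normalized if n.startswith(HEAVY_PREFIXES))


def installed_distribution_names():
    """List the names of all distributions installed in the current environment."""
    return [dist.metadata["Name"] for dist in metadata.distributions()]


if __name__ == "__main__":
    heavy = find_heavy_packages(installed_distribution_names())
    print("heavy packages found:", heavy or "none")
```

Running it inside the virtual environment right after installation shows whether torch or any nvidia-* wheels came along.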
I'm impressed with the quality given the size. I don't love the voices, but it's not bad. Running on an Intel 9700 CPU, it's about 1.5x realtime using the 80M model. It wasn't any faster running on a 3080 GPU though.
I didn't expect it to pronounce 'ms' correctly, but the number sounded just like noise. Eventually I got an acceptable result for the string "Startup finished in one hundred and thirty five seconds."
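The workaround above (spelling the number out before synthesis) can be automated as a preprocessing step. Below is a minimal sketch that only handles whole numbers from 0 to 999; the function names are made up for illustration, and it is not how Kitten TTS normalizes text internally.

```python
import re

ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven", "eight",
        "nine", "ten", "eleven", "twelve", "thirteen", "fourteen", "fifteen",
        "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy",
        "eighty", "ninety"]


def spell_number(n):
    """Spell an integer from 0 to 999 in English words, e.g. 'one hundred and five'."""
    if not 0 <= n <= 999:
        raise ValueError("this sketch only supports 0..999")
    if n < 20:
        return ONES[n]
    if n < 100:
        tens, ones = divmod(n, 10)
        return TENS[tens] + (" " + ONES[ones] if ones else "")
    hundreds, rest = divmod(n, 100)
    words = ONES[hundreds] + " hundred"
    return words + (" and " + spell_number(rest) if rest else "")


def spell_out_numbers(text):
    """Replace each run of 1 to 3 digits (not part of a longer run) with words."""
    return re.sub(r"(?<!\d)\d{1,3}(?!\d)",
                  lambda m: spell_number(int(m.group())), text)
```

Decimals, units glued to digits (such as "135ms"), and numbers above 999 are out of scope here and would need extra rules or a dedicated library.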
Maybe a dumb and slightly tangential question, (I don't mean this as a criticism!) but why not release a command line executable?
Even the API looks like what you'd see in a manpage.
I get it wouldn't be too much work for a user to actually make something like that, I'm just curious what the thought process is
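For what it's worth, a thin command-line wrapper could look roughly like the sketch below. Everything here is hypothetical: the program name kitten-say, the flags, and the synthesize callback (which stands in for whatever call actually produces and saves audio) are not part of the project. Only the voice name 'Luna' is taken from a snippet elsewhere in this thread.

```python
import argparse
import sys


def build_parser():
    """Build the argument parser for the hypothetical 'kitten-say' command."""
    parser = argparse.ArgumentParser(
        prog="kitten-say",
        description="Speak text with a TTS model (hypothetical wrapper).")
    parser.add_argument("text", nargs="?",
                        help="text to speak; read from stdin when omitted")
    parser.add_argument("-v", "--voice", default="Luna", help="voice name")
    parser.add_argument("-o", "--output", default="output.wav",
                        help="path of the audio file to write")
    return parser


def resolve_text(args, stdin=None):
    """Return the positional text, or read it from stdin when it was omitted."""
    if args.text is not None:
        return args.text
    return (stdin or sys.stdin).read().strip()


def main(argv=None, synthesize=None):
    """Parse arguments and hand (text, voice, output) to the synthesize callback."""
    args = build_parser().parse_args(argv)
    text = resolve_text(args)
    if synthesize is None:
        # No backend wired in: only report what would have been done.
        print(f"would speak {len(text)} characters as {args.voice!r} into {args.output}")
    else:
        synthesize(text, args.voice, args.output)
    return 0
```

Calling main(["Hello world", "--voice", "Luna"]) exercises the parser without touching any model; a real tool would pass a synthesize function that wraps the library.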
Either in the form of the API via pitch/speed/volume controls, for more deterministic controls.
Or in expressive tags such as [coughs], [urgently], or [laughs in melodic ascending and descending arpeggiated gibberish babbles].
the 25MB model is amazingly good for being 25MB. How does it handle expressive tags?
my current best approach is wrapping around gemini-flash native, and the model speaking the text i send it, which allows me end to end latency under a second.
are there other models at this or better pricing i can be looking at?
If the author doesn't describe some detail about the data, training, or a novel architecture, etc., I can only assume they just took another model, did a little finetuning, and repackaged it as a new product.
I couldn't locate how to run it on a GPU anywhere in the repo.
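Since the post says the models run on ONNX, GPU use would normally come down to which ONNX Runtime execution provider the session is created with. The sketch below only shows the provider-selection part. Whether the kittentts package lets you pass providers through is not stated in this thread, so treat the wiring as an assumption.

```python
# Preference order for GPU-capable ONNX Runtime execution providers.
GPU_PROVIDERS = ["CUDAExecutionProvider", "CoreMLExecutionProvider",
                 "DmlExecutionProvider"]


def pick_providers(available, prefer_gpu=True):
    """Return GPU providers that are available (if wanted), then the CPU fallback."""
    chosen = [p for p in GPU_PROVIDERS if prefer_gpu and p in available]
    chosen.append("CPUExecutionProvider")
    return chosen


try:
    import onnxruntime as ort
except ImportError:
    print("onnxruntime is not installed; nothing to query")
else:
    providers = pick_providers(ort.get_available_providers())
    print("providers to request:", providers)
    # With a model file at hand, a session would be created along the lines of:
    #   session = ort.InferenceSession(model_path, providers=providers)
```

Note that the CUDA provider only shows up in get_available_providers() when the GPU build of ONNX Runtime is installed; with the CPU-only build, the session runs on the CPU regardless of the hardware.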
Kokoro TTS for example has a very good Norwegian voice but the rhythm and emphasizing is often so out of whack the generated speech is almost incomprehensible.
Haven't had time to check this model out yet, how does it fare here? What's needed to improve the models in this area now that the voice part is more or less solved?
The iOS version is Swift-based.
Is there any way to do a custom voice as a DIY? Or do we need to go through you? If so, would you consider making a pricing page for purchasing a license/alternative voice? All but one of the voices are unusable in a business context.
curious about the latency characteristics though. 1.5x realtime on a 9700 is fine for batch processing but for interactive use you need first-chunk latency under 200ms or the conversation feels broken. does anyone know if it supports streaming output or is it full-utterance only?
the phoneme-based approach should help with pronunciation consistency too. the models i've tried that work on raw text tend to mispronounce technical terms unpredictably — same word pronounced differently across runs.
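If a model turns out to be full-utterance only, one common way to approximate streaming is to split the text into sentences and synthesize them one at a time, so playback can start after the first sentence. Here is a minimal sketch for measuring that; synthesize is a placeholder callable (an assumption, not the Kitten TTS API), and "1.5x realtime" is read here as 1.5 seconds of audio per second of compute.

```python
import re
import time


def split_sentences(text):
    """Split text into chunks after '.', '!' or '?' followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def stream_chunks(text, synthesize):
    """Yield (sentence, audio, seconds_since_start) one sentence at a time."""
    start = time.perf_counter()
    for sentence in split_sentences(text):
        audio = synthesize(sentence)
        yield sentence, audio, time.perf_counter() - start


def realtime_factor(audio_seconds, wall_seconds):
    """Seconds of audio produced per second of compute; above 1.0 beats real time."""
    return audio_seconds / wall_seconds
```

The third element of the first yielded tuple is the first-chunk latency, which is the number to compare against the 200ms target mentioned above.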
Is there any way to get those running on iPhone? I would love to have the ability for it to read articles to me like a podcast.
I want to be my own personal assistant...
EDIT: I can provide it a RTX 3080ti.
At 25MB you can actually bundle it with the app. Going to test whether this works in a Vercel Edge Function context -- if latency is acceptable there it opens up a lot of use cases that currently require a round-trip to a hosted API.
Tldr: generate human-like voice based on animal sound. Anyway maybe it doesn't make sense.
text = """
Hello world. This is Kitten TTS.
Look, it's working!
"""
voice = 'Luna'
On macOS, I get "Kitten TTS", but on Linux, I get "Kit… TTS". Both OSes generate the same phonemes of, Phonemes: ðɪs ɪz kˈɪʔn ̩ tˌiːtˌiːˈɛs ,
which makes me really confused as to where it's going off the rails on Linux, since from there it should just be invoking the model.
edit: it really helps to use the same model (facepalm). It's the 80M model, and it happens on both OSes. Wildly, the nano gets it better? I'm going to join the Discord lol.
Downloading https://github.com/KittenML/KittenTTS/releases/download/0.8.1/kittentts-0.8.1-py3-none-any.whl (22 kB)
Collecting num2words (from kittentts==0.8.1)
Using cached num2words-0.5.14-py3-none-any.whl.metadata (13 kB)
Collecting spacy (from kittentts==0.8.1)
Using cached spacy-3.8.11-cp314-cp314-win_amd64.whl.metadata (28 kB)
Collecting espeakng_loader (from kittentts==0.8.1)
Using cached espeakng_loader-0.2.4-py3-none-win_amd64.whl.metadata (1.3 kB)
INFO: pip is looking at multiple versions of kittentts to determine which version is compatible with other requirements. This could take a while.
ERROR: Ignored the following versions that require a different python version: 0.7.10 Requires-Python >=3.8,<3.13; 0.7.11 Requires-Python >=3.8,<3.13; 0.7.12 Requires-Python >=3.8,<3.13; 0.7.13 Requires-Python >=3.8,<3.13; 0.7.14 Requires-Python >=3.8,<3.13; 0.7.15 Requires-Python >=3.8,<3.13; 0.7.16 Requires-Python >=3.8,<3.13; 0.7.17 Requires-Python >=3.8,<3.13; 0.7.5 Requires-Python >=3.8,<3.13; 0.7.6 Requires-Python >=3.8,<3.13; 0.7.7 Requires-Python >=3.8,<3.13; 0.7.8 Requires-Python >=3.8,<3.13; 0.7.9 Requires-Python >=3.8,<3.13; 0.8.0 Requires-Python >=3.8,<3.13; 0.8.1 Requires-Python >=3.8,<3.13; 0.8.2 Requires-Python >=3.8,<3.13; 0.8.3 Requires-Python >=3.8,<3.13; 0.8.4 Requires-Python >=3.8,<3.13; 0.9.0 Requires-Python >=3.8,<3.13; 0.9.2 Requires-Python >=3.8,<3.13; 0.9.3 Requires-Python >=3.8,<3.13; 0.9.4 Requires-Python >=3.8,<3.13; 3.8.3 Requires-Python >=3.9,<3.13; 3.8.5 Requires-Python >=3.9,<3.13; 3.8.6 Requires-Python >=3.9,<3.13; 3.8.7 Requires-Python >=3.9,<3.14; 3.8.8 Requires-Python >=3.9,<3.14; 3.8.9 Requires-Python >=3.9,<3.14
ERROR: Could not find a version that satisfies the requirement misaki>=0.9.4 (from kittentts) (from versions: 0.1.0, 0.3.0, 0.3.5, 0.3.9, 0.4.0, 0.4.4, 0.4.5, 0.4.6, 0.4.7, 0.4.8, 0.4.9, 0.5.0, 0.5.1, 0.5.2, 0.5.3, 0.5.4, 0.5.5, 0.5.6, 0.5.7, 0.5.8, 0.5.9, 0.6.0, 0.6.1, 0.6.2, 0.6.3, 0.6.4, 0.6.5, 0.6.6, 0.6.7, 0.7.0, 0.7.1, 0.7.2, 0.7.3, 0.7.4)
ERROR: No matching distribution found for misaki>=0.9.4
I realize that I can run multiple versions of Python on my system, and use venv to manage them (or whatever equivalent is now trendy), but as I near retirement age all those deep dependency nets required by modern software are really depressing me. Have you ever tried to build a node app that hasn't been updated in 18 months? It can't be done. Old man yelling at cloud I guess (shrugs).
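Reading the pip log above: the cached spacy wheel is tagged cp314, which suggests the interpreter is Python 3.14, while the ignored versions all declare "Requires-Python >=3.8,<3.13". That would explain why no matching misaki release is found. A minimal pre-flight check, with the range copied from the log (the helper name is made up for illustration):

```python
import sys

# Range copied from the "Requires-Python >=3.8,<3.13" entries in the pip log.
MIN_VERSION = (3, 8)
MAX_EXCLUSIVE = (3, 13)


def is_supported(version, minimum=MIN_VERSION, maximum=MAX_EXCLUSIVE):
    """Return True when (major, minor) lies in the half-open range [minimum, maximum)."""
    return minimum <= tuple(version[:2]) < maximum


if __name__ == "__main__":
    current = sys.version_info[:2]
    status = "inside" if is_supported(current) else "outside"
    print(f"Python {current[0]}.{current[1]} is {status} the >=3.8,<3.13 range")
```

If the check reports "outside", creating the virtual environment with an interpreter inside that range is the likely fix, assuming the version constraint is the only problem.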