VibeVoice: Open-source frontier voice AI (github.com)

372 points by tosh a day ago

steinvakt2 a day ago

This is not a new model. Also, it hallucinates a lot. Also, it's very heavy and slow in inference. It's also bad in multilingual.

Edit: I'm talking purely about speech to text (STT). Not sure about the other things this can do.

terbo 17 hours ago

It has some perks, is a bit more expressive in some cases, but overall is trained on really noisy data, uses more memory, and isn't that fast - I'm talking about the (7b?) version that they released then removed quickly (vibevoice-community on github) - I still use chatterbox turbo and sometimes qwen TTS.

lblock 21 hours ago

Yeah, I don't get why it is suddenly getting so much attention today, it is all over twitter too

xnx 20 hours ago

Simonw (who has a bit of a Midas touch for posts here) just posted about it https://simonwillison.net/2026/Apr/27/vibevoice/

GuinansEyebrows 20 hours ago

there is so much more subversive marketing out there than any of us can really fathom. i try not to be too paranoid but it's getting a lot harder every day.

i know someone who worked in what we might call the 'astroturfing' space within the entertainment industry. after having a few discussions with him and with things like this[0] becoming more known, it's really difficult to afford any assumption of organic intent when money is on the line - especially at the scale that microsoft works at compared to something as comparatively quaint as the music industry.

[0] https://www.wired.com/story/geese-chaotic-good-marketing-ind...

ramon156 21 hours ago

well duh, they updated the news section

https://github.com/microsoft/VibeVoice/commit/e73d1e17c3754f...

which is microsoft for "we removed two dead links". AI innovation knows no limits!

gagan2020 20 hours ago

It is not good for text to speech (TTS) either. I have been trying it for a few days. First of all, documentation for the 1.5B model is missing. The 0.5B realtime model is shit. I was converting text line by line, and it was randomly adding music and couldn't handle special characters like "…".

I'm really disappointed with this model, to say the least.

Stagnant 16 hours ago

The 7B parameter Vibevoice TTS model is still the most impressive local TTS model i've tried. It was pulled by Microsoft a few days after its release due to "abuse potential" but it can be found in various community maintained huggingface repos.

tjungblut 15 hours ago

yep, it seems this was trained on a large amount of podcasts with ad jingles or phone call queues with elevator music. I was also pretty disappointed when I ran the TTS last week.

narrationbox 15 hours ago

Yes, the SOTA is currently much more advanced.

steinvakt2 4 hours ago

What do you consider to be SOTA?

zuzululu 18 hours ago

you saved us a lot of time here.... i unstarred the repo

moving on....

Capricorn2481 18 hours ago

I don't really pay attention to stars. Do people use them as bookmarks? Why would you star a repo if you knew so little about it?

Tamatarr 17 hours ago

Saved a lot of my time thanks!

tombert 18 hours ago

I'm shocked, shocked to find that Microsoft takes credit for a slow, unoriginal product that doesn't actually do what it advertises.

logicchains 17 hours ago

Imagine the balls it took to willingly attach the Microsoft label to the front of the product that is Teams.

scotty79 20 hours ago

You just saved me an afternoon.

maxloh a day ago

I think we should stop calling this type of model open source. They are indeed "open weight." The training code is proprietary and was never released.

https://github.com/microsoft/VibeVoice/issues/102

jcmfernandes 21 hours ago

Indeed. We now live in a world where freeware is named open source. We are very sorry, Stallman.

MarsIronPI 20 hours ago

If you're going to apologize to Stallman, you should apologize for conflating open source with software freedom. ;D

simonw 20 hours ago

I'm reserving that complaint for "open source" models which are released under non-open-source licenses.

I care that I know what I can DO with the project when I see it described as "open source".

yjftsjthsd-h 20 hours ago

> I care that I know what I can DO with the project when I see it described as "open source".

Yes, the first of which is that you should be able to build it from source. Which requires the source code, and in this case data.

data-ottawa 20 hours ago

That would be “permissive license”

Maybe we should have a little cue card for models: vendor/name, size, open weights, open source, permissive license.

It’s simple enough an idea.

JumpCrisscross 21 hours ago

> we should stop calling this type of model open source. They are indeed "open weight”

This ship has sailed. It’s now in the same category as hacker/cracker and the pronunciation of GIF.

andy_ppp 21 hours ago

I think you mean GIF.

engeljohnb 18 hours ago

The inventor of GIF didn't begin with a document* clearly laying out what is and isn't to be called a "GIF."

I think it's right to push back whenever a huge tech corporation tries to build goodwill by falsely using terms like "open source."

*https://opensource.org/osd

giancarlostoro 21 hours ago

It's the same as GIS, you wouldn't say jizz now would you?

WarmWash 21 hours ago

And "hallucination" which should have been "delusion".

Way early on (spring 2023) people tried to stop it, but no luck.

WhyNotHugo 19 hours ago

Devil's advocate here: I can give you a binary of my open source MIT code and never send you the code. The code is still MIT licensed, and open source. You just have no access to it.

That said, I entirely agree that MS is misrepresenting their openness here, which isn’t in the least surprising.

Otek 19 hours ago

? Do you know what “source” means in open source? Like, what is the source of the binary? It’s the code. That’s the source in open source.

freedomben 19 hours ago

In their defense, most everyone else does the same thing. They still shouldn't do it, but at least they're not the trendsetter here (though they are contributing to the ongoing problem)

btown 20 hours ago

At least it's MIT licensed! As much as non-open training data irks me, restrictive licensing irks me more!

cute_boi 19 hours ago

what is the problem with restrictive licensing? Most of the restrictions only start to apply once you have 1M users etc.

bitvvip 20 hours ago

What you said makes a lot of sense. Free software should not be confused with open source

giancarlostoro 21 hours ago

I mean, you have "AI" which means just about anything in marketing speak, "Agentic" is kind of becoming similar, hopefully they don't goof that one too badly, would be nice to know what you are trying to sell me. Used to be "Cloud" meant storage not just hosting (I guess it still does).

Then there's "Smart" in front of Car, Phone, TV, and so on... Meaning different things.

I do think "Open Weight" should be more commonly used. There's definitely communities that spring up that build the training infrastructure and inference infrastructure around open models on the other hand.

scotty79 20 hours ago

Open weights is not exactly right either because we do get source of the software that uses those open weights.

Maybe open inference?

But we often also get source code for fine-tuning the model.

So maybe it's closer to open source than to anything else?

Isn't it a bit like not calling a game open source because the engine tooling used to make it isn't open source and they didn't publish .psd files with asset designs?

jrm4 20 hours ago

I'm genuinely torn on this one; I get technically why not, but why I think I have no problem with it is the wishy-washiness of "open source" generally.

As I teach this stuff to people newer to this tech, it's probably just easier and more helpful to refer to the wide array of "stuff you can just download and use yourself" as "open-source" and then after that, go deeper and talk about why Stallman was right, how "Free Software" was first, etc.

notabotiswear 21 hours ago

Openwashing is the new greenwashing, which, coincidently, seems to have gone out of fashion a few hundred datacentres ago.

dist-epoch 21 hours ago

it was replaced with abundancewashing

isodev 18 hours ago

I think in this category, Voxtral by Mistral is a lot better. It also happens to be small enough to run on WebGPU https://huggingface.co/spaces/mistralai/Voxtral-Realtime-Web...

pluc 21 hours ago

Interesting story about this repo/product/author by cybersecurity researcher Kevin Beaumont: https://cyberplace.social/@GossiTheDog/116454846703138243

tacticus 17 hours ago

got to love how they're trying to hide the links.

embedding-shape a day ago

Isn't this project the one Microsoft published but then soon after pulled it for security/safety reasons? What has changed since then?

542458 a day ago

Look at the "News" section in the readme - The original TTS model is gone from this repo (you can still find it in other places), but the STT/ASR, long form TTS, and streaming TTS models are newer.

infecto 21 hours ago

It’s confusing (at least for me) because the project covers a number of things including what you are mentioning.

Barbing 21 hours ago

[off topic]

When explanations get posted directly in HN comments, I imagine someone somewhere in the world is able to learn in spite of their Internet restrictions/firewalls

People will also post their own interpretations in response to comments, and quickly find out they missed something.

… But if you try to automate it, like include a summary under every HN post, you encourage laziness too much and are pre-chewing too heavily. Some balance here.

[on topic]

(OK I’m done making excuses, time to read the article… thanks for the encouragement!)

I thought this was not explained in the readme directly but in fact I missed it. I wasn’t going to read Microsoft’s entire changelog! But it was substantive, thanks to sibling commenter:

“2025-09-05: VibeVoice is an open-source research framework intended to advance collaboration in the speech synthesis community. After release, we discovered instances where the tool was used in ways inconsistent with the stated intent. Since responsible use of AI is one of Microsoft’s guiding principles, we have removed the VibeVoice-TTS code from this repository.”

ipotapov 4 hours ago

I built speech-swift, which focuses on on-device speech processing like VibeVoice, but specifically leverages Apple Silicon's capabilities for ASR, TTS, and VAD without cloud dependency. Our ASR supports 52 languages with a real-time factor of 0.06. https://soniqo.audio/benchmarks
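For anyone unfamiliar with the metric: real-time factor (RTF) is usually defined as processing time divided by audio duration, so anything below 1 is faster than real time. A minimal sketch of what the quoted 0.06 would imply, assuming that definition (the function name is made up for illustration, and the 0.06 figure is the claim above, not something measured here):

```python
def processing_seconds(audio_seconds: float, rtf: float) -> float:
    """Estimate transcription time from audio length and real-time factor.

    Assumes RTF = processing time / audio duration.
    """
    return audio_seconds * rtf


one_hour = 60 * 60  # seconds of audio
estimate = processing_seconds(one_hour, rtf=0.06)  # 0.06 is the figure quoted above
print(f"{estimate / 60:.1f} minutes to transcribe one hour of audio")
```

Under that assumption, an hour of audio would take roughly 216 seconds, a bit under four minutes.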

aqme28 21 hours ago

Interesting to see "vibe" enshrined by the likes of Microsoft as an AI product word.

accrual 21 hours ago

Especially when "vibe coded" can have a negative connotation meaning quickly put together without understanding.

ryandrake 19 hours ago

In my mind, Vibe-anything means "some slop carelessly thrown together to ship as fast as possible." Wild that it's being used in a serious product name!

Barbing 21 hours ago

I’m just surprised they put the name of the e-waste slop company in their product

amlib 18 hours ago

Maybe they were trying to make a pun on "Via Voice", the cursed IBM STT from the 90s?

lvncelot 19 hours ago

I'm honestly more surprised that they could resist the temptation to call it Copilot

tempodox 16 hours ago

Microslop Copilot for Voice! After they renamed Office, they surely will rename this one, too.

CubsFan1060 a day ago

Great post last night from Simon: https://simonwillison.net/2026/Apr/27/vibevoice/

542458 a day ago

Note that this just covers the Speech-to-Text/Speech-Recognition aspect (à la Whisper); there are also models for long-form Text-To-Speech and streaming Text-To-Speech.

JumpCrisscross 21 hours ago

“VibeVoice can only handle up to an hour of audio”

Why?

Anonyneko a day ago

You have selected Microsoft Sam as the computer's default voice.

accrual 21 hours ago

My friends and I had fun in the computer lab with Microsoft Sam, inputting long strings of characters to create funny sound effects. Sususususususu.

ryukoposting 21 hours ago

Holy moly, a Microsoft AI product that isn't named Copilot!

DoctorOW 21 hours ago

Missed opportunity to call it Vopilot

silverwind 20 hours ago

Slopilot

podgietaru a day ago

So we've really just settled on Vibe as the verb for AI then?

giarc a day ago

I'd be willing to bet it will be "Word of the Year" for 2026. Merriam-Webster had 'slop' for 2025, and 'polarization' for 2024. Is there a prediction market for this?

internet_points 21 hours ago

it'll probably be something we're not even talking about yet - we still have 7 months in which to make the world even worse

pryanshu89 a day ago

Why use precise technical language when you can just vibe with your AI system?

xnx 20 hours ago

Still waiting for the open weights model that conclusively beats the multi-year old Whisper in accuracy, features, and performance.

scotty79 19 hours ago

It's crazy that a lot is happening in open models for stt, but there's very little progress when it comes to results, esp multilingual.

triage8004 17 hours ago

Surprised it wasn't called Copilot Voice

chaosprint 21 hours ago

Microsoft Store App Vibing.exe Accused of Harvesting Screens, Audio, and Clipboard Data:

https://cyberpress.org/microsoft-store-app-vibing-exe-accuse...

mberg 19 hours ago

I've been using VibeVoice's ASR (speech to text) model quite intensively for the past month and have found it to be a lot more reliable and out-of-the-box functional than Whisper, Parakeet and other models. The fact that it has diarization built into the model is a huge win in my book. Without that you have to run a different model just for diarization, which adds significantly to the overall processing time, whereas VibeVoice gives you reliably great results. Big fan.

vijgaurav 16 hours ago

The 60-minute single-pass transcription is the part that actually matters. Most ASR models chunk audio and you lose speaker continuity across boundaries. If the diarization actually holds up on hour-long recordings without drifting, that's a real solve for podcast and meeting transcription workflows.

frangonf 21 hours ago

I took a look into local options for ASR and diarization some months ago; I missed that VibeVoice now has this feature.

My conclusion back then (which came only from shallow research on the topic and 0 real experience, mind you) was that Whisper + Pyannote was the "stable" approach.

Have the VibeVoice, Voxtral, Qwen or the Nemo solutions caught up in segmentation and speaker recognition?

woodson 16 hours ago

It highly depends on the sort of data you’re processing (phone calls, podcasts, meetings of more people recorded using single channel?). For NVIDIA/NeMo, check out their Sortformer diarization models (also streaming).

vicchenai 10 hours ago

the built-in diarization is the one thing that actually caught my attention here. running whisper + pyannote separately is a pain for long recordings and the speaker continuity breaks at chunk boundaries. if this handles it in a single pass that's a real workflow improvement, regardless of how the raw accuracy benchmarks compare

Void_ a day ago

In the past month or so, I added 2 models to my app Whisper Memos (https://whispermemos.com):

- Cohere Transcribe (self hosted)

- Grok Speech To Text (they provide an API, only $0.10/hr!)

They are both excellent. I'm not sure about this one. Would you like to see it in a consumer speech to text app?

olejorgenb 21 hours ago

I've had good experiences with the Mistral Voxtral models (I've used the API, but some of the model-variants are open weight)

Barbing 21 hours ago

Does Cohere work with longer transcripts? Do you have to do some magic to merge recordings over 35 seconds long?

2ndorderthought 21 hours ago

Have you tried qwen?

SecretDreams 21 hours ago

Any non-Musk alternatives that are comparable in quality and cost?

jayphen 21 hours ago

Voxtral competes on price ($0.003/min) and quality. Speechmatics has best in class accuracy but is a bit more expensive ($0.004/min)
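Since this subthread mixes per-hour and per-minute prices, here is a rough sketch that puts them on the same scale. The numbers are only the ones quoted in these comments (not checked against any vendor's pricing page), and the names in the code are made up for illustration:

```python
# Prices as quoted in this thread, normalized to dollars per minute.
# These are commenters' figures, not verified vendor pricing.
PRICE_PER_MINUTE = {
    "Grok Speech To Text": 0.10 / 60,  # quoted as $0.10/hr
    "Voxtral": 0.003,                  # quoted as $0.003/min
    "Speechmatics": 0.004,             # quoted as $0.004/min
}


def cost_per_hour(price_per_minute: float) -> float:
    """Convert a per-minute price to a per-hour price."""
    return price_per_minute * 60


# Print cheapest first.
for name, per_minute in sorted(PRICE_PER_MINUTE.items(), key=lambda kv: kv[1]):
    print(f"{name}: ${cost_per_hour(per_minute):.2f}/hr")
```

On those quoted figures the ordering is Grok, then Voxtral, then Speechmatics, with Voxtral at about $0.18/hr and Speechmatics at about $0.24/hr.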

Void_ 21 hours ago

Our default is still OpenAI Whisper. Grok is just a choice for users who might prefer it.

JumpCrisscross 21 hours ago

What’s the current state of the art, for each of training locally and in the cloud, for learning my voice?

yreg 21 hours ago

Locally maybe https://voicebox.sh/

Elevenlabs in the cloud.

chrsw 21 hours ago

Local? No idea. Cloud? Eleven Labs, probably. But it's described as "cloning" not "training". Not sure what the distinction is or why it matters if the end result is that you can generate any TTS that sounds like you. There might very well be an important one, I just don't know it.

khimaros 21 hours ago

open weights i would say S2: https://github.com/rodrigomatta/s2.cpp

Mobius01 20 hours ago

Microsoft has historically made poor choices in product naming, but this has to be a new low.

dragonfax 18 hours ago

Shouldn't it be called something like "Copilot Voice"?

Narishma 17 hours ago

That's not confusing enough. It should be just Copilot.

lizardking 13 hours ago

Microsoft continues to be completely incapable of coming up with good names for their products and services

yayadarsh 18 hours ago

Someone tell me if this is better or worse than Parakeet

low_tech_punk 16 hours ago

When mixing languages, why does the English have a Chinese accent and the Chinese have an English accent? Is it a feature or a bug?

BlastBash192 21 hours ago

Maybe Microsoft’s real strength was never making the best model, it was knowing you don’t need to, as long as you own the platform everyone builds on.

threepts 17 hours ago

Explains most of the shit they have been pushing with Windows 11. Perhaps all that bloatware was VibeVoiced too.

solomatov 19 hours ago

It would have been better if they provided not just weights, but also some frontend where it is usable as is.

isolay 16 hours ago

Seriously, VibeVoice? Microslop really has a penchant for the worst names.

mistic92 21 hours ago

For me it's giving very poor results

khimaros 21 hours ago

looks like this offers ASR support in GGUF https://github.com/CrispStrobe/CrispASR -- haven't tested

nickandbro 18 hours ago

This is a very good model, but can it be run on the web?

unixhero 17 hours ago

What do they mean by "frontier voice"?

decide1000 15 hours ago

Isn't voxtral much better?

yapyap 10 hours ago

Sounds like Msft wanted to coast on the “vibecode” vibe popularity?

walthamstow a day ago

Seems quite heavy for an STT model; Parakeet and Whisper are much smaller and perform great for quick dictation and transcription of longer files. I guess that's due to additional accuracy and speaker diarisation?

The TTS example clip in the repo of 'spontaneous singing' is creepy as fuck

Zopieux 20 hours ago

English only?

simonw 20 hours ago

That was about the text-to-speech model; the speech-to-text one was released in January.

starkeeper 20 hours ago

Microsoft is famous for choosing terrible names but how could they be this terrible.

simjnd 16 hours ago

What a terrible name

villgax 19 hours ago

lol they rug-pulled the 7B for our own safety some months ago