voice-ai · speech-recognition · asr · tts · open-source · microsoft · ai-deployment · transcription

Microsoft VibeVoice: What the Open-Source Voice AI Release Means for Real Businesses

Drakon Systems · 6 min read

Voice AI moved another step forward this week. Microsoft has released VibeVoice, an open-source family of models covering both text-to-speech and speech recognition. The headline numbers are impressive, but the picture is more nuanced than the marketing suggests, and the practical answer for most UK businesses sits in one specific corner of the release.

Here is what is actually in the box, what was quietly removed, and where it fits in a real production stack.

What Is VibeVoice

VibeVoice is a research framework from Microsoft that publishes speech models under an open licence. There are three pieces:

  • VibeVoice-ASR-7B, a long-form speech-to-text model, released January 2026
  • VibeVoice-Realtime-0.5B, a small streaming text-to-speech model, released December 2025
  • VibeVoice-TTS-1.5B, a long-form multi-speaker text-to-speech model, released August 2025 and withdrawn in September 2025

The technical claim that ties them together is a continuous speech tokeniser running at an extremely low frame rate of 7.5 Hz, i.e. 7.5 frames per second of audio. In plain English, the models compress audio aggressively enough to hold very long recordings in context without losing fidelity, and process them in a single pass instead of slicing them into chunks. That single design choice is what makes the speech recognition model interesting.
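A quick back-of-the-envelope comparison shows why the frame rate matters. The 7.5 Hz figure comes from the release; the 50 Hz figure is a typical rate for other neural speech codecs, used here only for contrast:

```python
# Rough token-count arithmetic for long-form audio.
# 7.5 Hz is the frame rate quoted for the VibeVoice tokeniser;
# 50 Hz is a typical rate for other neural speech codecs (illustrative).

AUDIO_MINUTES = 60
VIBEVOICE_HZ = 7.5
TYPICAL_HZ = 50.0

seconds = AUDIO_MINUTES * 60
vibevoice_frames = seconds * VIBEVOICE_HZ  # 27,000 frames for a full hour
typical_frames = seconds * TYPICAL_HZ      # 180,000 frames for the same hour

print(f"{AUDIO_MINUTES} min at {VIBEVOICE_HZ} Hz: {vibevoice_frames:,.0f} frames")
print(f"{AUDIO_MINUTES} min at {TYPICAL_HZ} Hz: {typical_frames:,.0f} frames")
```

At 7.5 Hz, an hour of audio is roughly 27,000 frames, a sequence length a 7B model can hold in one context window, which is what makes single-pass hour-long transcription feasible at all.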

The TTS Story Is Mixed

The original VibeVoice-TTS-1.5B made noise in the summer of 2025 because it could generate up to 90 minutes of multi-speaker speech with convincing voice cloning. That capability is also why Microsoft pulled the inference code six weeks after release. The published statement was diplomatic, but the practical problem was obvious — the model was being used to clone real people's voices in ways that were not consistent with the project's intent.

The model weights are still on Hugging Face, but without the official inference code, it is effectively dormant for most users. Anyone reaching for VibeVoice as a TTS replacement today is left with the smaller VibeVoice-Realtime-0.5B, which is genuinely available and well-built but does not displace established commercial TTS providers for most production use cases.

If your need is high-quality voice synthesis for a customer-facing product or a voice agent, options like ElevenLabs, Microsoft's own commercial Azure Speech, or open alternatives like Coqui XTTS are still the more practical starting point. The Realtime-0.5B model is interesting for fully local, multilingual streaming TTS where data never leaves your network. For UK schools, regulated industries, or anyone with a strict data-residency requirement, that is a real benefit. Outside that constraint, it is a sideways move.

VibeVoice-ASR Is the Genuine Win

The speech recognition model is the part of this release that actually changes what you can build.

VibeVoice-ASR-7B is a single model that handles up to 60 minutes of continuous audio in one pass, returns structured output with speaker labels and timestamps, and supports more than 50 languages. It accepts custom hotwords for domain-specific terminology. It was accepted as an oral presentation at ICLR 2026, which is not a marketing badge but a real signal of technical merit.

The practical contrast is with OpenAI's Whisper, which has been the default open-source choice for the last three years. Whisper is excellent for short clips, but it slices longer audio into 30-second windows. That works, but it loses global context — speakers drift between labels, names get re-spelled inconsistently, and long meetings come out as a list of fragments rather than a coherent transcript.

VibeVoice-ASR keeps the entire conversation in context, which means:

  • speaker identity stays consistent across the full hour
  • specialist terms introduced early are recognised reliably later
  • timestamps line up because the model is not stitching chunks together
  • diarisation is part of the model output, not a separate post-processing step

For anyone running transcription on long audio — meetings, interviews, training calls, classroom recordings, podcasts, support call archives — this is a meaningful step up.
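As a concrete sketch of what that looks like in practice, here is a minimal transcription call. The model id, the hotwords argument, and the output schema are illustrative assumptions based on the capabilities described above, not a confirmed API:

```python
# Minimal sketch of a single-pass, speaker-labelled transcription call.
# The model id, the `hotwords` kwarg, and the output schema are assumptions
# for illustration, not a confirmed interface.
from transformers import pipeline

asr = pipeline(
    "automatic-speech-recognition",
    model="microsoft/VibeVoice-ASR-7B",  # hypothetical model id
    torch_dtype="float16",               # half precision: fits a 24 GB GPU
    device_map="auto",
)

result = asr(
    "board_meeting_2026-01.wav",  # up to ~60 minutes in one pass
    generate_kwargs={"hotwords": ["EBITDA", "Ofsted"]},  # domain terms (assumed kwarg)
)

# Assumed structured output: segments with speaker labels and timestamps.
for seg in result.get("segments", []):
    m, s = divmod(seg["start"], 60)
    print(f"[{int(m):02d}:{s:04.1f}] {seg['speaker']}: {seg['text']}")
```

The point of the single-pass design is that those segments come back from one model invocation, not from stitching 30-second windows together afterwards.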

Where This Lands in Production

We see four immediate use cases for UK businesses:

Meeting and call transcription. Sales teams, professional services firms, and consultancies generate hours of recorded calls every week. A pipeline that turns those into clean, searchable, speaker-labelled transcripts becomes much more reliable when the model holds the full conversation in mind. For Microsoft Teams, Zoom, or Google Meet recordings, this slots in cleanly.
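One typical post-processing step in such a pipeline is merging consecutive segments from the same speaker into readable turns before the transcript is indexed for search. A sketch, using the segment fields assumed above:

```python
# Sketch: collapse consecutive same-speaker segments into readable turns,
# the usual cleanup step before indexing transcripts for search.
# Segment fields (speaker/start/text) follow the assumed schema above.

def merge_turns(segments):
    turns = []
    for seg in segments:
        if turns and turns[-1]["speaker"] == seg["speaker"]:
            turns[-1]["text"] += " " + seg["text"].strip()
        else:
            turns.append({
                "speaker": seg["speaker"],
                "start": seg["start"],
                "text": seg["text"].strip(),
            })
    return turns

segments = [
    {"speaker": "S1", "start": 0.0, "text": "Right, shall we start?"},
    {"speaker": "S1", "start": 2.1, "text": "First item is the Q3 numbers."},
    {"speaker": "S2", "start": 5.4, "text": "Happy to take that one."},
]

for turn in merge_turns(segments):
    print(f"{turn['speaker']} @ {turn['start']:.1f}s: {turn['text']}")
```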

Compliance-grade recording for regulated industries. Financial services, healthcare, and education all have recording obligations. Long-form ASR with speaker diarisation makes those recordings actually useful, not just stored. For UK schools in particular, accurate transcription of safeguarding interviews and parent meetings has both compliance and accessibility value.

Multilingual customer support. With more than 50 languages and consistent speaker tracking, support teams handling multilingual inbound calls can transcribe and route faster, without losing context when the call switches language mid-conversation.

Accessibility and captioning. Long-form video captioning, podcast transcripts, and webinar archives all benefit from coherent single-pass transcription. The result needs less manual cleanup before publishing.

The hardware cost is reasonable. The 7B model fits on a single 24 GB consumer GPU at half precision, which puts it within reach of in-house deployment for organisations that already run any local AI workload. Cloud deployment on a single H100 or L40S is straightforward.
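A rough sizing check backs that up; the overhead multiplier below is an assumption rather than a measured figure:

```python
# Back-of-the-envelope VRAM estimate for a 7B model at half precision.
# The overhead factor is a rough allowance for activations and KV cache,
# not a measured number.

PARAMS = 7e9
BYTES_PER_PARAM = 2   # fp16 / bf16
OVERHEAD = 1.4        # assumed multiplier for activations + KV cache

weights_gb = PARAMS * BYTES_PER_PARAM / 1e9
print(f"Weights alone: ~{weights_gb:.0f} GB")             # ~14 GB
print(f"With overhead: ~{weights_gb * OVERHEAD:.0f} GB")  # ~20 GB, inside 24 GB
```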

How We Help

This is the kind of model we deploy for clients who need transcription that stands up to real-world content rather than clean podcast samples. A typical engagement covers:

  • evaluating VibeVoice-ASR against your actual audio (accents, background noise, jargon, multiple speakers) before you commit to it
  • building the pipeline around it — ingest, transcription, diarisation, post-processing, storage, search
  • integrating with your existing stack, whether that is Microsoft 365, Google Workspace, your CRM, or a bespoke platform
  • handling the data-residency questions that come with voice — what gets sent where, what is stored, who has access

For organisations that have been waiting for an open-source ASR model that handles long-form audio properly, this is the one to evaluate. For TTS, the open-source landscape has not changed as much as the headlines imply — the practical recommendations are still ElevenLabs, Azure Speech, or a self-hosted Coqui setup depending on your constraints.

If you are running transcription at any scale and want to know whether VibeVoice-ASR would actually improve your output, we can run a side-by-side evaluation on your own audio. The answer is usually clear within an hour of testing on real material.

Get in touch to discuss a voice AI evaluation for your business.
