This is the first Speech Enhancement API of its kind.
The API can apply three enhancements to speech, all powered by state-of-the-art deep learning.
/recover – fixes voice breakups in the audio caused by bad network conditions. Learn more about fixing voice breakups in this blog post.
/denoise – suppresses background noise, leaving only the human voice in the audio. Learn more about noise suppression in this blog post.
/expand – expands the audio from narrowband to wideband resolution. Learn more about bandwidth expansion in this blog post.
An example request looks like this:
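The original example request did not survive here, so the sketch below shows roughly what a call might look like. The base URL, query parameter, and auth header are illustrative assumptions, not the documented API surface; the request is built but not sent.

```python
import urllib.request

def build_enhance_request(audio_bytes, endpoint="denoise", model="DENOISE_PLAY_8000"):
    """Build (but do not send) a POST request to one of the enhancement endpoints.

    The base URL, ``model`` query parameter, and Authorization header below
    are assumptions for illustration -- check the API reference for the
    real values.
    """
    url = f"https://api.example.com/{endpoint}?model={model}"
    return urllib.request.Request(
        url,
        data=audio_bytes,  # raw audio payload
        method="POST",
        headers={
            "Content-Type": "audio/wav",
            "Authorization": "Bearer YOUR_API_KEY",  # placeholder credential
        },
    )

req = build_enhance_request(b"\x00" * 16)
```

Sending the request (e.g. with `urllib.request.urlopen(req)`) would return the enhanced audio in the response body.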
We see three use cases for the API:
Improve Speech Recognition accuracy (call centers, hospitals, banks, etc.)
Enhance noisy voicemails for playback (messengers, carriers sending voicemails)
Enhance audio/video recordings for playback (podcasts, video blogs, conference recordings, etc.)
We have trained different Machine Learning models for these use cases. Below are the currently supported models. Over time we will add more models to the mix.
This model recovers voice breakups in 8kHz audio, producing output that works best with Speech-to-Text (STT) models.
The model works only on audio that has been encoded/decoded with the Opus codec. Support for more codecs and sampling rates will be added down the road.
This model yields a significant Word Error Rate (WER) improvement when tested against state-of-the-art STT APIs (Google, IBM, Bing, etc.). We will share more details in a separate blog post.
DENOISE_PLAY_8000 and DENOISE_PLAY_16000
Our engine recognizes what is background noise and separates it from the human voice in the audio.
These models suppress background noise in 8kHz and 16kHz audio respectively. On average they increase the Mean Opinion Score (MOS) of the audio by 1.2, a remarkable result.
A good use case for these models is voicemails or audio/video recordings that must be played back and heard by users.
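Because the two denoise models target specific sampling rates, the caller has to pick the one matching the input audio. The model names below come from this page; the helper function itself is our own sketch:

```python
def pick_denoise_model(sample_rate_hz):
    """Map an input sampling rate to the matching DENOISE_PLAY model.

    Only 8 kHz and 16 kHz inputs are supported by the current models.
    """
    models = {8000: "DENOISE_PLAY_8000", 16000: "DENOISE_PLAY_16000"}
    try:
        return models[sample_rate_hz]
    except KeyError:
        raise ValueError(f"No denoise model for {sample_rate_hz} Hz input")

print(pick_denoise_model(16000))  # → DENOISE_PLAY_16000
```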
This model expands audio from an 8kHz sampling rate to 16kHz by predicting the higher frequencies of the human voice and filling them in.
This model is particularly effective for Speech-to-Text (STT). If your STT model was trained on 16kHz data but your inference data is 8kHz, you can use this model to expand the data to 16kHz before running it through your STT model. The result will have an improved WER.
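The pre-processing decision described above, expanding 8kHz input to match a 16kHz STT model, can be sketched as a simple check on the input's sampling rate (the `/expand` call itself and any downstream STT step are left out):

```python
import io
import wave

def needs_expansion(wav_bytes, stt_rate_hz=16000):
    """Return True if the WAV's sampling rate is below what the STT model
    expects, i.e. the audio should first go through /expand."""
    with wave.open(io.BytesIO(wav_bytes)) as wav:
        return wav.getframerate() < stt_rate_hz

# Build a tiny silent 8 kHz mono WAV in memory to exercise the check.
buf = io.BytesIO()
with wave.open(buf, "wb") as wav:
    wav.setnchannels(1)
    wav.setsampwidth(2)      # 16-bit samples
    wav.setframerate(8000)   # narrowband input
    wav.writeframes(b"\x00\x00" * 80)

print(needs_expansion(buf.getvalue()))  # → True
```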