We discussed traditional multi-mic noise cancellation in the previous post. Such technologies can only be applied on user devices (phones, laptops) that have multiple mics.
In this post we will discuss the challenges of running noise cancellation technology on the server side.
When we built a fully software-based noise cancellation technology at 2hz.ai, a profound question came up: why can’t we run this technology on the server side rather than on phones or laptops?
There is a big value proposition for Communications Service Provider companies here: regardless of which devices their users have, all of their conversations can be noise cancelled on the backend.
See, when a new iPhoneX with better noise cancellation comes out, it doesn’t have much impact on a Service Provider such as Twilio, RingCentral, Fuze or WebEx. This is because the iPhoneX is only a fraction of their overall device population. But if they could noise cancel (denoise) every communication regardless of the user’s device, there is big value in it.
There is more. When you are in the backend you have access to both legs of a call and you can denoise both of them. So you make life “noise-free” not only for your own users but potentially also for everyone they are talking to (users outside your network).
Sounds like a no-brainer. However, it isn’t as simple as it sounds. Let’s talk about some of the challenges.
Well, first you need a technology that works on single-mic audio, since on the server side you obviously don’t have access to dual-mic sources. Your technology must perform as well as a dual-mic technology. We (at 2Hz) think this is not possible without Deep Learning techniques.
Assuming you have such tech, let’s see what other challenges you’ll face.
Both denoising on the device and denoising on the server are technically challenging. The former requires high speed with low CPU and battery consumption, while the latter requires high speed, high concurrency and low power consumption. For an average service provider you must be able to process 10K calls concurrently; for bigger ones this can be 100K concurrent calls. If your algorithms are based on Deep Learning, it might be impossible to support such concurrency in inference with CPUs alone. You would most likely need GPUs to scale it.
To give you an example, with an average size DNN, if the latest Intel CPU core can denoise up to 10 audio streams in parallel, a single medium AWS GPU can scale up to 500 such streams. Given that GPUs are not always easily available in datacenters and cloud services, this might become a challenge.
You need to think hard to make such a solution cost effective.
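To make the scale concrete, here is a rough back-of-the-envelope sketch in Python. It uses only the figures quoted above (10 streams per CPU core, 500 streams per GPU, 10K or 100K concurrent calls). The function name units_needed and the legs_per_call parameter are our own illustrative assumptions, and real capacity will depend on the model, the hardware and the headroom you keep.

```python
import math

# Figures quoted in the text above; treat them as rough examples.
CPU_STREAMS_PER_CORE = 10
GPU_STREAMS_PER_GPU = 500


def units_needed(concurrent_calls, streams_per_unit, legs_per_call=1):
    """Return how many cores (or GPUs) are needed to denoise every stream at once.

    legs_per_call is a hypothetical knob: 1 if only one leg of each call
    is denoised, 2 if both legs are.
    """
    total_streams = concurrent_calls * legs_per_call
    return math.ceil(total_streams / streams_per_unit)


if __name__ == "__main__":
    for calls in (10_000, 100_000):
        cores = units_needed(calls, CPU_STREAMS_PER_CORE)
        gpus = units_needed(calls, GPU_STREAMS_PER_GPU)
        print(f"{calls} calls: about {cores} CPU cores or {gpus} GPUs")
```

With these numbers, 10K concurrent calls come out to roughly 1,000 CPU cores versus 20 GPUs, and denoising both legs of every call doubles both figures.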
Another thing to worry about is end-to-end voice latency.
When two people talk over telephony, the maximum latency that still allows a comfortable, normal conversation is around 300ms. If the denoising technology introduces significant latency on the server side (due to the underlying math or simply the speed of the algorithm), this might degrade the overall perceptual quality of the call. High latency might also end up introducing more jitter problems and hence more “voice chopping”, which is in general a worse problem than background noise.
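One simple way to reason about this is as a latency budget. The sketch below is only an illustration: the 300ms limit comes from the paragraph above, while the breakdown of denoising delay into frame buffering, lookahead and inference time, and every number in the usage example, are our own assumptions rather than measurements.

```python
COMFORTABLE_LATENCY_MS = 300  # approximate limit quoted in the text above


def remaining_budget_ms(existing_path_ms, frame_ms, lookahead_ms, inference_ms):
    """Return the latency budget left after adding server-side denoising.

    Assumed (simplified) model: the denoiser must buffer one frame, may wait
    for some lookahead audio (the "underlying math"), and then spends some
    time on inference (the "speed of the algorithm").
    A negative result means the call is past the comfortable limit.
    """
    denoise_ms = frame_ms + lookahead_ms + inference_ms
    return COMFORTABLE_LATENCY_MS - (existing_path_ms + denoise_ms)


if __name__ == "__main__":
    # Hypothetical numbers, for illustration only.
    left = remaining_budget_ms(existing_path_ms=200, frame_ms=20,
                               lookahead_ms=10, inference_ms=5)
    print(f"Remaining budget: {left} ms")
```

If the remaining budget goes negative for a typical call path, the denoiser is costing more in conversation quality than it gains by removing noise.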
Special Dial Tones and Music
Imagine that you are calling your bank’s call center. You hear a signal, then a machine says “please hold while our representative assists you”, and then music plays. After some time the same bot says the same sentence, followed by more music. This goes on until a human is ready to take your call.
If you plan to denoise the incoming legs of calls as well, you need a strategy for handling such situations. It turns out this is extremely difficult. On the outgoing leg you can cancel everything but the human voice, but here you have complex things such as music, dial tones and other special signals which you should leave intact.