Changing Volume on Remote WebRTC Streams in Chrome

This post is about the difficulties we faced in performing a seemingly simple operation: changing the volume of an audio stream. Imagine a group call with your Skype pals where the volume control is binary: zero or maximum. You could lower the volume on your speakers, but some of your friends may have low sensitivity on their microphones, rendering them inaudible. This was the horror that we had to overcome. Getting volume control to work took us down a rabbit hole of browser bugs spanning upwards of seven years. We did finally manage to get it working, however, and this post documents our solution.

If you’ve had any significant experience building web pages, you’re probably well versed in browser differences. These differences are a direct consequence of the ways in which specs are implemented. On one hand, having a good choice of browsers in the market, each built on a different codebase, is desirable. On the other, we need to deal with the differences that pop up. One such difference that we struggled with for a long time was volume control, specifically on remote WebRTC MediaStreams in Chrome.

On twoseven.xyz, we use WebRTC to enable voice and video chat, a critical feature of our website, and controlling the other participant’s volume is an integral part of this experience. The simple-peer library that we use to establish WebRTC connections gives us access to the remote user’s stream, which we then attach to a <video> element. We did this in the hope that controlling the remote participant’s volume would be as simple as controlling the <video> element’s volume. We’ve never been more wrong. Initially, everything seemed to work as expected on Chrome: the participant’s volume decreased as we changed the video element’s volume. However, this didn’t work across browsers, and even within Chrome it failed on Linux and macOS. The only bug we could find regarding this was from 2012, and it was marked as status: fixed. Further digging revealed that the right way to control volume was to hook up the remote MediaStream to the AudioContext API and then use a GainNode to control the volume. Here’s a sample implementation:

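function setupGainNode (stream) {
  // We assume only one audio track per stream
  const audioTrack = stream.getAudioTracks()[0]
  const ctx = new AudioContext()
  // Wrap the track in its own MediaStream so it can feed the audio graph
  const src = ctx.createMediaStreamSource(new MediaStream([audioTrack]))
  const dst = ctx.createMediaStreamDestination()
  const gainNode = ctx.createGain()
  gainNode.gain.value = 1 // 1 = unchanged volume, 0 = silent
  // Attach src -> gain -> dst (connect() returns the node it was passed)
  src.connect(gainNode).connect(dst)
  // Swap the original audio track for the gain-controlled one
  stream.removeTrack(audioTrack)
  stream.addTrack(dst.stream.getAudioTracks()[0])
  // Return the node so callers can adjust gainNode.gain.value later
  return gainNode
}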

This solution wasn’t perfect either. The moment we added this code into the flow, Chrome stopped producing any WebRTC audio. It did work on Firefox, however. Unfortunately, despite all the love we have for Firefox, it still accounts for only about 10% of the browser market, so we needed to get this working on Chrome. Googling for Chrome’s behavior turned up a couple of bug reports (1, 2). The takeaway from these issues is that this is a Chrome-specific bug and that the stream needs to be attached to a muted <audio> element for the audio graph to work. We tried this as well by adding the following code, but to no avail:

const audio = document.createElement('audio')
audio.style.display = 'none' // We don't want to show this
audio.muted = true
audio.srcObject = stream
audio.play()
webcamContainer.appendChild(audio)

Our Solution

The comments on these bugs hinted that the AudioContext API was failing only on remote streams, not on all streams. This gave us an idea. What if we added the GainNode on the sender’s end rather than the receiver’s end? We would need to clone the original stream for each peer and attach a GainNode to each of these cloned streams. We began architecting our solution.

                                                       +----------------+    stream    +---------------+ remote-stream +---------------+
                                                       |                +------------->+               +-------------->+               |
                                                   +-->+    GainNode    |              |  simple-peer  |               |     video     |
                                                   |   |                +<-------------+               +<--------------+               |
                                                   |   +----------------+     volume   +---------------+     volume    +---------------+
                                                   |
                                                   |
+------------------+     +------------------+      |
|                  |     |                  |      |
|   getUserMedia   +---->+   local-stream   +------+
|                  |     |                  |      |
+------------------+     +------------------+      |
                                                   |
                                                   |
                                                   |   +----------------+    stream    +---------------+ remote-stream +---------------+
                                                   |   |                +------------->+               +-------------->+               |
                                                   +-->+    GainNode    |              |  simple-peer  |               |     video     |
                                                       |                +<-------------+               +<--------------+               |
                                                       +----------------+     volume   +---------------+     volume    +---------------+
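
The sender side of the diagram above could be sketched roughly as follows. This is a minimal sketch under stated assumptions, not a definitive implementation: createPeerChain, peerChains and peerId are hypothetical names introduced for illustration, ctx is assumed to be an AudioContext shared by all peers, and the local stream is assumed to carry a single audio track.

```javascript
// Hypothetical registry: peer id -> { gainNode, stream }
const peerChains = new Map()

// Build one clone of the local stream per peer, with its own GainNode.
// `ctx` is assumed to be a shared AudioContext; `peerId` is any unique key.
function createPeerChain (ctx, localStream, peerId) {
  // Each peer gets its own clone so that gains stay independent
  const outgoing = localStream.clone()
  const [rawAudio] = outgoing.getAudioTracks()

  // src -> gain -> dst (connect() returns the node it was passed)
  const src = ctx.createMediaStreamSource(outgoing)
  const gainNode = ctx.createGain()
  const dst = ctx.createMediaStreamDestination()
  src.connect(gainNode).connect(dst)

  // Swap the raw audio track for the gain-controlled one; video tracks stay
  outgoing.removeTrack(rawAudio)
  outgoing.addTrack(dst.stream.getAudioTracks()[0])

  const chain = { gainNode, stream: outgoing }
  peerChains.set(peerId, chain)
  return chain
}
```

The returned stream is the one that would be handed to simple-peer for that peer (for example through the stream option of its constructor), while the gainNode is kept around so that it can be adjusted when that peer asks for a different volume.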

When a user changes the volume slider on a participant’s video, we send that event directly back to the participant via the RTCPeerConnection using simple-peer. The participant’s client reacts to the event by changing the value of the GainNode associated with this peer connection. In this way, we handle volume changes transparently, without exposing the user to the complexity of the operation. The biggest downside to this approach is latency. On connections that span different parts of the globe, each movement of the slider needs to be sent across to the other end, causing perceptible delays.
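
The volume round trip described above could look something like the sketch below. The message shape ({ type: 'volume', value }) and the helper names sendVolume and handlePeerData are assumptions made for illustration; only peer.send() and the 'data' event come from simple-peer itself. The gain is clamped to the 0 to 1 range on the assumption that the slider only attenuates.

```javascript
// Listener side: called when the user moves the slider on a participant's video.
// `peer` is the simple-peer instance connected to that participant.
function sendVolume (peer, volume) {
  peer.send(JSON.stringify({ type: 'volume', value: volume }))
}

// Speaker side: apply an incoming message to the GainNode for this peer.
// Returns true if the message was a valid volume change, false otherwise.
function handlePeerData (gainNode, data) {
  // Data channel payloads may arrive as strings or as binary (Uint8Array/Buffer)
  const text = typeof data === 'string' ? data : new TextDecoder().decode(data)
  let message
  try {
    message = JSON.parse(text)
  } catch (err) {
    return false // not JSON, so not one of our (hypothetical) control messages
  }
  if (!message || message.type !== 'volume' || typeof message.value !== 'number') {
    return false
  }
  // Clamp to [0, 1]: 0 is silent, 1 is the unmodified microphone level
  gainNode.gain.value = Math.min(1, Math.max(0, message.value))
  return true
}
```

On the speaker’s side this would be wired up with something like peer.on('data', data => handlePeerData(gainNode, data)), where gainNode is the node created for that particular peer.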