So it would be really nice if we could do browser-based end-to-end encrypted video calling, like we can between installed apps on desktop or mobile. We want this so that Zoom itself, or an outside party influencing Zoom or its infrastructure, cannot snoop on or interfere with these video calls. However, despite our habit of calling JavaScript-heavy web pages “apps”, a web app is a completely different beast than a desktop or mobile app. Web apps have a different, weaker defense-in-depth security model than installed desktop/mobile apps, but for ephemeral E2EE (end-to-end encrypted) calls, where the secret identity material has limited scope, this can be acceptable, and should be pursued, to support the large set of users on the web platform.

The Root of Trust

A web app is served from a web server and is usually (but not always) completely reloaded, served, and reprocessed by the web browser every time you load the page. There’s some caching involved, but usually every time you load the page, you are fetching the software to run your web app fresh from an edge server (and not necessarily a server controlled by the people who wrote the app), downloading it and running that software inside a browser sandbox.

In a desktop app or mobile app, the app developer has a private signing key that has been attested to by some OS-level trusted signing authority, something like the Apple App Store or the Google Play Store. The app developer uses this key to sign the application package. When you get the package onto your device, your OS checks the package’s signature using the developer’s public verification key: it checks that the developer who supposedly signed it actually signed it, and it checks that those keys were attested to by the signing authority you already trust, because the keys you need to trust that authority are distributed with your OS.
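
To make the shape of that check concrete, here is a minimal sketch using WebCrypto ECDSA as a stand-in. Real OS package signing uses platform-specific formats and certificate chains, so every name below is illustrative, not any store’s actual mechanism:

```typescript
// Illustrative only: real package signing uses platform-specific
// certificate chains, not bare WebCrypto keys.

// The developer's keypair. In reality, the public half is attested to
// by the store's root keys, which ship with your OS.
const developerKeys = await crypto.subtle.generateKey(
  { name: "ECDSA", namedCurve: "P-256" },
  false, // private signing key is not extractable
  ["sign", "verify"]
);

// Developer side: sign the package bytes before release.
async function signPackage(pkg: Uint8Array): Promise<ArrayBuffer> {
  return crypto.subtle.sign(
    { name: "ECDSA", hash: "SHA-256" },
    developerKeys.privateKey,
    pkg
  );
}

// OS side: verify the signature against the developer's verifying key
// before installing. A failed check means the package is rejected.
async function verifyPackage(
  pkg: Uint8Array,
  signature: ArrayBuffer
): Promise<boolean> {
  return crypto.subtle.verify(
    { name: "ECDSA", hash: "SHA-256" },
    developerKeys.publicKey,
    signature,
    pkg
  );
}
```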

If you don’t trust your OS, ‘welp’.

So this trust relationship chains up to Apple, to Google, etc., so you can trust both that the software came from the developer and that at least the keys the developer used to sign it are trusted by the people who make your OS. Cool, great, sounds nice. You also do this check every time the developer pushes an update or you download the latest version of the software, verifying that this particular version is actually the version you’re supposed to trust and install on your computer, not just any version. This gives us very nice guarantees about the security and provenance of the software you run on your computer.

Reductio ad TLSum

Well, we don’t have any of that for the web. The closest thing we have to that software delivery security model is trusting the connection we download the software over, and that’s it. And we only trust the nearest hop: there may be many connection hops between our computer and the developer’s release server, or the app might be cached on an edge server. We’re trusting TLS (you are using TLS to serve your web app, right?). You also trust TLS when you’re downloading signed software, but on the web, the connection is all you trust; there’s nothing else to save you if you can’t trust that connection. Since you’re only trusting the nearest hop, anyone who can get on the connection between your computer and the web app server can serve you a different version of your web app, and you’ll trust it, because the only thing you were trusting was the TLS connection. Because of this, it is also easier to discriminate targets with a web app than with an installed app: since you only need to compromise the last, localized hop, you can narrow your scope and exposure to, say, only the people who load the web app in New York City. To get a person to install a compromised installed app, you need to produce a verifiably signed update and deliver it locally; that first part is harder. This is a fundamental part of why the web platform’s software security model is generally considered insufficient for high-security applications that need to indefinitely protect sensitive data or key material locally.

However, even mobile apps and desktop apps are getting closer and closer to updating almost every time you use them. For example, Chrome and most modern browsers are called ‘evergreen’ because they update so frequently: you are almost always running the latest version of the browser every time you restart it, since a new version was downloaded in the background and is ready to go as soon as the app restarts. It’s similar for a lot of mobile apps, including your mobile browser: they update very frequently now, downloading new versions in the background, so you’re running software freshly fetched from the internet quite often.

So, there’s a bit of a convergence between installed apps (mobile and desktop apps, which can be signed and checked) and web apps. Almost every time you run an installed desktop/mobile app, you’re running fresh software, although you can verify where it came from. In the web browser, you’re also running fresh software, but you trust it for this use and this use only: you can’t guarantee where the software originally came from, only where it immediately came from (the server).

Zoom In, Enhance

Okay, so why do we care? Apps like Zoom, and others that want to have end-to-end encrypted data going back and forth, usually require you to create an ‘identity key’, or ID key. The identity key establishes ‘I am who I say I am (at least on this device)’. It is usually a signing keypair: an asymmetric pair consisting of a private signing key and a public verifying key. People can use your public verifying key to check that a given signature could only have been generated by whoever controls the paired signing key. Then you also have ephemeral keys, signed with your identity key, that you can use per message or per video call and throw away after use. This is nice to have, since compromising one ephemeral key does not compromise a whole history of calls or messages, just the data for which that key was used (this property is known as ‘forward secrecy’). Now, the problem with building an E2E encrypted application securely as a web app is that the identity keypair is traditionally long-lived. For example, with Signal, when you install it on your phone, it establishes an ID keypair on your device, and then you put the verifying key for your long-term ID keypair on the public Signal server, along with some signed ephemeral keys (they call them ‘prekeys’). When other people want to communicate with you, they fetch this public data from the Signal server and do a handshake with it, establishing a shared secret with which to start encrypting chat messages. Your signing key stays on your device forever, or until you reinstall the app. The reason this works is that you want to tie all your persistent communications back to that long-term identity key, hashing and ratcheting it to generate your ephemeral keys, tying the identity you first started communicating as to all the messages that you exchange. You also want to be able to detect if something happens to the public keys involved, so you can catch interceptions, compromises, or MITMs (meddlers-in-the-middle).
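
As a rough sketch of that key hierarchy (the general shape only, not Signal’s actual X3DH protocol or key types, and all names here are made up):

```typescript
// Sketch: a long-term identity key signs ephemeral "prekeys".
// Signal itself uses X25519 Diffie-Hellman keys; ECDSA stands in here.

// Long-term identity keypair: lives on the device for the life of the install.
const identity = await crypto.subtle.generateKey(
  { name: "ECDSA", namedCurve: "P-256" },
  false, // private key not extractable
  ["sign", "verify"]
);

// Ephemeral keypair: used for one call/message exchange, then discarded.
const ephemeral = await crypto.subtle.generateKey(
  { name: "ECDSA", namedCurve: "P-256" },
  false,
  ["sign", "verify"]
);

// Sign the ephemeral public key with the identity key, so others can
// verify it against your published long-term verifying key.
const ephemeralPub = await crypto.subtle.exportKey("raw", ephemeral.publicKey);
const prekeySignature = await crypto.subtle.sign(
  { name: "ECDSA", hash: "SHA-256" },
  identity.privateKey,
  ephemeralPub
);
// The identity verifying key plus signed prekeys are the public data
// you'd upload to the server for others to handshake with.
```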

TOFUrkey

It’s generally harder to do these things securely on the web because it’s harder to protect long-term secrets in a web browser. The entire platform is ephemeral: the app is ephemeral, the software that you download is ephemeral, the communications that you’re doing are ephemeral. At a software level, it’s hard to keep something safe in the browser if the software you’re relying on to keep it safe changes every time you load the page, and you can’t verify where the changes came from. There are many, many opportunities for someone to mess with your software, because there are many, many times in normal app usage where all of it is fetched over the internet and implicitly trusted, with only the last connection hop verified.

Consider this scenario: an adversary picks just one opportunity to MITM your connection to the server, and that’s the one and only time they modify the web app code, to exfiltrate your long-term secret key from your browser’s storage. The next time you load the page, you never see that compromised code again. But what if your application does not actually need a long-term identity?

For ephemeral audio or video calls (not long-term persistent records such as text messages), do you need a long-term identity key, especially if you are tying identity to an existing outside identity provider? In that case, the identity provider is the one verifying that you are who you say you are, and that you are authorized to access things and take actions as this-or-that identity. Even independent of an identity provider, your ID could just be the ‘you’ who got access to the meeting data: an ID scoped just for that meeting, a personalized meeting ID code, unguessable and unforgeable, handed out by the host, who can identify you and your access point. What if you generate your ID keys just for that meeting? We trust that the person who got the meeting information is the person we intended to get it, because we trust that channel on first use. Then they can access the call and generate their keys, so they can sign and we can verify that they are who they say they are in the context of that call alone, and then do all the E2EE communication as before. The host of the meeting can manage those identities per meeting: we can boot people via a per-user identity that only exists for this call, and block them from rejoining, because they should only know their own per-meeting participant ID, and can’t forge another one.
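
A minimal sketch of what “generate your ID keys just for that meeting” might look like in the browser; the names are hypothetical, and a real implementation would also bind these keys into the call’s group handshake:

```typescript
// Hypothetical sketch: a per-meeting identity, created when you join
// and discarded when the call ends. Nothing here outlives the meeting.
async function joinMeeting(meetingId: string) {
  // Fresh signing keypair, scoped to this one meeting. The private key
  // is non-extractable, so even our own JS can't read the raw key bytes.
  const meetingIdentity = await crypto.subtle.generateKey(
    { name: "ECDSA", namedCurve: "P-256" },
    false, // not extractable
    ["sign", "verify"]
  );

  // Share only the public verifying key with the host and participants.
  const verifyingKey = await crypto.subtle.exportKey(
    "raw",
    meetingIdentity.publicKey
  );

  return { meetingId, meetingIdentity, verifyingKey };
  // When the call ends, drop all references: there is no long-term
  // secret left in the browser for a compromised page load to steal.
}
```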

When the call is finished, we throw all the key material away. Because all the key data is thrown away at the end of the meeting, there’s nothing to protect into the future, not even the software! The chain of trust on a per-meeting basis is much shallower: the software you downloaded for this one call, and the channel the meeting information was delivered over, at the time it was transmitted. If the call is ephemeral, the web app only needs to be trusted for the duration of the meeting.

That design seems a lot more amenable to being implemented as a web app, because it basically boils down to TOFU (trust on first use), except the trust does not perpetuate across uses, so it’s more like TOAU (trust on any use). The trust is ephemeral, the meeting is ephemeral, the ID is ephemeral. For a lot of meetings, this is perfectly acceptable. This is also akin to our trust model for phone calls in the regular world: whoever receives your call is whoever has access to the callee’s phone number at the time of dialing, and as soon as the call is over, you’re done, and all ‘trust’ is pretty much thrown away. If someone else gets that phone number in the future, then you either trust them or you don’t, for that call. You might expect that dialing the number will get you the same person next time, but it’s not guaranteed, and knowing this, if the voice and/or face on the other end doesn’t match our expectation, we can hang up. This is one of the advantages video calls offer over persistent text chat: you can establish trust quickly on the call itself, since it’s harder to fake video or voice in real time (not impossible, but harder).

If you want to tie these per-meeting identities to some other long-term ID, you can use your identity provider (G Suite accounts, a private corporate contact server, etc.) to manage accounts, and then attest that the ephemeral meeting ID keys are bound to those managed identities, which must by definition be authenticated and authorized. But after the meeting is over, you can throw away those per-meeting ID keys, or, if you have some integration with your ID provider, you can log them over time as a sort of key transparency record, if you want. If the meeting generates per-participant IDs that are unguessable and unforgeable (which we can do cryptographically with something like an HMAC), then we can manage those users per meeting the way we manage anyone attached to a long-term identity if they misbehave. If you need to manage abusive people across meetings, that is where you would bring in your long-term identity provider. For one-time-use calls, or small-access calls, you don’t need a long-term identity, and if you don’t need a long-term identity, you can do this in a browser. If you do need long-term identity, you can generate a fresh ID key for every call, attested to by that managed identity, and that can also be done in the browser.
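
Here is one hedged sketch of those HMAC-derived per-participant IDs; the meeting-key handling and naming are assumptions for illustration, not any specific product’s scheme:

```typescript
// Illustrative sketch: the host derives per-participant IDs by HMACing
// participant info under a secret per-meeting key. Without that key,
// participants can't guess or forge each other's IDs, and the host can
// recompute any ID to check it.
async function makeMeetingKey(): Promise<CryptoKey> {
  return crypto.subtle.generateKey(
    { name: "HMAC", hash: "SHA-256" },
    false, // the key never leaves the host
    ["sign", "verify"]
  );
}

async function participantId(
  meetingKey: CryptoKey,
  meetingId: string,
  participant: string // e.g. the invitee's email or invite slot
): Promise<string> {
  const data = new TextEncoder().encode(`${meetingId}:${participant}`);
  const mac = await crypto.subtle.sign("HMAC", meetingKey, data);
  // Hex-encode the MAC; this becomes the participant's ID for this meeting.
  return [...new Uint8Array(mac)]
    .map((b) => b.toString(16).padStart(2, "0"))
    .join("");
}

// To check a presented ID, the host recomputes it for the claimed
// participant and compares (ideally in constant time).
```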

This is not just theoretical, either: Google Duo supports E2EE group calls on Android, iOS… and the web! The web client requires you to log into your Google account, but then it does a group handshake to set up the call, and again every time a user joins or leaves. All key material is thrown away at the end of the call. Because calls are tied to your long-term Google ID, you can block users across calls, and report abusive users to Google. It would be great if Google published more technical material about Google Duo so we could learn from their experience designing, building, and deploying secure E2EE calling apps that work in the browser.

When designing E2EE protocols for persistent vs ephemeral applications, we need to figure out where we need long-term identity in terms of cryptographic keys, and where we don’t. If we don’t need long-term identity, we don’t need to protect as much secret key data in our software platform indefinitely: we only need to protect it ephemerally, which then means we might do this sort of thing on the web. If we can do it on the web, we can give access to many users that desktop apps and mobile apps exclude, which I think is a huge win.