In early 2020, Zoom grew from 10 million users per day to over 300 million users per day.
That kind of growth requires some serious scaling, and Zoom pulled it off with minimal service disruptions. It makes you wonder what technology Zoom is built on!
This ability to scale was due in large part to Zoom’s technical architecture and infrastructure design. But being able to scale isn’t the only thing that makes Zoom such a powerhouse. It also offers highly reliable video and voice conferencing. Again, this is thanks to Zoom’s tech stack and proprietary protocols.
In this article, we’re taking a deep dive into Zoom’s tech stack and architecture. You’ll learn how Zoom was able to scale so rapidly, how they implemented such a reliable video streaming service, and how you can build a video conferencing app like Zoom.
Let’s start with scaling.
How Zoom Managed To Scale 224% in 3 Months
Like many tech companies, Zoom was designed to scale. Pre-pandemic (early 2020), their 19 interconnected data centers were running at half capacity. This gave them plenty of wiggle room to absorb the initial surge in users.
Even so, they still had to improvise.
Zoom uses both AWS and Oracle servers for non-meeting functionality such as call scheduling and participant management. They usually use their own data centers for the actual video calls. However, if their data centers are being over-taxed, Zoom can offload these video calls to the AWS or Oracle servers.
This option of as-needed resources enables Zoom to remain flexible and scale quickly. They can easily pull the lever to expand or contract their cloud resources on AWS and Oracle to keep the meetings running smoothly.
For example, Zoom can host some of its calls on AWS and Oracle servers during peak meeting hours. Then, when things quiet down again, they can pull back.
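To make the overflow idea concrete, here is a minimal sketch of a placement rule that prefers a company's own data centers and spills over to cloud pools once utilization passes a threshold. The `Pool` class, the 80% threshold, and all capacity numbers are assumptions made up for illustration. They are not Zoom's actual values or logic.

```python
from dataclasses import dataclass


@dataclass
class Pool:
    """A pool of meeting capacity (names and numbers are hypothetical)."""
    name: str
    capacity: int      # maximum concurrent meetings the pool can host
    active: int = 0    # meetings currently hosted

    def has_room(self, threshold: float = 1.0) -> bool:
        # True while one more meeting keeps utilization at or below the threshold.
        return (self.active + 1) / self.capacity <= threshold


def place_meeting(own: Pool, cloud: list[Pool], threshold: float = 0.8) -> str:
    """Prefer the company's own data centers; spill to cloud pools when busy."""
    if own.has_room(threshold):
        own.active += 1
        return own.name
    for pool in cloud:
        if pool.has_room():
            pool.active += 1
            return pool.name
    raise RuntimeError("no capacity available")


own = Pool("own-datacenter", capacity=10, active=7)
cloud = [Pool("aws", capacity=5), Pool("oracle", capacity=5)]
placements = [place_meeting(own, cloud) for _ in range(3)]
print(placements)  # ['own-datacenter', 'aws', 'aws']
```

The first meeting still fits under the threshold, so it stays in the company's own data center. The next two push past it and land on the first cloud pool with room. When meetings end and `active` drops, new meetings go back to the company's own hardware.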
Zoom’s Server Infrastructure
We already covered the fact that Zoom has 19 interconnected data centers. What we didn’t mention earlier is that these data centers are spread across the globe. This decision was made to reduce the latency of video calls.
The closer a device is to a data center, the quicker the communication is received. So, Zoom uses geolocation to pinpoint which data center is closest to each user and then routes all information flow through that center.
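Here is a rough sketch of what "pick the closest data center" can look like in code. It uses the haversine formula to compute great-circle distance from a user's coordinates to each site. The data center names and coordinates are hypothetical, and a real system would likely also consider measured latency and load, since geographic distance is only a proxy.

```python
import math

# Hypothetical data center locations as (latitude, longitude) in degrees.
DATA_CENTERS = {
    "us-west": (37.34, -121.89),
    "us-east": (40.71, -74.01),
    "eu-central": (50.11, 8.68),
    "ap-southeast": (1.35, 103.82),
}

EARTH_RADIUS_KM = 6371.0


def haversine_km(a, b):
    """Great-circle distance in kilometers between two (lat, lon) points."""
    lat1, lon1 = map(math.radians, a)
    lat2, lon2 = map(math.radians, b)
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(h))


def nearest_data_center(user_location):
    """Return the name of the data center closest to the user."""
    return min(DATA_CENTERS,
               key=lambda name: haversine_km(user_location, DATA_CENTERS[name]))


print(nearest_data_center((48.86, 2.35)))     # a user in Paris
print(nearest_data_center((34.05, -118.24)))  # a user in Los Angeles
```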
Zoom's server infrastructure is one of the many system architecture decisions that were made with the user experience in mind.
Zoom’s Video Architecture
As we just mentioned, user experience was a top priority when it came to designing the Zoom system architecture. This brings us to the core of Zoom’s success: optimized video architecture. Zoom provides a fantastic video experience, navigating video’s demanding requirements with ease.
Zoom’s video architecture has four key features:
1. Distributed Architecture
As we already mentioned, Zoom’s data centers are spread across the globe. This distributed network decreases latency, giving users higher quality video experiences in every meeting—no matter where they are located.
2. Multimedia Routing
Each participant in a video call generates multiple video streams of different qualities (e.g. 360p, 720p, and 1080p). Zoom’s multimedia router (MMR) fills the role that an MCU (Multipoint Control Unit) plays in a traditional conferencing system. In other words, it's responsible for getting the appropriate stream (with the right quality) to the other participants. In a typical MCU system, the identification step occurs before the stream is sent to the client, so only one stream is sent.
However, that requires heavy computing, which vastly limits the scalability of the system. Zoom works a different way—first sending all the streams from a video participant to the client and then identifying the appropriate stream.
By separating the video processing from routing, Zoom significantly reduces the amount of computing that's required. With Zoom's MMR system, up to 15x more people can participate in a video conference than a typical MCU-powered system would allow.
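To show why selecting among pre-encoded streams is so much cheaper than processing video centrally, here is a small sketch of the selection step. It picks, for each viewer, the smallest published layer that covers the size of the tile the video will be shown in. The `Layer` class, the bitrates, and the tile sizes are assumptions for illustration, and this is a generic selection routine rather than Zoom's MMR logic.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Layer:
    height: int  # vertical resolution in pixels
    kbps: int    # hypothetical bitrate for this layer


# Layers one sender publishes; the values are illustrative only.
LAYERS = [Layer(360, 600), Layer(720, 1500), Layer(1080, 3000)]


def select_layer(layers, tile_height):
    """Pick the smallest layer that covers the tile, else the largest available."""
    ordered = sorted(layers, key=lambda layer: layer.height)
    for layer in ordered:
        if layer.height >= tile_height:
            return layer
    return ordered[-1]


# Each receiver shows the sender in a differently sized tile.
tiles = {"alice": 1080, "bob": 180, "carol": 540}
forwarded = {name: select_layer(LAYERS, h).height for name, h in tiles.items()}
print(forwarded)  # {'alice': 1080, 'bob': 360, 'carol': 720}
```

Notice that nothing here decodes or re-encodes video. The selection is a lookup over streams that already exist, which is why it costs so little compute per participant.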
3. Multi-bitrate Encoding
Usually, a client would need to encode and decode multiple streams to provide different resolutions of video and audio. However, Zoom manages different network capabilities and devices using multi-bitrate encoding. This multi-bitrate encoding allows a single stream to adjust to multiple different resolutions by itself—providing higher reliability and quality.
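As a rough illustration of adapting quality to network conditions, the sketch below smooths bandwidth samples with an exponentially weighted moving average and then picks the highest quality level that fits within a safety margin. The bitrate ladder, the smoothing factor, and the 80% headroom are all assumed values, and this is a generic adaptation loop, not Zoom's encoder.

```python
# Hypothetical bitrate ladder: (label, required kbps), sorted low to high.
LADDER = [("360p", 600), ("720p", 1500), ("1080p", 3000)]


def smooth(previous_kbps, sample_kbps, alpha=0.5):
    """Exponentially weighted moving average of bandwidth samples."""
    return alpha * sample_kbps + (1 - alpha) * previous_kbps


def pick_rung(estimate_kbps, headroom=0.8):
    """Highest rung whose bitrate fits within a fraction of the estimate."""
    budget = estimate_kbps * headroom
    best = LADDER[0][0]  # never drop below the lowest rung
    for label, kbps in LADDER:
        if kbps <= budget:
            best = label
    return best


estimate = 4000.0
history = []
for sample in [4000, 2000, 1000, 1000, 4000]:
    estimate = smooth(estimate, sample)
    history.append(pick_rung(estimate))
print(history)  # ['1080p', '720p', '720p', '360p', '720p']
```

The smoothing keeps a single bad sample from forcing an immediate drop, and it also makes recovery gradual: after the network improves in the last step, quality climbs back to 720p rather than jumping straight to 1080p.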
4. Application-level QoS (Quality of Service)
Normally, QoS technologies are deployed on the network layer (i.e., before the data is sent to the client-side application). Zoom, however, created a proprietary QoS solution that lives on the application itself. This allows Zoom to optimize the audio, video, and screen-sharing experience for each specific device on which the application is loaded.
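One thing an application-level QoS layer can do that a network-level one can't is make decisions per media type. The sketch below splits a bandwidth budget across audio, screen sharing, and video in priority order. The priority order and the bitrates are assumptions chosen for illustration. Zoom's proprietary QoS logic isn't public, so treat this as one possible policy rather than theirs.

```python
def allocate(budget_kbps, demands):
    """Grant each media type its demand in priority order until the budget runs out.

    demands: list of (name, wanted_kbps) pairs, highest priority first.
    """
    grants = {}
    remaining = budget_kbps
    for name, wanted in demands:
        grants[name] = min(wanted, remaining)
        remaining -= grants[name]
    return grants


# Assumed priority order and bitrates, for illustration only.
DEMANDS = [("audio", 64), ("screen_share", 500), ("video", 1500)]

print(allocate(3000, DEMANDS))  # {'audio': 64, 'screen_share': 500, 'video': 1500}
print(allocate(800, DEMANDS))   # {'audio': 64, 'screen_share': 500, 'video': 236}
```

With a tight budget, video is the first thing to degrade while audio stays intact. That kind of trade-off is easiest to make inside the application, which knows what each stream is for.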
Now that we've covered the high-level solutions Zoom employs to offer such reliability at scale, let’s get into the nitty-gritty and talk about their specific tech stack.
Zoom’s Front-End Technology Stack
Let’s start with the frontend and work our way to the backend.
Zoom offers an app for all the popular platforms: iOS, Android, Web, PC, and macOS. Surprisingly, Zoom built a native app for each platform rather than a hybrid app. Here are Zoom's native apps with the front-end language(s) that were used to build each one:
- Android: Java
- iOS: Swift
- Mac Desktop app: Swift/Objective-C
- PC Desktop app: C/C#/Java
Zoom uses several connection methods to communicate between clients and servers: UDP, TCP, SSL, and peer-to-peer (P2P).
If there are only two participants, a peer-to-peer protocol is used. However, if there are more than two participants, Zoom uses a fallback strategy. For example, when a client connects to the server, it attempts to do so via UDP. If that doesn’t work, it tries via TCP. And if that fails, it tries via SSL. This flexibility in terms of which network protocol the client uses leaves room for a lot of optimization.
Zoom defaults to UDP because that protocol doesn’t retransmit lost packets (small pieces of data that never make it to their destination). This means UDP has less overhead and lower latency than the alternatives, which are important properties for real-time video and voice calls.
Unlike UDP, TCP retransmits missing packets and holds back later data until they arrive, which causes higher latency and delay in the video. So, it's a fallback choice only used when UDP isn't working. SSL runs on top of TCP and adds encryption and handshake overhead, leaving it third in line, to be used only when necessary.
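The fallback strategy described above (UDP first, then TCP, then SSL) can be sketched as an ordered list of connection attempts. The connector functions below are stand-ins that simulate success or failure, so the example runs without a network. Their names and the `ConnectError` class are hypothetical, not part of any Zoom SDK.

```python
class ConnectError(Exception):
    """Raised by a connector when its transport cannot be established."""


def connect_with_fallback(attempts):
    """Try each (name, connect) pair in order and return the first that works."""
    errors = []
    for name, connect in attempts:
        try:
            connect()
            return name
        except ConnectError as exc:
            errors.append(f"{name}: {exc}")
    raise ConnectError("all transports failed: " + "; ".join(errors))


# Stand-ins for real connection attempts (simulated outcomes).
def udp_blocked():
    raise ConnectError("UDP blocked by firewall")


def tcp_ok():
    return None


def ssl_ok():
    return None


chosen = connect_with_fallback(
    [("udp", udp_blocked), ("tcp", tcp_ok), ("ssl", ssl_ok)]
)
print(chosen)  # tcp
```

Because the order lives in a plain list, it's easy to change the preference or add another last-resort transport without touching the retry logic.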
How Zoom Works: A High-Level Architecture
Let’s zoom out (heehee) and take a high-level look at how Zoom’s components work together.
Here's a Zoom architecture diagram that will help you visualize the core components as we discuss them.
The Zoom client is the app an individual uses to participate in a conference call. As we mentioned already, Zoom offers apps for iOS, Android, Web, PC, and Mac. However, no matter which app an individual uses, the way it communicates with the rest of Zoom’s architecture stays the same.
Zoom Data Center
The Zoom data center houses the Meeting Zones. Each meeting zone consists of an MMR (Multimedia Router) and a Zone Controller.
A Meeting Zone is a cluster of servers, usually physically co-located, that host a Zoom call. These Meeting Zones can be located in one of Zoom’s data centers or on an organization’s network (if they use Zoom’s on-premise solution).
A Zone Controller is responsible for all the activity in a given Meeting Zone. It manages new connections and monitors the server load.
As the name suggests, the Multimedia Router is responsible for distributing the audio and video streams to the correct participants in a Meeting Zone.
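Here is a simplified sketch of how a Zone Controller might divide work among routers in a Meeting Zone. It keeps all participants of a meeting on one router and places each new meeting on the least-utilized router. The class, the router names, and the placement policy are assumptions for illustration. The article only tells us that the Zone Controller manages new connections and monitors load.

```python
class ZoneController:
    """Tracks router load in one Meeting Zone (a simplified, hypothetical model)."""

    def __init__(self, router_capacity):
        self.capacity = dict(router_capacity)         # router -> max participants
        self.load = {name: 0 for name in self.capacity}
        self.meetings = {}                            # meeting id -> router

    def join(self, meeting_id):
        """Register one participant and return the router serving the meeting."""
        if meeting_id not in self.meetings:
            candidates = [r for r in self.capacity if self.load[r] < self.capacity[r]]
            if not candidates:
                raise RuntimeError("meeting zone is full")
            # Place a new meeting on the router with the lowest utilization.
            self.meetings[meeting_id] = min(
                candidates, key=lambda r: self.load[r] / self.capacity[r]
            )
        router = self.meetings[meeting_id]
        self.load[router] += 1
        return router


zone = ZoneController({"mmr-1": 100, "mmr-2": 100})
assignments = [zone.join("standup"), zone.join("standup"), zone.join("design-review")]
print(assignments)  # ['mmr-1', 'mmr-1', 'mmr-2']
```

This sketch doesn't handle a single meeting that outgrows its router or participants leaving. A production controller would need both.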
The Web Infrastructure hosts the zoom.us website and multiple internal APIs. The website and APIs are leveraged by both external developers and other pieces of Zoom’s architecture. For example, when you first start a video call, the Web Infrastructure determines which Zoom Meeting Zone to use.
HTTP tunnels are an important part of Zoom’s reliability strategy. Existing on each Meeting Zone and in the Public Cloud, these tunnels offer participants a point of connection should every other network connection strategy fail.
How To Build a Video Conferencing App Like Zoom
With more and more companies going fully remote, video conferencing technology is experiencing quite a boom.
If done correctly, building your own video conferencing app can be very lucrative. While building a company like Zoom requires deep pockets, great marketing, and brilliant engineers, building a video chat app like Zoom can be a breeze.
You’ll first have to decide which platforms you want to support, as each additional platform you choose to support will add further complexities.
Then, you’ll have to determine what back-end technologies to use to provide a reliable and scalable system.
Finally, you’ll have to hire the right engineers for the job.
Luckily, there are ways to speed up the process of building a video conferencing app. There are plenty of built-for-you technologies you can leverage. For example, you can use our video chat SDKs and APIs to easily add real-time chat and video to your app. We support all the major platforms and technologies, including React, Angular, Vue, and many more. Here are step-by-step tutorials to help you build your own Zoom clone:
- Build a Video Conferencing App like Zoom with React
- How to Create a Zoom Clone App for iOS
- Building a Zoom like Video Conferencing App for Android
- Build a Video Conferencing App in React Native
Ready to jump in? Sign up for our developer dashboard and start building your video chat app for free.
If you still have questions, feel free to talk to our experts and get answers before you get started.
Or, if you're interested in learning more about different chat app architectures, check out our guide on chat app architecture and system design.
About the Author
Cosette Cressler is a passionate content marketer specializing in SaaS, technology, careers, productivity, entrepreneurship and self-development. She helps grow businesses of all sizes by creating consistent, digestible content that captures attention and drives action.