Zscaler Outage History in 2026: Why a Shared Cloud Control Plane Is Your Single Point of Failure

Zscaler is the category leader, and its most disruptive outages have been self-inflicted, triggered by its own scheduled maintenance hitting its own proxy nodes. When every device routes through a shared cloud control plane, one bad change can take the whole fleet offline at once. dope.security runs security on the device instead, so there is no shared control plane to become your single point of failure. Here is what the outage record shows and why architecture, not effort, is the difference.
Why does Zscaler go down, and how often?
Zscaler is a strong product with a real problem baked into its shape: it is a proxy in the data path. Every request from every device forwards to a Zscaler ZEN or Service Edge node before it reaches the internet. That design centralizes control, and it also centralizes failure. When the thing in the middle has a bad day, everyone behind it has a bad day.
The outage record is documented, and it points at maintenance, not just bad luck. On October 25, 2022, a subset of ZIA proxies saw 100% packet loss after internal maintenance affected the virtual IPs (VIPs) of nodes that Zscaler itself runs. On January 19, 2025, a multi-service outage landed during scheduled maintenance, and Dark Reading framed it as a redundancy problem rather than a one-off. The pattern across these incidents is consistent: the platform was taken down by changes to the platform. That is the signature of a shared control plane, and it is structural.
The control plane is the single point of failure
The number one pain point for cloud-proxy security service edge (SSE) platforms is not the latency everyone complains about first. It is that the cloud is a single point of failure and the control plane is the weak spot. Many of these outages are self-inflicted through maintenance or a single bad configuration, and the cruel part is what happens next: customers often lose their dashboards and their logs in the middle of the incident, exactly when they need visibility most.
Think about the blast radius. If your security enforcement lives in a shared cloud tier and that tier goes down, you do not lose one office. You lose every user, everywhere, at the same moment, and you may lose the console you would use to diagnose it. There is no graceful degradation when the single point of failure is the thing failing. We cover the broader pattern in why teams are replacing Zscaler in 2026.
Agent-based architecture has no shared control plane to take down
dope.security is built the other way around. Security runs as a lightweight agent on the device, inspecting traffic on-device and sending it straight to its destination. We call it Fly Direct. There is no node in the middle that every user depends on, which means there is no shared tier whose maintenance window becomes a company-wide outage.
The agent also has a fallback mode with cached policies, so a device keeps enforcing even if it cannot reach the cloud console at that moment. Policy still pushes in real time when connectivity is normal, down to the individual and group level, but the device is not helpless without a live tunnel to a data center. That is the practical meaning of removing the single point of failure: a problem in one place does not become a problem everywhere. Our comparison with the Zscaler Client Connector goes deeper on the agent itself, including why a heavy connector that hangs on network transitions is its own reliability tax. The whole model is built on the Fly Direct secure web gateway, which runs the inspection on the endpoint instead of in a shared cloud tier.
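To make the cached-policy idea concrete, here is a minimal sketch of how an agent could keep enforcing from its last known policy when the console is unreachable. It is an illustration only, not dope.security's implementation: `Policy`, `Agent`, `fetch_policy`, and the fail-closed default are all hypothetical names and assumptions chosen for the example.

```python
from dataclasses import dataclass
from typing import Callable, FrozenSet, Optional


@dataclass(frozen=True)
class Policy:
    """Hypothetical policy shape: a version and a set of blocked domains."""
    version: int
    blocked_domains: FrozenSet[str]


@dataclass
class Agent:
    # fetch_policy stands in for a request to a cloud console (assumption);
    # it raises ConnectionError when the console cannot be reached.
    fetch_policy: Callable[[], Policy]
    cached: Optional[Policy] = None

    def refresh(self) -> bool:
        """Try to pull the latest policy; keep the cached copy on failure."""
        try:
            self.cached = self.fetch_policy()
            return True
        except ConnectionError:
            return False

    def allow(self, domain: str) -> bool:
        """Decide locally from the last known policy, with no network call."""
        if self.cached is None:
            return False  # assumption: fail closed until a first policy arrives
        return domain not in self.cached.blocked_domains


# Demo: the console answers once, then becomes unreachable.
_calls = {"count": 0}


def flaky_console() -> Policy:
    _calls["count"] += 1
    if _calls["count"] > 1:
        raise ConnectionError("console unreachable")
    return Policy(version=1, blocked_domains=frozenset({"bad.example"}))


agent = Agent(fetch_policy=flaky_console)
first_refresh = agent.refresh()    # console reachable, policy cached
second_refresh = agent.refresh()   # console down, cached policy kept
blocked_still_blocked = not agent.allow("bad.example")
allowed_still_allowed = agent.allow("ok.example")
```

The point of the design is that `allow` never touches the network, so a failed `refresh` changes how fresh the policy is, not whether the device enforces one.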
What "self-inflicted" really means for your risk model
It is worth sitting with the phrase, because it changes how you should think about uptime. When an outage comes from an external attack or a regional internet failure, you can at least reason about it as bad weather. When the largest outages come from a vendor's own scheduled maintenance and a single bad configuration, the risk is no longer external. It is built into the operating model, and no amount of the vendor working harder removes it, because the failure mode is the design.
A shared control plane has to be maintained, and maintenance means change, and change on a tier that every customer depends on means every change carries fleet-wide downside. That is why a redundancy story does not fully solve it: redundancy protects against a node dying, but it does not protect against a bad config being pushed to all the nodes at once, which is the pattern these incidents show. The only way to remove a single point of failure is to not have one. On an agent-based model, a bad policy reaches the device it was scoped to, not a global tier, so the blast radius is a user or a group, not the company.
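The difference between a node dying and a bad config reaching every node can be shown with a toy model. This is a deliberately simplified sketch, not a model of any vendor's real systems: the fleet, the node names, and both helper functions are hypothetical, and it assumes users can fail over to any healthy node.

```python
from typing import Dict, Set

Fleet = Dict[str, str]  # user -> group


def affected_on_shared_tier(fleet: Fleet, nodes: Set[str], broken: Set[str]) -> Set[str]:
    """Users fail over to any healthy node, so they are only
    affected when no healthy node remains."""
    healthy = nodes - broken
    return set() if healthy else set(fleet)


def affected_on_device(fleet: Fleet, scoped_group: str) -> Set[str]:
    """A bad policy reaches only devices in the group it was scoped to."""
    return {user for user, group in fleet.items() if group == scoped_group}


fleet = {"ana": "finance", "bo": "finance", "cy": "eng", "di": "eng", "ed": "sales"}
nodes = {"node-a", "node-b", "node-c"}

# Redundancy handles a single node failure: nobody is affected.
one_node_dies = affected_on_shared_tier(fleet, nodes, {"node-a"})

# The same bad config pushed to every node leaves nothing to fail over to.
bad_config_everywhere = affected_on_shared_tier(fleet, nodes, set(nodes))

# A bad policy scoped to one group stays in that group.
bad_scoped_policy = affected_on_device(fleet, "finance")

print(len(one_node_dies), len(bad_config_everywhere), len(bad_scoped_policy))
```

In this model, adding more nodes changes nothing about the second case: redundancy multiplies capacity, while a fleet-wide push multiplies the same mistake.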
Distributed by design, not by data center count
Cloud-proxy vendors like to answer reliability questions by counting points of presence. More PoPs, the pitch goes, means more resilience. But PoPs are still a shared tier you route through, and adding more of them does not change the fact that your enforcement depends on reaching one and on the control plane that manages them all staying healthy. It is distribution of capacity, not distribution of failure.
dope.security is distributed in the way that actually matters for uptime: the enforcement point is the device, and there are as many of them as you have endpoints. There is no tier to reach before a user is protected and no console dependency for a device to keep enforcing policy in the moment. When people ask why a company with 250 to 5,000 employees would trust security that does not run through a big-name cloud, this is the answer. The big-name cloud is precisely the part that took the fleet down. Our look at why backhauling to a data center is the wrong default covers the same logic for remote access.
Zscaler versus dope.security: the reliability-relevant differences
The full architectural comparison lives in the complete guide to replacing Zscaler. The table below focuses on the factors that decide whether a single failure becomes a fleet-wide outage, with each Zscaler cell grounded in documented behavior.
| Factor | Zscaler (ZIA) | dope.security |
|---|---|---|
| Architecture | All traffic forwards to a ZEN / Service Edge node in the data path | Agent on the device, traffic flies direct |
| Failure blast radius | Shared control plane; documented outages on Oct 25, 2022 and Jan 19, 2025 tied to maintenance | No shared tier; a device keeps enforcing in fallback mode with cached policies |
| Visibility during an incident | Customers report losing dashboards and logs mid-incident | Enforcement is local to the device, not dependent on a live tunnel |
| Agent footprint | Client can hang up to 60 seconds on wired-to-Wi-Fi transitions (vendor docs) | Under 100 MB RAM, 4x performance vs legacy proxy SWGs |
| AI governance | Prompt DLP and AI scanning sit in separate paid add-ons | Native 3-layer AI governance, no add-on |
| China | Sold as a paid China Premium / Plus uplift | Works in China with no paid uplift |
Zscaler cells reflect documented behavior from vendor docs and dated incident reports. The reliability gap is architectural, not a question of who tries harder.
The hidden cost: you also lose your logs
An outage is bad. An outage where you cannot see what is happening is worse. The recurring complaint in cloud-proxy incidents is that the same control plane that routes your traffic also serves your dashboards and logs, so when it falters you lose enforcement and observability together. Your incident response starts blind.
On-device enforcement separates those concerns. Because policy lives and runs on the endpoint, a console hiccup does not strip a device of its protection, and telemetry is not the only thing standing between your users and the open internet. This is the quieter reason teams move: not just uptime numbers, but the confidence that a vendor's maintenance window is not also your blackout. Our honest Zscaler review gets into what buyers discover about operations after they sign.
What this means for renewal math
Reliability is a line item even when it is not on the invoice. Every fleet-wide outage is engineering hours, helpdesk volume, and a credibility hit with the business. When you price a renewal, price the architecture that produced those incidents, not just the per-seat number. Zscaler's stacked editions and add-on AI tooling already make the bill hard to map, and reports of SKUs rising 35% or more without a public announcement do not help. We break the pricing down in what Zscaler actually costs in 2026.
The teams that switched did the same math. A mid-market healthcare organization that replaced Zscaler with dope.security wanted enforcement that did not depend on a distant shared tier staying healthy, and got a deployment that runs on the device with no backhaul and no console to lose mid-incident.
What to ask before your next renewal
If reliability is the concern, three questions cut through the marketing. First, when the control plane is under maintenance, what happens to my enforcement and my logs? If the honest answer is that both can degrade together, you have your single point of failure. Second, can a device keep enforcing policy if it cannot reach the cloud right now? On a backhauled proxy the answer is effectively no, because the proxy is the enforcement. On an agent model the device holds cached policy and keeps working. Third, what is the blast radius of one bad configuration push? On a shared tier it is everyone, and the documented incidents show that is not hypothetical.
Run those questions past your current setup and past any alternative. The point is not that outages never happen anywhere. It is that the architecture decides whether an outage is a contained event or a company-wide one, and whether you can see what is happening while it unfolds. A move off Zscaler is, at its core, a move off a shared dependency, which is why teams frame it less as a product swap and more as removing a structural risk. Our honest comparison of Zscaler and dope.security lays the two models side by side.
Reliability is an architecture decision
Zscaler is not unreliable because its team is careless. It is exposed because everything runs through a shared cloud control plane, and a shared control plane means a single change, a single maintenance window, or a single bad config can take the whole fleet down at once, often with the dashboard going dark right when you need it. That is the documented pattern across its largest outages. dope.security removes the shared point of failure entirely by running security on the device, so a problem in one place stays in one place. If you are tired of inheriting someone else's maintenance window, that is the difference that matters.
See how on-device enforcement holds up when the cloud does not. Read the complete guide to replacing Zscaler or start a free trial of dope.security.



