A few nights ago, the alarm system in my home started beeping around 3:00 a.m. Not the kind of siren-plus-screeching noise that tells you someone just broke in or that you forgot to lock the back door again and the wind blew it open. This was the kind of gentle beep that intrudes into your dreams and finally annoys you enough that you drag yourself out of bed to figure out what the problem is. After less than a minute, it stopped by itself, and it was only later that we figured out what had caused it. It turns out that our alarm system is smart enough to sense when the phone line (actually VoIP over cable) is down and to warn you that it won’t be able to signal the monitoring station in the event of a break-in.
What’s the point of this story, other than to confirm in writing that I now know the instruction book for the alarm system is on the same shelf as the cat food? It’s simply that the system had never signaled this problem before, because telecom networks are incredibly reliable. The dial tone (or VoIP equivalent) is virtually always there and its absence is highly unusual.
That’s the premise behind carrier-grade networks. Over decades, telecom service providers have engineered an extensive range of sophisticated features into their networks, to the point where they guarantee “six-nines” reliability. That means the network is guaranteed to be up 99.9999 percent of the time, implying a downtime of no more than 32 seconds per year (which of course happens at 3:00 a.m. so that my alarm system can interrupt a good night’s sleep).
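The downtime figure above is simple arithmetic, and a short sketch makes it easy to check. The function name and the 365-day year below are assumptions made for illustration; they are not taken from any standard.

```python
# Assumption: a 365-day year (leap years ignored).
SECONDS_PER_YEAR = 365 * 24 * 60 * 60  # 31,536,000 seconds


def downtime_seconds_per_year(nines: int) -> float:
    """Return the yearly downtime budget for an availability of N nines.

    Six nines means 99.9999 percent availability, i.e. the network may be
    unavailable for a fraction 10**-6 of the year.
    """
    unavailability = 10.0 ** -nines
    return SECONDS_PER_YEAR * unavailability


if __name__ == "__main__":
    # Six nines works out to roughly 31.5 seconds per year.
    print(f"six nines: {downtime_seconds_per_year(6):.1f} seconds per year")
```

The same function can be applied to any other availability level by changing the number of nines.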
With all the industry initiatives around Network Functions Virtualization (NFV), network reliability has become a hot topic. As service providers refine their plans to progressively introduce NFV into their networks, they are carefully reviewing the implications for six-nines reliability on network infrastructure that incorporates virtualized functions. These functions, of course, can be in the customer premises (e.g. virtual CPE), the network edge (e.g. virtual firewall) or the core (e.g. virtual EPC).
There’s a daunting list of requirements for achieving carrier-grade reliability. They fall into four primary categories, which we’ll discuss briefly here.
The first set of requirements is around network availability. For example, a telecom network needs to support geographical virtual machine (VM) redundancy over at least 500 km, to allow continued operation in the event of a natural disaster such as an earthquake, tsunami or hurricane. When faults do occur, the VM infrastructure must detect and recover very quickly. Faults must be detected and failover triggered in less than 500 ms. The network must support hot data sync so that no calls are dropped when failovers do occur. Virtual network functions (VNFs) must automatically recover from host failures and transport network failures. Above all, the VNF infrastructure software must be extremely high quality when deployed, to ensure fault monitoring, detection, and recovery work flawlessly when needed.
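To give a feel for what detecting a fault within a fixed time budget involves, here is a minimal sketch of a heartbeat-based failure detector. It is purely illustrative and is not how any particular NFV platform works: the class name, the method names and the 200 ms timeout are all hypothetical, chosen only so that detection leaves some headroom inside the 500 ms budget mentioned above.

```python
import time

# From the text: faults must be detected and failover triggered in < 500 ms.
DETECTION_BUDGET_S = 0.5
# Hypothetical choice: declare a VM failed after 200 ms of silence, leaving
# the rest of the budget for triggering the failover itself.
DEFAULT_TIMEOUT_S = 0.2


class HeartbeatMonitor:
    """Tracks the last heartbeat per VM and reports VMs that went silent."""

    def __init__(self, timeout_s=DEFAULT_TIMEOUT_S, clock=time.monotonic):
        self.timeout_s = timeout_s
        self._clock = clock      # injectable so the logic can be tested
        self._last_seen = {}     # vm_id -> time of last heartbeat

    def record_heartbeat(self, vm_id):
        """Note that a heartbeat from vm_id has just arrived."""
        self._last_seen[vm_id] = self._clock()

    def failed_vms(self):
        """Return the sorted ids of VMs silent for longer than the timeout."""
        now = self._clock()
        return sorted(
            vm_id
            for vm_id, seen in self._last_seen.items()
            if now - seen > self.timeout_s
        )
```

A real system would call failed_vms() far more often than the timeout (or use timers) and would then start the failover; the point here is only that the detection timeout has to be comfortably smaller than the overall budget.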
The second area of focus is security. Telecom networks have security requirements that go beyond typical enterprise installations. For example, in a 4G system, there must be no user traffic that is observable but not encrypted. Similarly, visible user data cannot be stored in the system. In an NFV data center or cloud deployment, operators have to implement efficient multi-tenant isolation and security, so that it’s impossible for one subscriber to access another subscriber’s traffic or data. And the network must fully implement AAA (authentication, authorization and accounting) security protocols to prevent unauthorized access, hacking, or terrorist attack.
Thirdly, a carrier-grade network has stringent performance requirements. The network must achieve high throughput but at the same time ensure very low latency for critical real-time applications. The most demanding CPE and access functions require a deterministic interrupt latency of 10 microseconds or less, in order for virtualization to be feasible in the network. Similarly, live migration of VMs must occur with an outage time of less than 150 ms.
Finally, there are critical requirements in the area of network management. The system must support hitless software upgrades and hitless patches, so that no unscheduled maintenance windows are required. There must be an integrated backup and recovery system. And full support must be implemented for a range of standard protocols to interface to the existing operations support systems (OSS) and business support systems (BSS) software.
Service providers know that you can’t achieve these requirements by starting from enterprise-class software that was originally developed for IT applications. This type of software usually achieves three-nines reliability (99.9 percent), equivalent to a downtime of almost nine hours per year. Using that code as a baseline won’t get you to six-nines, no matter how hard you work to improve it. Carrier-grade reliability has to be designed in from the start of the software development process, typically using the well-established, rigorous TL 9000 methodology.
“High availability” is a term that’s been used in the telecom industry for many years. It basically refers to systems that include enough excess capacity in the design to accommodate a performance decline or a subsystem failure. Service providers tell us that this is a key feature of carrier-grade networks, but it’s not sufficient. If you look at the carrier-grade requirements above, you’ll see that there are all kinds of performance-related constraints, none of which are addressed purely by redundancy in either hardware or software.
So don’t confuse “high availability” with “carrier grade”. The latter is much more demanding, but it’s an essential feature of today’s telecom networks. Whether as enterprise customers or consumers, we’ve been conditioned to expect extreme reliability in our networks. Service providers know that they need to continue to meet those expectations as they transition to NFV; otherwise, they run the risk of losing their high-value customers and seeing increased subscriber churn. No new technology is worth that risk, regardless of the potential savings in capex and opex.
In subsequent posts, we’ll explore in more detail the technology requirements that must be addressed in designing a true carrier-grade network. For now, I’m just glad that my alarm system shouldn’t be telling me about dial tone problems for at least another year.