This week, Facebook and its subsidiaries Instagram and WhatsApp went down in an outage that lasted nearly 14 hours. Catchpoint, a user experience monitoring company, found that while the widespread outages began around 12 p.m. ET on Wednesday, they were preceded by a “micro-outage” that lasted 20 minutes.
The monitoring company has noticed an increase in these micro-outages: website or application outages that last only briefly, typically less than an hour, and occur in specific geographies. Catchpoint co-founder and CEO Mehdi Daoudi referred to them as a “very laser-focused type of issue somewhere where basically only a portion of users are impacted.”
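As a rough illustration of that definition, the sketch below flags an outage as a micro-outage when it is both brief and limited to some of the monitored regions. It is not Catchpoint's method: the function name, the region labels, and the one-hour default cutoff (taken from "typically less than an hour") are all assumptions made for the example.

```python
from datetime import timedelta

# Assumed cutoff, based on the article's "typically less than an hour".
DEFAULT_MAX_DURATION = timedelta(hours=1)


def is_micro_outage(duration, affected_regions, monitored_regions,
                    max_duration=DEFAULT_MAX_DURATION):
    """Return True if an outage is brief and hits only some monitored regions.

    All names and thresholds here are hypothetical, chosen for illustration.
    """
    affected = set(affected_regions)
    monitored = set(monitored_regions)
    is_brief = duration < max_duration
    # Non-empty proper subset: at least one region is hit, but not all of them.
    is_localized = bool(affected) and affected < monitored
    return is_brief and is_localized


# Hypothetical region labels for a monitoring deployment.
regions = {"us-east", "eu-west", "ap-south"}

# A 20-minute failure seen from one region only.
print(is_micro_outage(timedelta(minutes=20), {"us-east"}, regions))

# A 14-hour failure seen from every region.
print(is_micro_outage(timedelta(hours=14), regions, regions))
```

Because "typically" leaves room for longer incidents, the cutoff is a parameter rather than a fixed rule; a 74-minute regional incident would only qualify with a looser value such as `max_duration=timedelta(minutes=90)`.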
Catchpoint’s monitoring technology measures the performance — based on the speed and reliability — of websites, mobile sites, and applications. The main focus of its technology is to understand the user experience at these sites.
The Facebook incident, said Catchpoint, is an example of a micro-outage that developed into a full-blown, widespread outage. Monitoring tools cannot yet prevent these micro-outages, but identifying them is a step toward solving them more quickly. The goal, said Daoudi, is early detection. In the future, the “trove of data we collect” could help find patterns and relationships that predict when something bad is going to happen, he said.
While widespread outages are more obviously impactful to a company’s bottom line, Daoudi noted that micro-outages can still have major repercussions for an organization’s service level agreements (SLAs) and revenues.
Facebook tweeted after the outage that the incident was “a result of a server configuration change.” There was some speculation that the blackout was caused by a hack or cyberattack, but the company denied that claim.
The Trend Toward Micro-Outages
Daoudi noted that at the advent of the World Wide Web, there were massive internet outages where a provider’s entire network would collapse. And while these still occur, they are less frequent “because things have gotten a lot more resilient and there is a lot more ISP [internet service provider] diversity.”
However, as these major outages have declined, there has been an increase in smaller, focused outages. One example is a Google outage last November, when misconfigurations between Google, Cloudflare, and Nigerian ISP MainOne caused faulty redirects, which in turn took certain Google services down for 74 minutes.
Another example, also last November, impacted AT&T and AT&T wireless users. Those users were unable to access content served by CloudFront’s East Coast content delivery network (CDN) services via Telia.
According to Daoudi, companies are not examining these micro-outages closely enough or giving their impact on the user experience due weight. “They are not spending time monitoring and detecting those small issues,” he said. “I think we’re all getting a little bit lazy.”
Daoudi noted that just “because there are no major internet outages these days, there is abundance of cloud infrastructure, [and] there are a lot of things that prevent or that stop some of these big issues that used to happen in the early days of the internet,” it doesn’t mean companies shouldn’t be looking. “Keep investing because customer experience is very important and it’s critical to keep investing in your early detection systems.”
More Micro-Outages or Better Tools?
One question posed to Daoudi was: Are these micro-outages actually occurring more often, or are the monitoring tools just getting better at seeing them?
He responded that equipment was indeed improving. “When you think about most companies, and specifically the groups or departments of those companies that do monitoring, there is a maturity evolution about what people care about,” Daoudi said.
In the past, monitoring centered around availability — answering the question: Is the network up or down? Then, it became more geared toward performance monitoring — answering the question: Is the application or service fast or slow? Now, Daoudi said companies are spending more time with multiple tools, “trying to understand the relationship between the business, and the IT telemetry, and the metrics that the IT systems are giving them.”
This is why Catchpoint focuses on user experience with its monitoring tools, he said. Micro-outages have a direct impact on the end user, and that impact boils down to access. For a retail site, this could mean the end user is unable to complete a purchase because of a micro-outage; in the AT&T case, users were unable to access any site hosted by CloudFront.
“The customers don’t care if it’s a technical problem or not, all they care about is that brand didn’t provide me with an amazing customer experience,” said Daoudi. “And one of the things [Catchpoint] sees more and more is that a lot of companies realized that in the digital world we live in, customer experience is the product — ultimately, what makes or breaks a brand.”