EXFO Ontology’s Benedict Enweani on Service Assurance Automation
Service provider networks are in a state of flux, with the rise of network function virtualization (NFV), 5G, and cloud strategies all creating a complex, multidomain environment. Against this dynamic backdrop, service assurance and performance management have become even more important as competitive realities place a spotlight on the end-user experience as a differentiator.
To provide service assurance effectively and efficiently, an automated cross-domain topology can improve almost all OSS and BSS functions, and link simultaneous incidents on the network to the services being sold to and consumed by customers.
In this interview, Benedict Enweani, director of business development, systems and analytics at EXFO Ontology, discusses challenges in achieving effective and efficient service assurance in modern service provider networks, as well as the benefits of automation.
Why is it so difficult for service providers to diagnose and troubleshoot problems in their networks?
Benedict Enweani: I think there are probably two main reasons, although there are so many issues that you could probably do an MBA on this.
Firstly, it’s hard to know the true multidomain connectivity of a telco network that has been built up over 10 years or longer. Gathering all of that data in one place is hard, and it’s harder still to keep that knowledge up to date once it’s been collected.
Information about how the network is linked is kept in numerous decentralized, independent sources. Each vendor usually has a different control system for each network technology they sell, and for some of the network layers there may be no real controlling systems at all. So you have to individually gain visibility into tens to hundreds of systems, and potentially thousands to millions of devices, just to understand each island in the multidomain network.
Integrating all of this data together is difficult and has proven to be too hard for existing enterprise database technology. Gaining knowledge about where those silos of technology are located is difficult; understanding the topology of how they are cross-linked is even harder because cross-links may not even be recorded in the network systems. So, you also have to consider other OSS/BSS and file-data sources, or speak to designers and work out complex manual rules that can be used to power inferences that bridge the gap.
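As a minimal sketch of the idea, assuming hypothetical device and rule names: per-vendor inventory "islands" can be merged into one topology graph, with a manual inference rule (the kind worked out with designers) bridging a cross-domain link that no inventory system actually records.

```python
# Hypothetical sketch: merge per-vendor inventory "islands" into one
# topology graph, then apply a manual inference rule to bridge a
# cross-domain link that no controller records.

def merge_inventories(*islands):
    """Union the adjacency maps from several independent inventories."""
    topology = {}
    for island in islands:
        for node, neighbours in island.items():
            topology.setdefault(node, set()).update(neighbours)
    return topology

def infer_cross_links(topology, naming_rule):
    """Add links implied by a manual rule (e.g. a port-naming or
    site-design convention) rather than recorded in any system."""
    for a in list(topology):
        for b in list(topology):
            if a != b and naming_rule(a, b):
                topology[a].add(b)
                topology[b].add(a)
    return topology

# Two vendor islands that never reference each other directly.
optical = {"OTN-1": {"OTN-2"}, "OTN-2": {"OTN-1"}}
ip = {"RTR-1": {"RTR-2"}, "RTR-2": {"RTR-1"}}

# Assumed designer knowledge: RTR-1's uplink rides on OTN-2.
rule = lambda a, b: {a, b} == {"RTR-1", "OTN-2"}

topo = infer_cross_links(merge_inventories(optical, ip), rule)
print("OTN-2" in topo["RTR-1"])  # the inferred bridge between domains
```

In practice this is what a graph store does at scale; the point of the sketch is that the cross-domain edge comes from a rule, not from any single source of record.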
And on top of all of that, networks are live and stuff changes all the time, and much of the network is managed by humans. People make mistakes. For instance, they forget to document their activities or lose information. As a result, all bets are off when someone has to, for example, do an emergency fix or find the historical data needed to efficiently react to a crisis.
The second issue is that it’s hard to know when something is going wrong, or what the cause is, because problems span all layers of the network. Collecting test and KPI data at all layers (Layers 1-7) is a big task, and very few operators have an easy way to bring that performance data together. There are multiple systems collecting experience indicators at each of those different layers.
Then when you try to troubleshoot issues from the edge of the network, you need a fusion of all of that KPI data with a perfect real-time dependency map of the network, in order to analyze the situation using both of these distributed information sources. Typically, humans have to manually correlate the data and use their judgment to put all of that together.
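A minimal sketch of that fusion step, with made-up service and device names: edge KPI measurements are joined against a service-to-device dependency map, and the devices shared by every degraded service become the shortlist of likely fault locations.

```python
# Hypothetical sketch: fuse edge KPI symptoms with a dependency map.
# Services whose KPIs breach a threshold are intersected against the
# devices each service rides over, shortlisting likely fault locations.

dependencies = {             # service -> devices it depends on (assumed)
    "svc-A": {"cpe-1", "agg-1", "core-1"},
    "svc-B": {"cpe-2", "agg-1", "core-1"},
    "svc-C": {"cpe-3", "agg-2", "core-1"},
}
latency_ms = {"svc-A": 310, "svc-B": 290, "svc-C": 42}  # measured at the edge

THRESHOLD_MS = 100
degraded = [s for s, v in latency_ms.items() if v > THRESHOLD_MS]

# Devices common to every degraded service are the prime suspects.
suspects = set.intersection(*(dependencies[s] for s in degraded))
print(sorted(suspects))  # ['agg-1', 'core-1']
```

This is exactly the join that operators today perform by hand, and it only works if the dependency map is accurate and current.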
Why is this process currently so inefficient?
Enweani: OK, that’s easy: People have recognized that manually drawing up topology maps is a pointless task because things are so dynamic in the network. Providers have wanted to automate this for a while, in order to make the process more efficient. In fact, transformation projects were very popular 10 to 15 years ago. However, many of these failed because the change-management, fault-management, troubleshooting and outage teams all use completely different data sources that are too hard to correlate. So right now, many topology searches are being done by humans, and that’s not what we’re best at.
Also, very few systems support the creation and collection of KPI and test data (both active and passive) for all network layers, which means that the data is not centrally analyzed, even if it exists. This drives more integration efforts, and a lot of service providers are looking to Big Data systems to do that, which come with spiraling capital costs.
On top of that, it is very hard to map the collected KPIs and test data to the current topology. Even if an operator has a good view of the network at two or three of the layers, it still has to find a way to associate that information with all of the services that ride on top. Most of the time, it will be a human operator trying to join the dots. That takes skill, knowledge and time and, as with all human processes, it’s error-prone. Only a small number of people can do this, so the timeline is long, and the length of the outages and the amount of suffering involved is greater than it needs to be.
How will automation make network problems easier to detect?
Enweani: Bringing all this together is really only possible now because of recent advances in four technologies: scalable no-SQL graph stores, Big Data storage, semantic rules engines, and (we hope) machine learning.
Scalable no-SQL graph stores bring together multiple network islands in a data structure built for connectivity. Big Data, meanwhile, is now the norm rather than the exception, as is storing that data in a central place so you can make use of it, tapping into the data lake with analytics and advanced tools alike.
Semantic rules engines help determine which patterns are important because of the KPI data correlated to the network. And increasingly, machine-learning algorithms are being applied to the task, as are convolutional neural networks.
Using these tools together gives us enough sensory data about the network to unambiguously detect failures. Even if a failing device doesn’t automatically present itself and send an alarm to the operations team, we have a “good-enough” algorithm which, if fed with accurate topology and sensory data, can proactively indicate why customers are impacted and reveal the probable offending device.
If you feed symptom data into a rules engine, it will alert you to the location where the potential problem could be, and this development represents a fundamental change and an increase in the power that we have to detect problems. What’s really cool is that if we put these elements into an automated loop where we’re constantly scanning the data and finding causes from symptoms, then we can advance the state of the art by reporting the cause, rather than reporting the alarm.
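To illustrate what “reporting the cause, rather than the alarm” can mean, here is a toy rule of the kind such an engine might apply (the rule, service names and devices are all hypothetical): a probable cause should touch every degraded service and, ideally, no healthy one.

```python
# Hypothetical sketch of a "cause, not alarm" rule: among devices that
# every degraded service depends on, exonerate any that also serve a
# healthy service, leaving a probable common cause.

def probable_cause(dependencies, degraded, healthy):
    """Rule: the cause must cover all degraded services and, ideally,
    no healthy ones."""
    covers_all = set.intersection(*(dependencies[s] for s in degraded))
    used_by_healthy = (set().union(*(dependencies[s] for s in healthy))
                       if healthy else set())
    best = covers_all - used_by_healthy
    return best or covers_all  # fall back if nothing is exclusive

dependencies = {             # service -> devices it depends on (assumed)
    "svc-A": {"cpe-1", "agg-1", "core-1"},
    "svc-B": {"cpe-2", "agg-1", "core-1"},
    "svc-C": {"cpe-3", "agg-2", "core-1"},
}

cause = probable_cause(dependencies,
                       degraded={"svc-A", "svc-B"},
                       healthy={"svc-C"})
print(cause)  # {'agg-1'}: core-1 also serves healthy svc-C, so it is exonerated
```

Run in a continuous loop over live symptom data, a rule like this reports one probable device rather than a storm of per-service alarms.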
With the underlying technology to make all of this scalable, putting these four together in a useful way is the key to building a platform for automation.
How is EXFO incorporating automation into its platform roadmap?
Enweani: With our Blueprint 2.0 strategy, we are combining the resources from recent acquisitions, including Astellia (a leading capability in passive probing analytics, machine learning and big data), Ontology (graph-database-powered real-time topology SIA and CCA) and EXFO (active test and measurement at all layers, with real-time analytics, stored in the cloud and accessible on demand).
This is a DevOps-friendly platform specifically designed to implement communications service provider-specific test, service assurance and troubleshooting automation. Many of these capabilities are already deployed. The important thing is to aggregate them and provide meaningful APIs. That information can then be used in end-to-end operations.
We plan to launch Automated Common Cause Analysis (for general availability) by the end of the year, which is the first example of complete end-to-end automation, from measurement to the detection of problems to the proposing of the root cause. It can all be done without humans, and in tight time cycles.
We see this as a strategic capability: Without automating this, how do we deal with the fallout from all of the increased traffic from new services and new network capabilities?
Why should service providers invest in automation tools?
Enweani: There are a couple of drivers for the investment.
First of all, automation lowers opex. Troubleshooting and service assurance are two of the least-automated, most-static processes out there in a world that’s increasingly automated. You can’t just hire people to meet the challenge. It’s hard to see how manually configured service assurance and manual troubleshooting will keep up with massive increases in the number of network clients that are automatically provisioned and configured, and the larger set of services running over the infrastructure.
It’s also likely that the population of devices on the network will grow by a couple of orders of magnitude in the next two to three years, and more users mean more individuals that can experience problems. So it would be unwise to put off improving the sophistication of your service assurance capabilities.
5G and network function virtualization (NFV) are aimed at increasing the number of clients on the network (clients that can experience failure or degradation), and they both drive a heightened rate of network change, thanks to orchestration engines. This, in turn, obscures knowledge of the network structure those clients use, which means that network outages could remain unaddressed for long periods of time.
When it comes to automation, the bottom line is this: service providers have a carrot, which is lowering opex, and a stick, which is the need to service more users amid increasing network complexity and dynamism.