PHILADELPHIA – T-Mobile US continues to be one of the higher-profile telecom operators that rely on Cloud Foundry to support their containerization and virtualization efforts. It also continues to contribute actively back to the Cloud Foundry ecosystem.
During a keynote demonstration of the Monarch chaos engineering tool, Ramesh Krishnaram, senior manager at T-Mobile US, said that the carrier’s work with Cloud Foundry was “easily the largest in the world.” Krishnaram noted that the carrier has more than 100 people on its Cloud Foundry engineering team overseeing more than 39,000 containers, supporting more than 3,000 mission-critical applications, and 700 million daily transactions.
“Everyone here knows that DevOps is a buzzword at many companies, but at T-Mobile it’s no longer a buzzword where we finally believe that you write a piece of code, you code it, and you own it all the way to production,” Krishnaram said.
One of the carrier’s contributions is the Monarch tool. It’s designed to provide more control over a chaos attack that is operating within a container pod.
Karun Chennuri, senior engineer for security architecture at T-Mobile US, explained that the targeted attacks can specifically go after a certain application without impacting adjacent applications.
“That is what we mean by application-level chaos attack,” Chennuri said.
Monarch is based on T-Mobile US’ previous work with the Cloud Foundry App Blocker (cf-app-blocker) plugin for the broader Chaos Toolkit. With the update to Monarch, T-Mobile US is able to more finely target that chaos attack without bringing down an entire application or service.
“We’ve taken an application that belongs to the customer, do a specific, targeted chaos attack on that application and also its dependencies,” Krishnaram said.
Monarch is currently limited to Cloud Foundry, but the carrier is looking to add support for Pivotal, Kubernetes, and VMware.
Chaos engineering is not a new phenomenon. Perhaps the most well-known effort is the Netflix-derived Chaos Monkey open source platform, which can proactively cause failures in random places at random intervals throughout a system.
However, some have noted that Chaos Monkey is not fine-tuned enough to deal with applications running within the same container pod.
“[Chaos Monkey] has been great for generating awareness, but it just randomly breaks things, and I don’t agree with that strategy,” explained Kolton Andrus, who is currently CEO of Gremlin and had previously worked at Netflix on Chaos Monkey. “There was a time for that, but we are more focused on safer testing.”
Gremlin last year updated its proprietary “failure-as-a-service” platform to allow it to re-create common Docker container failures across three categories: resource, network, and state. This lets developers see how the system reacts to failures, validate that defense mechanisms will work to prevent system outages, and minimize the blast radius of testing for safe experimentation in production environments.
“At the host level you can have 20 containers, and if you attack that host and it breaks, you have 20 broken containers,” Andrus said. “We can now just break one container and reduce that blast radius.”
Photo: Karun Chennuri, senior engineer for security architecture at T-Mobile US (right), and Ramesh Krishnaram, senior manager at T-Mobile US (left), speaking during a keynote at this week’s Cloud Foundry Summit 2019.