Back in October of 2017, I could have really used an observability suite.
We had just migrated the whole Cisco developer site, developer.cisco.com, from our in-house managed datacenter space to an AWS region, US West. All the QA, integration, and user acceptance testing had gone without a hitch. SSL certs were applied and working as expected. We went live with the site over a weekend. There were no complaints for a few days, and we thought we had just overseen a completely successful migration.
Then I got a ping. Our VP was showing an SVP the site on their phone. The VP’s phone could bring up the site no problem, but the SVP’s phone just could not resolve the page. Scrambling to figure out what had occurred, we were checking site access logs, database logs, and having everyone on the team hit the site from various devices. No joy. No one internally could replicate the issue. But then we did start to get a trickle of external reports of people experiencing the same failure.
Every day for a week, I poked around the internet trying to figure out just what was causing this corner-case issue. Our engineers were trying to ID where the problem was occurring. Then one day I was having lunch with a colleague, and I asked him to see if he could get to our site from his phone. He could not. I tried on my phone. I could. We literally had the same make and model of phone, so I was scratching my head. We headed back to the office, and he came by a bit later to let me know that he had been able to hit the site with no problem.
Then it dawned on me: at lunch we were both on our mobile carrier’s service, but in the office we were on Wi-Fi. I asked him to turn off Wi-Fi. Now he couldn’t get to the site! Finally, a workable lead. I got to searching and found out that with some mobile carriers and with a particular version of the phone, the combination of SIM settings plus the carrier network configuration was set to only resolve sites that had IPv6 addresses. “That’s funny,” I thought, “we were IPv6 enabled at our old datacenter. Surely AWS is also enabled for IPv6.” Turns out, they were… mostly. They were not for the configuration of VPC we needed to use in the region to which we had migrated.
It took a lift-and-shift of our installation to a different AWS region before the SVP (and other users!) could get to our site.
What I Needed But Did Not Have
You might be asking, “How does this long story relate to full stack observability? Even if they had all the monitoring tools in place, they would’ve still needed the luck to figure this one out.” Granted, this was always going to be a difficult issue to run down. But full stack observability (FSO) would have let us rule out false signals much faster, perhaps even instantaneously. We would not have had to pore over logs or check databases. We wouldn’t have had to do manual traffic checking. Or dig into the code to see what might be occurring. We would have known that those areas were red herrings, and we would have narrowed our focus much more quickly to the client side. We would have been able to see whether the requests were getting to our CDN and where the returns were failing, and arguably with the right tool we might have gotten a feed directly from our VPC that said, “Client cannot resolve IPv4 addresses.”
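As an illustration of the kind of signal that was missing, a synthetic check along these lines could flag a hostname that publishes no IPv6 address. This is only a minimal sketch under stated assumptions, not what any Cisco or AWS tool actually does: the function names `address_families` and `dual_stack_report` are made up for this example, and it relies solely on Python's standard `socket.getaddrinfo`. The answer also depends on the resolver of the machine running it, so it approximates, rather than reproduces, what a phone on an IPv6-only carrier network would see.

```python
import socket


def address_families(host, port=443, resolver=socket.getaddrinfo):
    """Return the IPv4 and IPv6 addresses that `host` resolves to.

    `resolver` is injectable (hypothetical design choice) so the check
    can be exercised without touching the network.
    """
    found = {"ipv4": set(), "ipv6": set()}
    try:
        results = resolver(host, port, proto=socket.IPPROTO_TCP)
    except socket.gaierror:
        # Name did not resolve at all; report both families as empty.
        return found
    for family, _type, _proto, _canonname, sockaddr in results:
        if family == socket.AF_INET:
            found["ipv4"].add(sockaddr[0])
        elif family == socket.AF_INET6:
            found["ipv6"].add(sockaddr[0])
    return found


def dual_stack_report(host, **kwargs):
    """Summarize in one line which address families `host` publishes."""
    found = address_families(host, **kwargs)
    if not found["ipv4"] and not found["ipv6"]:
        return f"{host}: does not resolve at all"
    if not found["ipv6"]:
        return f"{host}: IPv4 only; IPv6-only clients may fail to reach it"
    if not found["ipv4"]:
        return f"{host}: IPv6 only"
    return f"{host}: dual stack"


if __name__ == "__main__":
    print(dual_stack_report("developer.cisco.com"))
```

Run on a schedule and alerted on, a check like this would have turned a week of guessing into a single line pointing at the address family.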
I’ve been in software development for 20 years, and anyone who has been writing (and, more importantly, debugging) code for that long will tell you that the more visibility you have into the code, the easier and quicker it is to find and fix an issue. Today, with the abstracted and layered complexity of applications, finding a fault is often extremely challenging. Throw in microservice architectures, and you have challenges not just with the physical layers impacting the application (network, compute, storage) but also with the virtualized ones, like container volumes. Every single part of an application deployment, from the network, to the client, to the app, has an impact. You need visibility into issues across the full stack.
Applications, and the people who maintain them, are better served when we can see and measure what’s going on, good or bad. If Accounting’s web application is running slow when they’re trying to close out a quarter, is the issue one of network bandwidth, or is it a persistently crashing application node? We should be able to identify that in seconds with a combination of streaming telemetry data from the network and application data from the mesh manager. If we are really savvy, we may even be able to identify faults proactively by feeding in data on situations where we know we might have trouble, like spikes in database hits or user load, both of which would require scaling up pods.
The good news is that observability technologies and tooling keep getting better at providing us deeper insight so we can make better decisions more quickly. With machine learning and AI added to the mix, we’re starting to see self-healing networks, processes, and applications. These tools will give us more time to innovate, and require less time from people trying to figure out why a bigshot can’t access an application.
Unfortunately, there is not (yet) a magic bullet to realize full stack observability. It requires conscientious design and implementation from people working on the network to those coding the applications. This work leads to tooling and instrumentation at various levels, providing the visibility and metrics needed to reach observability. We think it’s worth getting up to speed on the technologies and processes of observability.
To learn more, I recommend planning to stop by The DevNet Zone at Cisco Live US this year (either in person or virtually). You can learn a lot about what Cisco is doing to facilitate full stack observability from network monitoring automation and application insights with AppDynamics, all the way to the content delivery space and the client. Be sure to check out my workshop, Instrumenting Code for AppD, Thursday, June 16 at 9:00am PDT.
I’ll see you at Cisco Live!