Ever woken up at night with an alarm ringing close to your ears? I have. Several times. The network was down, applications were not working, and in that half-asleep state I had to spend hours investigating before finding a fix. Let me tell you my story and how all of this can be prevented.
Imagine a normal day in your life as an engineer, the way I have experienced it with my own eyes over the last few years while working in rather diverse environments at various customer locations.
A new project starts and, as expected, I find myself involved in the design phase: whiteboarding with others, defining what needs to be done and, last but not least, the part I love the most: getting new gadgets (the network equipment), setting everything up and playing with them to discover how they work.
It never takes me long to get the setup up and running, but the next step is always the one that consumes most of my time, and frankly not the one I love the most: migrating the legacy infrastructure. This means slowly moving services off the old platform onto the new one, testing, night work during weekends, making sure any outage is minimised and, in the end, having both developers and server admins happy with the new datacenter platform.
When all of this is finally behind me, I start thinking about the fun part and what experiments I can run, for example: “try out some automation, play in the lab, see how the new DevOps model works and what Git CI/CD is all about”.
All I can say is that this master plan has almost never worked as expected. Let me tell you why, and maybe you can relate to my situation.
How many of us have seen something like this suddenly come to pass: everything is running fine and you are building something up, until a call comes in: “our systems are down, application X cannot reach application Y anymore, our whole finance department cannot process payments”… and so begins the time-consuming detective work of debugging.
I connect to each device, collect logs, look for faults and errors, put my conclusions in a document, synchronise the timestamps at which the errors happen and maybe index them in solutions such as Splunk or the ELK stack. It often costs me hours of work, headaches, pressure from upper management, all eyes on me and the feeling of searching for a needle in a haystack.
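The tedious middle step here is lining up logs from several devices onto one timeline. A minimal sketch of that manual correlation, assuming hypothetical device names and a simplified log format:

```python
# Merge log lines from several devices into one chronological timeline.
# Device names, messages and the timestamp format are illustrative only.
from datetime import datetime

LOGS = {
    "leaf-101":  ["2021-03-02 01:14:05 %ACL-DENY src 10.0.0.5 dst 10.0.1.9"],
    "spine-201": ["2021-03-02 01:13:58 %BGP-NEIGHBOR-DOWN 10.0.255.2"],
}

def merged_timeline(logs):
    events = []
    for device, lines in logs.items():
        for line in lines:
            # First 19 characters hold the timestamp, the rest is the message
            ts = datetime.strptime(line[:19], "%Y-%m-%d %H:%M:%S")
            events.append((ts, device, line[20:]))
    return sorted(events)  # chronological order across all devices

for ts, device, msg in merged_timeline(LOGS):
    print(ts, device, msg)
```

This is exactly the kind of grunt work that log indexers, and tools like Nexus Insights, are meant to take off your hands.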
What if I told you there exists a way to avoid all this with Nexus Insights?
What if I told you that you could have a simpler life and easily get back to the engineering work you love, instead of spending endless hours searching for that proverbial needle?
This is where Nexus Insights comes into play: it lifts the troubleshooting effort off your shoulders and lets you focus on the more important tasks. What is it exactly, and what is the recipe behind it?
You first need a server appliance: the Nexus Dashboard. Simply put, it is a server cluster (built on the well-known UCS servers) on top of which the Cisco Day2Ops suite (Nexus Insights, Network Assurance Engine, MSO, etc.) runs in containerised fashion, and soon your own software as well, as long as you package it in a form suitable for a Kubernetes (K8s) environment.
Now you have your Nexus Dashboard, and it comes with a GUI. But what can you do with it, you may ask?
First of all, Nexus Insights can be downloaded from our site and installed onto it. It looks like this:
Nexus Insights offers you two main sets of functionality:
- Awareness of bugs, security issues and forwarding-plane state, plus TAC interaction and upload of tech-support files
- Real-time monitoring of your environment, with preventive alerting, machine-learning identification of trends for future outage prevention, flow visibility, log correlation, and granular troubleshooting to isolate exactly where an issue happened
Last but not least, it integrates with AppDynamics for end-to-end visibility into where a potential software problem might lie. For example, for a web app that suddenly experiences delays between the frontend and the database backend, it can answer the following questions: Is it the application code? Is it the load on the database server? Is it the network being congested and delaying packets?
If you are like me, you are probably more interested in seeing what you can do with such a solution than in reading product descriptions. Allow me to show you three simple use cases (3 is the magic number, right?) that I stumbled upon in my daily life and that I think are quite common in any datacenter where several people work on the same network and make periodic changes.
Scenario 1 – the colleague who plays with security rules:
Let’s say Albert and I are network engineers, or DevOps engineers in the new terminology. Albert makes a network change to a filter inside an ACI contract (plainly said, he removes the rule that allowed one server to access another on port 39804).
At first I get some complaints from end users and, not being sure where to start, I open the Network Insights interface to trace the anomaly (a deviation from normal access to the service) inside the real-time flows. I then correlate it with the user action that caused it and isolate the policy change:
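For the curious: in ACI terms, the rule Albert removed is a single filter entry (`vzEntry`) object. As a hedged sketch, with entirely hypothetical tenant and filter names, restoring it could look like this APIC REST payload:

```python
# Sketch of the APIC REST payload that would restore the removed filter rule.
# The tenant, filter and entry names are hypothetical examples.
import json

tenant, flt, entry = "Finance", "app-to-db", "tcp-39804"

payload = {
    "vzEntry": {
        "attributes": {
            "dn": f"uni/tn-{tenant}/flt-{flt}/e-{entry}",
            "name": entry,
            "etherT": "ip",        # match IP traffic
            "prot": "tcp",         # TCP protocol
            "dFromPort": "39804",  # destination port range start
            "dToPort": "39804",    # destination port range end
        }
    }
}

# This JSON would be POSTed to the APIC, e.g. /api/mo/uni.json
print(json.dumps(payload, indent=2))
```

The point of Network Insights here is that you do not have to reverse-engineer which of these objects changed; the anomaly is correlated to the change event for you.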
Scenario 2 – NAS performance is slow:
This one I have experienced first-hand, and more than once. The developer team calls about file-transfer performance being slow, while the systems team insists this is not a problem with the NAS device but rather a network issue. I never like playing this ping-pong game, as it usually ends up being time-consuming and delays my other tasks.
To get to the bottom of it, I use Network Insights and notice a microburst event. SNMP normally misses these events: a burst that lasts only milliseconds simply disappears inside a graph built from 5-minute averages, whereas the hardware telemetry, flow visibility and real-time push-based statistics in Network Insights can capture it.
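The arithmetic behind this is simple. As an illustration (a one-second burst, though real microbursts are far shorter), a link running at full line rate for one second inside a 5-minute polling window barely registers on the averaged graph:

```python
# Why 5-minute SNMP averages hide microbursts: a link saturated for one
# second but idle the rest of the window looks almost empty on average.
LINE_RATE_BPS = 10_000_000_000  # 10 Gbit/s link
WINDOW_S = 300                  # 5-minute SNMP polling window
BURST_S = 1                     # burst duration at full line rate

bits_in_window = LINE_RATE_BPS * BURST_S          # traffic only during burst
avg_util = bits_in_window / (LINE_RATE_BPS * WINDOW_S)

print("Peak utilisation during burst: 100%")
print(f"5-minute average shown:        {avg_util:.2%}")
```

A 0.33% average would never trigger a utilisation alarm, yet during that one second the link was dropping or delaying packets, which is exactly what the NAS clients felt.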
Finally, after solving these two issues, I go back to my engineering tasks: seeing how Terraform and Ansible perform in our network for automation and whether I can incorporate them into a CI/CD flow. This was my initial focus when the project started, after all.
The miracle of working undisturbed does not last long, and I get notified again that something seems to be misbehaving.
Scenario 3 – The malfunctioning application:
Now there is latency between the web and database components of a newly deployed application. I have my doubts about the code quality, and I suspect potential software issues around VM resource allocation. Still, I decide to give it a quick check in Network Insights before passing the hot potato to other departments, as I have done in the past.
Inside the tool I notice that application awareness is built in via the AppDynamics integration. I open the corresponding section in Network Insights and analyse the latency reports between the web component and the database one.
I check whether this issue affects just one communication flow or all the traffic between the two. I then dive into the Event Correlation part, only to notice that my colleague was testing a “Dataplane policer” and forgot to remove it.
I correct this behaviour (remove the policer) and get feedback that the application is back to normal. As usual after these situations, I approach my teammate over coffee, because now he owes me one :).
Want to take a quick look into how this works?
The troubleshooting part is over. I can finally resume my activity with no more disturbances, and I’m happy that I was able to isolate all three issues so quickly, without taking days to:
- Dig through logs
- Correlate events
- Try to reproduce the symptoms in a lab
- Deal with management pressure due to impacts to production capabilities and daily progress.
Finally, I’m back to my usual playground: experimenting with CI/CD, evaluating ways to expand our topology into the cloud, scaling the existing setup, and playing around with Infrastructure as Code.
At the same time, I am happy to have avoided a whole day of troubleshooting and reverse-engineering flows. Most of all, I did not have to pass the problem around from one department to another for analysis.
What you’ve just read is one of the super-powers that Network Insights gives the network/DevOps engineer, but that’s not all it can do.
If you want to learn more, see NI in action, understand what makes the system tick and see for yourself the value it provides in finding problems before trouble manifests, then click here and explore the resources that dCloud has to offer:
Nexus Dashboard for ACI with Nexus Insights
I did not mention this one earlier, but if you’d like to check your configuration for errors, run a pre-change analysis or define compliance criteria, then you’re in the right place with NAE (Network Assurance Engine):