Something is not working. New reports are coming in, and every second more people are asking what’s going on and when it will be fixed. Here's what we do at UserZoom.
Incidents happen every now and then. I’m sure we can all agree that having a pre-prepared set of steps to follow is necessary because, in the heat of the moment, it can make a real difference in how the situation evolves.
Here are a few of the things that we do at UserZoom...
The first step of any incident management process is the declaration. Someone or something has to realize an issue is happening and declare the incident so the team can start working on solving it.
Constantly improving the methods available to get a faster and more efficient declaration is, and should be, a never-ending effort. Things like automatic tests that run daily checking main features, alerts ready to go off when detecting anomalies in the infrastructure, even a coordinated review of the reports coming into the Support Department to notice patterns and ring the alarm... the faster you notice it, the faster you solve it.
It’s also worth mentioning that the more reliable methods you have to detect and notice an incident, the fewer false alarms will be raised through human error. No one wants to be the team that keeps “crying wolf” when there’s nothing really happening. The only way to get everyone to agree on an emergency plan is to use it only for emergencies.
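As a rough illustration of the kind of automated check described above, here is a minimal Python sketch that watches the number of support reports per interval and only suggests declaring an incident after several anomalous intervals in a row, so a single blip does not "cry wolf". This is not our actual tooling: the class name `ReportSpikeDetector`, the window size, the threshold and the idea of using a simple z-score are all assumptions made for the example.

```python
from collections import deque
from statistics import mean, stdev


class ReportSpikeDetector:
    """Hypothetical detector: flags a possible incident on report spikes."""

    def __init__(self, window=12, z_threshold=3.0, consecutive=2):
        self.history = deque(maxlen=window)  # recent "normal" counts
        self.z_threshold = z_threshold       # how unusual a count must be
        self.consecutive = consecutive       # anomalous intervals required
        self.breaches = 0

    def observe(self, count):
        """Record one interval's report count; return True to declare."""
        if len(self.history) < 3:
            # Not enough data yet to know what "normal" looks like.
            self.history.append(count)
            return False
        baseline = mean(self.history)
        spread = stdev(self.history) or 1.0  # avoid dividing by zero
        is_anomaly = (count - baseline) / spread > self.z_threshold
        if is_anomaly:
            self.breaches += 1
        else:
            self.breaches = 0
            self.history.append(count)  # only learn from normal intervals
        return self.breaches >= self.consecutive


detector = ReportSpikeDetector()
quiet_day = [detector.observe(c) for c in [4, 5, 6, 5, 4, 5]]
first_spike = detector.observe(30)   # one anomalous interval: wait
second_spike = detector.observe(32)  # second in a row: declare
```

Requiring two anomalous intervals in a row trades a little detection speed for fewer false alarms; where to set that balance depends on how critical the feature is and how noisy the signal tends to be.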
Once it’s official, though, the eternal discussion of 'who should get involved' starts. When an incident happens, a lot of people expect feedback and updates about what is going on (as they should) so they can manage expectations with customers and all sorts of stakeholders; but it’s not efficient for them to be part of the focused group that is working on the solution.
I personally identify two separate efforts that need to happen in parallel and, in my opinion, can’t be done by the same people: communicating and fixing the issue.
Those duties need to be split into different roles. It is essential to have a designated person not involved directly in the fixing process, but someone who is informed and present in the room, who can act as the go-between among the involved groups. That person is the gateway for all questions and coordinates the updates while keeping the rest of the group focused on the immediate problem and solution.
Regarding the communication, however, a different question presents itself: “How much do I say, and how detailed should I be?” At the beginning of most incidents, no one has a clear picture of what’s going on or how large the impact is. As the person responsible for communicating with stakeholders, it’s vital to convey a certain amount of confidence in the message. The tension goes up when something bad happens, and it’s important to be clear about what's happening without over-alarming anyone.
Everyone tenses when there’s an incident. Things are happening fast, and depending on how critical an issue is, every minute counts when driving towards a possible solution. Stressful situations are not the easiest to navigate when there are several people involved, which is the reason I believe that a big (if not the biggest) part of the process of managing an incident is adjusting it to the actual people involved in it.
Patience, pragmatism, logical thinking and even a touch of humour are essential soft skills that make all the difference. Everyone wants to get to the end of the incident as friends :)
The incident is resolved. There’s no longer any panic and customers are happy again. But the work is not finished! Maybe even more important than the actual incident management is how you handle the postmortem. Understanding what happened and how to prevent it in the future is a vital part of the process, and the one that makes it far less likely that the same issue will happen again.
In my opinion, it’s important to go over where we got lucky during the incident. What happened that made us notice it before it became worse, and what knowledge was needed to fix the issue? Is it documented? Would someone else find it easily next time? Is the error something we can prevent? Or is it something we will simply need to stay alert for and try to catch earlier next time?
Every team, company and department works differently, but my conclusion is that to build an efficient incident management process, each of us needs to check what helps and what doesn’t in critical situations for our own team.
Find everything that makes a tense, high-pressure situation work better, try to enforce it, and make a process out of it. Slowly shaping the right dynamics for situations like this makes a big difference in efficiency and shortens the time to resolution.
“Teamwork makes the dream work” they say… and it’s true that handling an incident and learning from it requires everyone’s effort and collaboration. There are so many skills involved in incident management (communication, technical expertise, pragmatism, attention to detail) that no one person can handle it alone. As with so many other things in life, I believe that finding the right group of people willing to help you with the effort is the key to success.