
Guiding Principles of Major Incident Management
A Major Incident is defined as an event that has significant impact or urgency and demands an immediate response. The objective of the process is to ensure that, once a major incident has been identified, the Major Incident Manager works with the relevant resolver groups to restore service in a timely manner. Once the major incident has been resolved, a process hand-off occurs to the Problem Management team.
The Major Incident Management Process should define the criteria for the Major Incident Management role, the objectives of the role and the expected outcomes. The process should define what a major outage is, what the various severities are, what the response and restore targets are, what the communications/alerting process is, what the technical and management bridge process is, what the outage review process is, and how the Problem Management Team is engaged. The Process should also define how information should be collected, analyzed, and reported to teams and senior management.
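As a rough illustration of how severities and their response and restore targets might be recorded once the process defines them, here is a minimal sketch. The severity names (`SEV1`, `SEV2`), the target durations, and the helper `restore_target_breached` are all hypothetical; a real process would set its own values.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class SeverityTarget:
    """Response and restore targets for one severity level."""
    respond_within: timedelta
    restore_within: timedelta


# Hypothetical severity levels and targets, for illustration only.
SEVERITY_TARGETS = {
    "SEV1": SeverityTarget(timedelta(minutes=15), timedelta(hours=4)),
    "SEV2": SeverityTarget(timedelta(minutes=30), timedelta(hours=8)),
}


def restore_target_breached(severity: str, raised_at: datetime, restored_at: datetime) -> bool:
    """Return True if restoring service took longer than the restore target."""
    target = SEVERITY_TARGETS[severity]
    return restored_at - raised_at > target.restore_within
```

Keeping the targets in one table makes it easier to report breaches consistently to teams and senior management.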
Key responsibilities of the role typically include:
- Overall accountability for Major Incidents related to their services
- Escalation point for all compliance and technical issues derived from the Major Incident Management Process
- Diagnose and restore service or delegate to appropriate Specialist for all issues in their specialty or area of expertise as assigned
- Ensure Major Incident records are updated with troubleshooting efforts (Activity Updates)
- Ensure Root Cause Analysis is prepared and submitted
- Ensure the level of response reflects the Severity of the Incident
- Attend Major Incident meetings, or delegate attendance, as required and act as an escalation point
- Audit all internal information and recovery steps to ensure consistency, accuracy and proper documentation
- Accountable for any communication required relating to Major Incidents
- Ensure process compliance
- Ensure Support Team attendance at all required meetings
- Ensure IT contact lists are updated and maintained
- Liaise with vendors
The objective of the process should be as follows:
The process should ensure that the Major Incident Management Coordinator is aware of his or her key responsibilities for acting once a major incident has been identified. The Major Incident Management Coordinator should work with the Enterprise Major Incident Management Team, and any others as needed, to resolve the major incident in a timely manner. Once the major incident has been resolved, a process hand-off occurs to the Problem Management Process.
Sample list of benefits:
- The process should focus on the most effective steps to take to coordinate recovery efforts with the Enterprise Major Incident Management Team and resolve the major incident.
- The Process should allow for the collection of Major Incident data and should facilitate and encourage thorough analysis of that data to identify risks and recurring problems.
- The Process should encourage proper documentation of all relevant information in the ticketing system.
- The Process should encourage the right level of quality in all the documentation.
- The Process should encourage the proper use of the ticketing tool.
- The Process should encourage the proper integration with the Enterprise Major Incident Management Team.
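To make the documentation points above more concrete, the following is a minimal sketch of a Major Incident record that keeps a timestamped chronology of activity updates. The class names, field names, and ticket identifier format are hypothetical and are not tied to any particular ticketing tool.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import List, Optional


@dataclass
class ActivityUpdate:
    timestamp: datetime
    author: str
    note: str


@dataclass
class MajorIncidentRecord:
    ticket_id: str  # hypothetical identifier format, e.g. "MI-1"
    summary: str
    updates: List[ActivityUpdate] = field(default_factory=list)

    def log(self, author: str, note: str, at: Optional[datetime] = None) -> None:
        """Append a timestamped activity update (defaults to the current UTC time)."""
        when = at or datetime.now(timezone.utc)
        self.updates.append(ActivityUpdate(when, author, note))

    def chronology(self) -> List[str]:
        """Return all updates as text lines, oldest first."""
        ordered = sorted(self.updates, key=lambda u: u.timestamp)
        return [f"{u.timestamp.isoformat()} {u.author}: {u.note}" for u in ordered]
```

A record like this can be handed to Problem Management after restoration, since the chronology shows who did what and when.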
Sample list of observations:
- Client is bypassing the impact and urgency matrix and escalating lower priority incidents
- Major Incident team is not mobilizing in a timely manner to coordinate the recovery efforts
- Technical teams are not showing leadership in validating their areas; they are waiting for direction on what to do next
- Deep and wide skill sets are missing. Deep: Teams are limited in knowledge and are not able to address issues/changes that are beyond their immediate SS knowledge. Wide: Where multiple technologies intersect, the limits of that knowledge become more evident. There is limited overall leadership/knowledge to drive end-to-end consideration in both change/project work and incident resolution
- Major Incident team is not taking charge of the recovery efforts
- Executive Alerts are not sent in a timely manner to executives or impacted people
- Major Incident team is not documenting the chronology of all actions being taken, the resources working on the recovery efforts, etc. Group chats and call recordings are not being kept consistently
- Major Incident team is not triaging the issue upfront to understand the situation or the impact
- Major Incident team is not assessing the impact of Changes, Releases and any other activity in the environment that could have caused the MI
- Separate Management bridges are not opened consistently to ensure Executives are being updated on progress and impact
- Executives and Business Leaders are joining the technical bridges and delaying the restoration efforts
- Recovery steps have not been documented for each area (SW, HW, NW, etc.)
- Central knowledgebase not available to search if the issue is a known or repeat issue
- Off-hours skills are lacking; it is hard to reach skilled resources to work on service restoration
- Not all MIs are documented, so MI metrics are skewed
- No stable Major Incident Management process is in place; it changes every day
- No training provided on MI process
- Event Management and Monitoring & Alerting Management processes need to be reviewed
- Performance & Capacity Management process needs to be reviewed
- Change/Release/Configuration Management processes need to be reviewed
- Problem Management process needs to be reviewed
- Availability Management process needs to be reviewed
- Lack of coordination and structure around MIs, both during and after the incident
- Lack of trend analysis on MIs
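The first observation in the list above, lower-priority incidents being escalated around the impact and urgency matrix, can be checked mechanically once the matrix is written down. The sketch below assumes a 3x3 matrix with priorities `P1` to `P5` and treats only `P1` as a Major Incident; the level names, the mapping, and both helper functions are hypothetical.

```python
# Hypothetical 3x3 impact/urgency matrix; the mapping is an assumption.
PRIORITY_MATRIX = {
    ("high", "high"): "P1",
    ("high", "medium"): "P2",
    ("medium", "high"): "P2",
    ("high", "low"): "P3",
    ("medium", "medium"): "P3",
    ("low", "high"): "P3",
    ("medium", "low"): "P4",
    ("low", "medium"): "P4",
    ("low", "low"): "P5",
}

# Assumption: only P1 incidents qualify as Major Incidents.
MAJOR_INCIDENT_PRIORITIES = {"P1"}


def derive_priority(impact: str, urgency: str) -> str:
    """Look up the priority the matrix assigns to an impact/urgency pair."""
    return PRIORITY_MATRIX[(impact.lower(), urgency.lower())]


def is_matrix_bypass(impact: str, urgency: str, requested_priority: str) -> bool:
    """Flag a ticket whose requested priority is higher than the matrix supports."""
    derived = derive_priority(impact, urgency)
    # A lower number means a higher priority, so P1 outranks P4.
    return int(requested_priority[1:]) < int(derived[1:])
```

Flagged tickets could then be raised in the operational and governance meetings rather than argued during the incident itself.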
Sample list of recommendations:
- Ensure the MI process is documented, and all parties are made aware.
- Ensure the impact and urgency matrix is defined and shared with all parties. If there are non-compliance issues, address the issues via the daily ops meetings, weekly scorecard meeting, monthly operational meeting, and the monthly governance meeting.
- Ensure that the Major Incident team is identified, with coverage both local to the teams and across the enterprise
- Ensure the MI team has defined coverage across 24/7/365 so that they can mobilize as soon as an MI is raised
- Conduct a skill gap analysis of the technical teams to ensure they understand their support areas and are able to actively participate in recovery efforts
- Ensure the MI team has the right level of leadership skills to take charge and drive the recovery efforts
- Ensure the Executive Alert process is documented and shared with all parties
- Ensure that, at the onset of the MI call, the MI team initiates a group chat to document all actions being taken and observations being made, and starts a call recording for audit and problem management purposes
- Ensure prior to initiating the recovery efforts, the MI team steps back and fully understands the situation and the impact.
- Ensure the Major Incident team assesses the impact of Changes, Releases and any other activity in the environment that could have caused the MI
- Ensure the MI team opens Separate Management bridges to update the Executives on progress and impact
- Ensure the MI team prevents the Executives and Business Leaders from joining the technical bridges and delaying the restoration efforts
- Ensure the Recovery steps are documented for each area (SW, HW, NW, etc.)
- Ensure a Central knowledgebase is available to search if the issue is a known or repeat issue
- Ensure off-hours skills coverage is adequate
- Ensure the documentation for each MI reflects the complete effort taken to restore the service, what worked and what did not
- Ensure Event Management and Monitoring & Alerting Management processes are reviewed to assess the impact that false alerts and incorrectly set alert thresholds are having on MI mitigation or response
- Ensure Performance & Capacity Management process is in place to understand the impact on MI mitigation or response
- Ensure Change/Release/Configuration Management processes are reviewed to assess the impact on MI mitigation or response
- Ensure Problem Management process is in place to identify root causes to avoid repeating MIs
- Ensure Availability Management process is in place so that an availability plan can be built, maintained and executed
- Ensure trend analysis on MIs is conducted as part of the continuous service improvement plan
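As one possible starting point for the trend analysis recommended above, the sketch below counts closed MIs by cause category and reports the categories that recur. The `cause_category` field, the `recurring_causes` helper, and the sample records are hypothetical and only illustrate the idea.

```python
from collections import Counter


def recurring_causes(incidents, min_count=2):
    """Return (cause category, count) pairs seen at least min_count times, most frequent first."""
    counts = Counter(incident["cause_category"] for incident in incidents)
    return [(cause, n) for cause, n in counts.most_common() if n >= min_count]


# Illustrative sample records, not real data.
sample = [
    {"ticket_id": "MI-1", "cause_category": "change"},
    {"ticket_id": "MI-2", "cause_category": "capacity"},
    {"ticket_id": "MI-3", "cause_category": "change"},
]
print(recurring_causes(sample))  # [('change', 2)]
```

Categories that keep reappearing are natural candidates for Problem Management root cause work.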