GROUP A - Major Incident Management

July 11, 2023

A Major Incident is defined as an event that has a significant impact or urgency and demands an immediate response. The Major Incident Management Process should define the criteria for the Major Incident Management role, the objectives and expected outcomes of the role, and how to efficiently manage a Major Incident. The process should define what constitutes a major outage, what the severity levels are, what the response and restore targets are, what the communications/alerting process is, what the technical and management bridge process is, what the outage review process is, and how the Problem Management Team is engaged. The Process should also define how information should be collected, analyzed, and reported to teams and senior management. Typical responsibilities of this role include:

  • Hold overall accountability for Major Incidents related to their services
  • Act as the escalation point for all compliance and technical issues derived from the Major Incident Management Process
  • Diagnose and restore service, or delegate to the appropriate Specialist, for all issues in their assigned specialty or area of expertise
  • Ensure Major Incident records are updated with troubleshooting efforts (Activity Updates)
  • Ensure a Root Cause Analysis is prepared and submitted
  • Ensure the level of response reflects the Severity of the Incident
  • Attend or delegate Major Incident meetings as required and act as an escalation point
  • Audit all internal information and recovery steps to ensure consistency, accuracy, and proper documentation
  • Own any communication required relating to Major Incidents
  • Ensure process compliance
  • Ensure Support Team attendance at all required meetings
  • Ensure IT contact lists are updated and maintained
  • Liaise with vendors
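
The process description above calls for defined severity levels and response/restore targets. As a minimal sketch, the example below shows one way an impact-and-urgency matrix could be encoded so that the Service Desk and the MI team derive the same severity for the same event. The levels, labels, and minute targets are illustrative assumptions, not values prescribed by this process.

```python
# Minimal sketch of an impact/urgency matrix. The levels and targets below
# are placeholders; the real values must come from the agreed MI process.

IMPACT_LEVELS = ("high", "medium", "low")
URGENCY_LEVELS = ("high", "medium", "low")

# (impact, urgency) -> severity. In this sketch, Sev 1 and Sev 2 count as Major Incidents.
PRIORITY_MATRIX = {
    ("high", "high"): 1,
    ("high", "medium"): 2, ("medium", "high"): 2,
    ("high", "low"): 3, ("medium", "medium"): 3, ("low", "high"): 3,
    ("medium", "low"): 4, ("low", "medium"): 4, ("low", "low"): 4,
}

# Hypothetical (respond, restore) targets per severity, in minutes.
TARGETS = {1: (15, 240), 2: (30, 480), 3: (60, 1440), 4: (240, 4320)}


def classify(impact: str, urgency: str) -> dict:
    """Return severity, MI flag, and response/restore targets for an event."""
    if impact not in IMPACT_LEVELS or urgency not in URGENCY_LEVELS:
        raise ValueError(f"Unknown impact/urgency: {impact}/{urgency}")
    severity = PRIORITY_MATRIX[(impact, urgency)]
    respond, restore = TARGETS[severity]
    return {
        "severity": severity,
        "is_major_incident": severity <= 2,
        "respond_within_min": respond,
        "restore_within_min": restore,
    }


if __name__ == "__main__":
    print(classify("high", "medium"))
    # -> severity 2, treated as a Major Incident in this sketch
```

Publishing the matrix in an agreed, unambiguous form (a table, or something executable like this) also makes it harder to bypass, which is one of the observations noted later in this article.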

Process Objective:

The process objective is to ensure that, when a major incident has been identified, the Major Incident Manager will work with the various resolver groups to restore the service in a timely manner. Once the major incident has been resolved, a process hand-off occurs to the Problem Management team.

The process objective also ensures that the Major Incident Management Coordinator is aware of his or her key responsibilities and acts as soon as a major incident has been identified. The Major Incident Management Coordinator should work with the Enterprise Major Incident Management Team, and any others as needed, to resolve the major incident in a timely manner.

Sample List of Benefits:

  • The process should focus on the most effective steps to take to coordinate recovery efforts with the Enterprise Major Incident Management Team and resolve the major incident.
  • The Process should allow for the collection of Major Incident data and facilitate and encourage thorough analysis of that data to identify risks and recurring problems.
  • The Process should encourage proper documentation of all relevant information in the ticketing system.
  • The Process should encourage the right level of quality in all the documentation.
  • The Process should encourage the proper use of the ticketing tool.
  • The Process should encourage the proper integration with the Enterprise Major Incident Management Team.

Sample List of Observations:

  • The client is bypassing the impact and urgency matrix and escalating lower priority incidents
  • The Major Incident team is not mobilizing in a timely manner to coordinate the recovery efforts
  • Technical teams are not showing leadership in validating their own areas; they look for direction on what they need to do next
  • Deep and wide skill sets are missing. Deep: the team's knowledge is limited and it cannot address issues/changes beyond its immediate SS knowledge. Wide: where multiple technologies intersect, the limits of that knowledge become more evident. There is limited overall leadership/knowledge driving end-to-end consideration in both change/project work and incident resolution
  • The Major Incident team is not taking charge of the recovery efforts
  • Executive Alerts are not sent in a timely manner to executives or impacted people
  • The Major Incident team is not documenting the chronology of all actions taken, the resources working on the recovery efforts, etc.; group chats and call recordings are not captured consistently
  • Major Incident team not triaging the issue upfront to understand the situation or the impact
  • Major Incident team not assessing the impact of Changes, Releases and any other activity in the environment that could have caused the MI
  • Separate Management bridges are not opened consistently to ensure Executives are being updated on progress and impact
  • Executives and Business Leaders are joining the technical bridges and delaying the restoration efforts
  • Recovery steps have not been documented for each area (SW, HW, NW, etc.)
  • A central knowledge base is not available to check whether the issue is a known or repeat issue
  • Off-hour skills are lacking; it is hard to reach skilled resources to work on service restoration
  • Not all MIs are documented, so MI metrics are skewed
  • The Major Incident Management process is not in place and is changing every day
  • No training was provided on the MI process
  • Event Management and Monitoring & Alerting Management process needs to be reviewed
  • Performance & Capacity Management process needs to be reviewed
  • Change/Release/Configuration Management processes need to be reviewed
  • The Problem Management process needs to be reviewed
  • The Availability Management process needs to be reviewed
  • Lack of coordination and structure around MIs, both during and after the incident
  • Lack of trend analysis on MIs

Sample List of Recommendations:

  • Join major incident bridge lines and quickly gain a core understanding of the issue in flight, in order to structure workstreams and control the flow of communication so that focus is maintained on driving service restoration.
  • Ensure all incidents have clearly defined business impact statements and communicate with stakeholders and/or executives through multiple streams (persistent chat, email, bridgeboard) to provide the appropriate level of detail for awareness and collaboration.
  • Facilitate investigation by engaging the required technical or business resources to assist in the triage of a major incident.
  • Demonstrate an understanding of impact and severity, to escalate for management attention as required through succinct and well-articulated executive-level communication.
  • Leverage monitoring platforms to identify potential causes or solutions to incidents from past occurrences.
  • Participate in planned data center/site isolations tests for critical sites.
  • Observe lower-severity incidents and ensure the right level of focus is given to restoring services, to avoid them escalating into a Major Incident.
  • Send periodic incident communication to stakeholders, including executives, as and when required.
  • Ensure the MI process is documented, and all parties are made aware.
  • Ensure the impact and urgency matrix is defined and shared with all parties. If there are non-compliance issues, address the issues via the daily ops meetings, weekly scorecard meetings, monthly operational meetings, and the monthly governance meeting.
  • Ensure that the Major Incident team is identified, both local to the teams and with enterprise coverage
  • Ensure the MI team has defined 24/7/365 coverage so that it can mobilize as soon as an MI is raised
  • Conduct a skill gap analysis of the technical teams to ensure they understand their support areas and are able to actively participate in recovery efforts
  • Ensure the MI team has the right level of leadership skills to take charge and drive the recovery efforts
  • Ensure the Executive Alert process is documented and shared with all parties
  • Ensure that, at the onset of the MI call, the MI team initiates a group chat to document all actions taken and observations made, and starts a call recording for audit and problem management purposes
  • Ensure that, prior to initiating the recovery efforts, the MI team steps back and fully understands the situation and the impact.
  • Ensure the Major Incident team assesses the impact of Changes, Releases and any other activity in the environment that could have caused the MI
  • Ensure the MI team opens Separate Management bridges to update the Executives on progress and impact
  • Ensure the MI team prevents the Executives and Business Leaders from joining the technical bridges and delaying the restoration efforts
  • Ensure the Recovery steps are documented for each area (SW, HW, NW, etc.)
  • Ensure a Central knowledge base is available to check whether the issue is a known or repeat issue
  • Ensure off-hour skill coverage is adequate
  • Ensure the documentation for each MI reflects the complete effort taken to restore the service, what worked and what did not
  • Ensure the Event Management and Monitoring & Alerting Management processes are reviewed to assess the impact that false alerts, and alert thresholds that are not set correctly, are having on MI mitigation or response
  • Ensure a Performance & Capacity Management process is in place to understand its impact on MI mitigation or response
  • Ensure Change/Release/Configuration Management processes are reviewed to assess their impact on MI mitigation or response
  • Ensure a Problem Management process is in place to identify root causes and avoid repeat MIs
  • Ensure an Availability Management process is in place so that an availability plan can be built, maintained, and executed
  • Ensure trend analysis on MIs is conducted as part of the continual service improvement plan (a minimal reporting sketch follows this list)
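
As a concrete illustration of that last recommendation, the sketch below computes monthly MI volume and average MTTR from exported ticket data. The column names (opened_at, resolved_at, severity) and the CSV format are assumptions about the export, not fields mandated by any particular ticketing tool.

```python
# Minimal trend-analysis sketch, assuming a CSV export of incident tickets
# with opened_at, resolved_at, and severity columns (names are illustrative;
# adjust to the actual ticketing tool's export format).
import pandas as pd


def mi_trend_report(csv_path: str) -> pd.DataFrame:
    df = pd.read_csv(csv_path, parse_dates=["opened_at", "resolved_at"])
    # Keep only Major Incidents (Sev 1 and Sev 2 in this sketch).
    mi = df[df["severity"].isin([1, 2])].copy()
    # Time to restore in hours, and the month the incident was opened.
    mi["mttr_hours"] = (mi["resolved_at"] - mi["opened_at"]).dt.total_seconds() / 3600
    mi["month"] = mi["opened_at"].dt.to_period("M")
    # Monthly volume and average MTTR feed the continual improvement review.
    return mi.groupby("month").agg(
        mi_count=("mttr_hours", "size"),
        avg_mttr_hours=("mttr_hours", "mean"),
    )


if __name__ == "__main__":
    print(mi_trend_report("major_incidents.csv"))  # hypothetical export file
```

Even a report this simple answers several of the assessment questions below (monthly MI volume, average MTTR by month) and gives the Problem Management hand-off something to trend against.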


Assessment Questions:


  • Do you send out an Executive Alert (define the list - internal/external)? Is the process defined? Is it executed consistently?
  • Provide updates every X mins (15-30 Mins) – Is the process defined? Is it executed consistently? How is the quality of the updates? Is the list of Internal and External participants defined?
  • Document all actions taken (chronology of events) and by whom – Is someone leading/coordinating the call to ensure all actions are being tracked? Is this consistently executed?
  • Document all persons involved – Do all teams show up on time? Is someone documenting when they joined, from what team, and what they are doing? Is this done consistently? Do you receive the necessary support from the people joining? Do they have the necessary technical skills? Are they aware of the client environment?
  • Ensure Ticket is being updated with all the necessary information at closure of the incident – does someone ensure that the ticket is updated with all relevant information? Is this executed consistently? Are QTMs performed to assess for quality?
  • Open a technical bridge and ensure all participants are invited? Are technical bridges opened in a timely manner? Are the relevant people invited? Do they join on time? Is the client on the technical bridge? Does the client impact restoration efforts?
  • Open a chat bridge and ensure all observations, hypotheses, and actions are documented. Is a WebEx/Lync/Sametime/Skype chat bridge opened to document all discussions, observations, and actions?
  • Open a management bridge to keep executives updated. Is a management bridge opened to provide updates to senior executives? Who opens the Mgmt. bridge? Who is invited to the bridge? How are the updates provided? How often is the bridge opened to provide updates? Is this done consistently?
  • Obtain a topology of the environment (HW/SW/NW – High-Level and Low-Level design documents). At the start of the MI, whether it’s an Application Outage, Service Outage, or Infrastructure outage, is there a topology diagram used to understand how everything is connected? Does someone lead the troubleshooting efforts by identifying all potential causes and conducting health checks on or along each connection? If documentation is not available, are requests submitted to have the documentation built?
  • Do you use whiteboarding to document observations, presumptive causes, actions underway, areas of pursuit, etc.? Is a whiteboarding tool used to visualize the environment, actions taken, actions planned, hypotheses, etc.?
  • Do you track all actions with owners and dates? Does someone track all actions with owners and dates? Are all actions coordinated to avoid collisions?
  • Do you coordinate the discussions? Does someone lead the call to understand the issue, identify presumptive causes, pursue those causes to find the root cause, and restore the service? Is this consistently done? Have there been any complaints about the effectiveness of the leadership?
  • Do you engage global resources for additional assistance? When the restoration efforts are stalling, are global resources/vendors engaged? Is there a process to engage them? Is the process documented with contact information? Is a DPE engaged when progress is not being made and the SLA is at risk of being missed?
  • Do you have a contact list of all support personnel (on-call, escalation contacts)? Is the list current? Where is it maintained? Is everyone aware of its location? Does everyone have access to it? (A minimal contact-list sketch follows this list.)
  • Is the Change/Release Management process functioning as designed? Are there any issues? Is the process causing MIs?
  • Is the Event Management process functioning as designed? Are there any issues? Are any of the outages taking place because people have not responded to the events that have been generated alerting teams of impending issues?
  • Is the Monitoring & Alerting process functioning as designed? Are there any issues? For the application/service/infrastructure components that are failing, were they monitored and were the thresholds appropriately set?
  • Is the Performance & Capacity Management process functioning as designed? Are there any issues? For the application/service/infrastructure components that are failing, were the performance and capacity requirements being monitored to ensure that there were sufficient resources to meet the demand?
  • Is the Firmware/Patch/Currency Management process functioning as designed? Are there any issues? For the application/service/infrastructure components that are failing, were they failing due to firmware or patch levels not being at the recommended levels? Were they failing due to the age/EOL status of the systems?
  • Is the Business Controls Management process functioning as designed? Are there any issues? For the application/service/infrastructure components that are failing, were they failing due to unauthorized access issues?
  • Do you ensure an Outage Review is prepared and submitted within 24 to 48 hours, so you can properly hand off to the Problem Manager?
  • Do you work with Clients and Business Units to share information about the volume, type, and cause of Major Incidents?
  • Do you have an escalation point for all compliance and technical issues derived from the Major Incident Management Process?
  • Are problem records opened for all Sev 1s and/or 2s?
  • Are you given time to diagnose the Major Incident and implement solutions?
  • Are you pressured by business or senior management during an MI?
  • Do you obtain Customer concurrence of issue resolution prior to closure of Major Incident?
  • Do you receive Major Incidents via the Service Desk, other support groups, Operations, an alert or phone call, or a shoulder tap?
  • Does the MI team continue to be involved with the Problem record or do they stop once the Problem Management team takes over?
  • Do you know or have access to the topology of all CIs in the environment? If there is a product outage do you know what Applications, Middleware, Databases, Storage, Servers and Network components support that Product?
  • How often are Major Incidents Upgraded or Downgraded?
  • What are the Major Incident SLAs? Do you know what they are?
  • What are the top 10 drivers for Major incidents for the current and previous year?
  • What are the top 10 teams who generate Major Incidents for the current and previous year?
  • What are the top Clients/Business units affected by the Major Incidents?
  • What is the average MTTR for the current year and previous year by month?
  • What is the average response time for the Major Incidents for the current and previous year?
  • What is the volume of MI records for the current year and previous year by month?
  • Do you conduct audits of all internal information and recovery steps to ensure consistency, accuracy and proper documentation?
  • Do you partner with the Quality Management Coordinator – Ticket Quality Management to review a random sample of Major Incidents and assess for quality?
  • Do you provide feedback to Service Desk on areas of improvement?
  • Do you share findings/observations with the Communications Coordinator via emails, newsletters, or team meetings on trends and improvement areas?
  • Does anyone on the team conduct any trend analysis of the Major Incidents?
  • Does anyone on your team review the Major Incidents on a regular basis for quality, accuracy, and completeness of information?
  • What quality management frameworks do you use on your team to improve the Major Incident Management process? DPP? PBA?
  • Do you work with Knowledge Management Coordinator to create new or update existing knowledgebase documents?
  • Do you work with Reporting and Measurements / Continual Improvement Coordinator to review and analyze Major Incident reports?
  • Do you work with Training Management Coordinator to address any skill gaps in Major Incident Handling?
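
To illustrate the contact-list question above, the sketch below keeps on-call and escalation contacts as structured data with a simple lookup helper. The teams, addresses, numbers, and rotation rule are all hypothetical; the point is that the list lives in one maintained, searchable place rather than in individual inboxes.

```python
# Minimal sketch of a maintained support contact list, assuming a simple
# per-team structure with a weekly on-call rotation and an escalation chain.
# All teams, names, and numbers below are placeholders.
from datetime import datetime, timezone
from typing import Optional

CONTACTS = {
    "network": {
        "on_call_rotation": ["nw.oncall.a@example.com", "nw.oncall.b@example.com"],
        "escalation": ["nw.lead@example.com", "nw.manager@example.com"],
        "bridge_number": "+1-555-0100",
    },
    "database": {
        "on_call_rotation": ["db.oncall.a@example.com"],
        "escalation": ["db.lead@example.com"],
        "bridge_number": "+1-555-0101",
    },
}


def current_on_call(team: str, now: Optional[datetime] = None) -> dict:
    """Return the current on-call contact and escalation chain for a team."""
    entry = CONTACTS[team]
    now = now or datetime.now(timezone.utc)
    week = now.isocalendar()[1]  # ISO week number selects the rotation slot
    rotation = entry["on_call_rotation"]
    return {
        "team": team,
        "on_call": rotation[week % len(rotation)],
        "escalation": entry["escalation"],
        "bridge_number": entry["bridge_number"],
    }


if __name__ == "__main__":
    print(current_on_call("network"))
```

Keeping the list in one versioned location answers the maintenance, awareness, and access questions above, and gives the MI team a single source to check during off-hours mobilization.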

