Severity Levels

The first step in any incident response process is to determine what actually constitutes an incident. Incidents are then classified by severity, usually with "SEV" definitions, where lower-numbered severities are more urgent. Any operational issue can be classified at one of these severity levels, and in general you may take riskier actions to resolve a higher-severity issue. Anything above SEV-3 (that is, SEV-2 or SEV-1) is automatically considered a "major incident" and gets a more intensive response than a normal incident.
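The ordering rule above can be sketched in a few lines of Python. This is only an illustration: the names `Severity`, `is_major_incident`, and `more_urgent` are hypothetical and are not part of any real tool.

```python
from enum import IntEnum


class Severity(IntEnum):
    """Severity levels; a lower number means a more urgent issue."""
    SEV1 = 1
    SEV2 = 2
    SEV3 = 3
    SEV4 = 4
    SEV5 = 5


def is_major_incident(severity: Severity) -> bool:
    """Return True for anything above SEV-3, i.e. SEV-1 and SEV-2."""
    return severity < Severity.SEV3


def more_urgent(a: Severity, b: Severity) -> Severity:
    """Return whichever of the two severities is more urgent (lower number)."""
    return min(a, b)
```

Using an integer-backed enum keeps the "lower number is more urgent" convention explicit, so the major-incident check reduces to a single comparison.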

Severity | Description | Typical Response
SEV-1 Critical issue that warrants public notification and liaison with executive teams.
    The system is in a critical state and is actively impacting a large number of customers.
    Functionality has been severely impaired for a long time, breaking SLA.
    A security vulnerability that exposes customer data has come to our attention.
Major incident response.
    Page/DM an IC in Slack.
    See During an Incident.
    Notify internal stakeholders.
    Public notification.
SEV-2 Critical system issue actively impacting many customers' ability to use the product.
    Notification pipeline is severely impaired.
    Incident response functionality (ack, resolve, etc) is severely impaired.
    Web app is unavailable or experiencing severe performance degradation for most/all users.
    Monitoring of LEANSTACK systems for major incident conditions is impaired.
    Any other event that a PagerDuty employee deems to warrant incident response.
Major incident response.
Anything above this line is considered a "Major Incident". Our incident response process should be triggered for any major incident.
SEV-3 Stability or minor customer-impacting issues that require immediate attention from service owners.
    Partial loss of functionality, not affecting the majority of customers.
    Something that has the likelihood of becoming a SEV-2 if nothing is done.
    No redundancy in a service (failure of 1 more node will cause outage).
High-Urgency page to service team.
    Work on issue as your top priority.
    Liaise with engineers of affected systems to identify cause.
    If related to a recent deployment, roll back.
    Monitor status and notice if/when it escalates.
    Mention on Slack if you think it has the potential to escalate.
    Trigger incident response if necessary.
SEV-4 Minor issues requiring action, but not affecting customer ability to use the product.
    Performance issues (delays, etc).
    Individual host failure (i.e. one node out of a cluster).
    Delayed job failure (not impacting event & notification pipeline).
    Cron failure (not impacting event & notification pipeline).
Low-Urgency page to service team.
    Work on the issue as your first priority (above "normal" tasks).
    Monitor status and notice if/when it escalates.
SEV-5 Cosmetic issues or bugs, not affecting customer ability to use the product.
    Bugs not impacting the immediate ability to use the system.
Trello ticket.
    Create a Trello ticket and assign to owner of affected system.
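
As a rough illustration, the "Typical Response" column above could be encoded as a lookup table. This is a minimal sketch, not an actual implementation: `TYPICAL_RESPONSE` and `typical_response` are hypothetical names, and the strings simply paraphrase the table.

```python
# Hypothetical lookup table; the entries paraphrase the "Typical Response" column.
TYPICAL_RESPONSE = {
    "SEV-1": [
        "Major incident response",
        "Page/DM an IC in Slack",
        "Notify internal stakeholders",
        "Public notification",
    ],
    "SEV-2": ["Major incident response"],
    "SEV-3": ["High-urgency page to service team"],
    "SEV-4": ["Low-urgency page to service team"],
    "SEV-5": ["Trello ticket assigned to owner of affected system"],
}


def typical_response(severity: str) -> list[str]:
    """Return the typical response steps for a severity label such as 'SEV-3'."""
    key = severity.strip().upper()
    if key not in TYPICAL_RESPONSE:
        raise ValueError(f"Unknown severity: {severity!r}")
    # Return a copy so callers cannot mutate the shared table.
    return list(TYPICAL_RESPONSE[key])
```

For example, `typical_response("sev-3")` returns the single high-urgency paging step, while an unrecognized label raises `ValueError` rather than silently falling back to a default response.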
