Downtime

Overall Services Health, Database Health, and 1 other service are down

Dec 23 at 15:01 (UTC)
Affected services
Overall Services Health
Database Health
Production Endpoints Health

Resolved
Feb 10 at 15:24 (UTC)

Incident summary for 🟣 App Service 5xx Errors: Platform (Oct 27 at 8:17pm CET)
Cause: AppService-5xxErrorSpike
Started at: Oct 27 at 8:17pm CET
Length: 8 minutes
Acknowledged by: Sofia Petrov
Slack channels: #better-stack-monitoring, #status-page-updates
View incident in Better Stack ↗

AI post-mortem
A deployment caused a spike in HTTP 5xx errors on the Platform service in region nbg-2, pushing the 5xx rate above the alert threshold. The on-call engineer identified the faulty deploy and reverted it, and the error rate fell back below the threshold; monitoring then auto-resolved the incident. The post-mortem recommended tighter deployment checks and automated smoke tests to prevent recurrence.

Metadata

Condition #1 - Metric value:
3.4

Condition #1 - Time aggregation:
Average

Condition type:
SingleResourceMultipleMetricCriteria

Description:
HTTP 5xx errors exceeded 2% over the past 10 minutes.

Monitoring service:
Platform

Region:
nbg-2

Severity:
Sev2

severity:
critical

Signal type:
Metric
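
To make the alert condition above concrete, here is a minimal sketch, not the platform's actual implementation, of evaluating a metric signal like this one: the per-minute 5xx error rate is averaged over a 10-minute window ("Time aggregation: Average") and compared against the 2% threshold from the description. The `RequestSample` type and `evaluate_5xx_alert` function are illustrative names, not part of any monitoring API.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Iterable

# Illustrative sketch of the alert condition described above:
# average 5xx error rate over the past 10 minutes, compared to a 2% threshold.

WINDOW = timedelta(minutes=10)
THRESHOLD_PERCENT = 2.0

@dataclass
class RequestSample:          # hypothetical per-minute aggregate
    timestamp: datetime
    total_requests: int
    errors_5xx: int

def evaluate_5xx_alert(samples: Iterable[RequestSample],
                       now: datetime) -> tuple[float, bool]:
    """Return (average 5xx rate in %, whether the alert should fire)."""
    window_start = now - WINDOW
    rates = [
        100.0 * s.errors_5xx / s.total_requests
        for s in samples
        if s.timestamp >= window_start and s.total_requests > 0
    ]
    if not rates:
        return 0.0, False
    avg_rate = sum(rates) / len(rates)   # "Time aggregation: Average"
    return avg_rate, avg_rate > THRESHOLD_PERCENT

# Example: a value of 3.4% (as recorded in the metadata) exceeds 2% and fires the alert.
now = datetime(2025, 10, 27, 19, 17, tzinfo=timezone.utc)  # 8:17pm CET
samples = [RequestSample(now - timedelta(minutes=m), 1000, 34) for m in range(10)]
avg, firing = evaluate_5xx_alert(samples, now)
print(f"avg 5xx rate = {avg:.1f}%, alert firing = {firing}")  # 3.4%, True
```

With samples averaging 3.4%, as recorded in the metadata, the check fires; once the average drops back under 2% (1.2% by 8:25pm), it stops firing, which matches the automatic resolution later in the timeline.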

Timeline

Incident started. · Oct 27, 2025 at 8:17pm CET

Incident update posted to #status-page-updates. Slack · Oct 27, 2025 at 8:17pm CET

Sent an email to Clara Mendes at clara.mendes@shadowfusion.cloud. · Oct 27, 2025 at 8:17pm CET

Incident update posted to #better-stack-monitoring. Slack · Oct 27, 2025 at 8:17pm CET

Filip Hoang commented · Oct 27, 2025 at 8:18pm CET

Can you please investigate ASAP Sofia Petrov? Please feel free to escalate if not resolved soon.

Sofia Petrov commented · Oct 27, 2025 at 8:19pm CET

Already on it Filip Hoang.

Sent an email to Adam Novak at adam.novak@shadowfusion.cloud. · Oct 27, 2025 at 8:20pm CET

Sent an email to Sofia Petrov at sofia.petrov@shadowfusion.cloud. · Oct 27, 2025 at 8:20pm CET

Sent an email to Priya Nair at priya.nair@shadowfusion.cloud. · Oct 27, 2025 at 8:20pm CET

Sent an email to Filip Hoang at filip@betterstack.com. · Oct 27, 2025 at 8:20pm CET

Sent an email to Erik Johansson at erik.johansson@shadowfusion.cloud. · Oct 27, 2025 at 8:20pm CET

Sent an email to Liam Walker at liam.walker@shadowfusion.cloud. · Oct 27, 2025 at 8:20pm CET

Sofia Petrov commented · Oct 27, 2025 at 8:20pm CET

Seems it's related to Liam Walker's deploy that's causing an increased error rate. Reverting NOW.

Sofia Petrov commented · Oct 27, 2025 at 8:22pm CET

Revert still in progress, seems it's going up though.

Filip Hoang commented · Oct 27, 2025 at 8:23pm CET

If you're on it please ACK the incident Sofia Petrov.

Sofia Petrov commented · Oct 27, 2025 at 8:23pm CET

Sorry my bad!

Incident acknowledged by Sofia Petrov. · Oct 27, 2025 at 8:23pm CET

Incident update posted to #status-page-updates. Slack · Oct 27, 2025 at 8:23pm CET

Incident update posted to #better-stack-monitoring. Slack · Oct 27, 2025 at 8:23pm CET

Sofia Petrov commented · Oct 27, 2025 at 8:24pm CET

Revert completed.

Incident resolved automatically. · Oct 27, 2025 at 8:25pm CET

Incident update posted to #better-stack-monitoring. Slack · Oct 27, 2025 at 8:25pm CET

Filip Hoang commented · Oct 27, 2025 at 8:26pm CET

Please PM for this one since it was caused by a deploy. Sofia Petrov Liam Walker 

Sofia Petrov commented · Oct 27, 2025 at 8:46pm CET

Post Mortem: App Service 5xx Errors (Platform)

Overview
An incident involving elevated HTTP 5xx error rates was detected on the Microsoft Azure Platform in region nbg-2 and quickly escalated on October 27, 2025, between 8:17pm and 8:25pm CET. The error rate exceeded the automated threshold and triggered a Sev2 (critical) alert, affecting platform stability and user experience during this period.
Timeline

Time (CET) - Event
8:17pm - Incident started: HTTP 5xx errors exceeded 2% (measured at 3.4% avg over the last 10 min)
8:17pm - Notifications posted to #better-stack-monitoring and #status-page-updates (Slack)
8:17pm-8:20pm - On-call engineer Clara Mendes notified; escalation policy triggered for the entire team
8:18pm - Investigation requested by Filip Hoang; Sofia Petrov assigned
8:19pm - Initial troubleshooting by Sofia Petrov underway
8:20pm - Error rate spiked to 6.8% avg over the last 10 min; Liam Walker's deploy identified as the likely culprit
8:20pm - Revert operation initiated; relevant stakeholders notified via email
8:22pm - Revert still in progress; updates shared in the Slack status channels
8:23pm - Incident acknowledged by Sofia Petrov; updates posted to the monitoring and status Slack channels
8:24pm - Revert completed; platform error rate started to decline
8:25pm - Error rate normalized (1.2% avg over the last 10 min); incident automatically resolved

Impact

Users experienced increased 5xx errors in the affected region, potentially resulting in failed API calls and degraded application functionality for several minutes.
Monitoring flagged the error as critical (Sev2); incident visibility extended to engineering, ops, and affected team leads.

Root Cause

A recent deployment by Liam Walker inadvertently caused an error spike, increasing the HTTP 5xx error rate past threshold levels (up to 6.8% avg).
The errors were limited to platform services in region nbg-2.

Resolution

Rapid identification of the problematic deployment led to an immediate revert operation, initiated and completed by Sofia Petrov.
The error rate returned to acceptable levels (1.2% avg) and the incident was resolved automatically by the monitoring system.

Communication and Escalation

A simple team-wide escalation policy was activated; all relevant team members were updated via Slack and email throughout the incident lifecycle.
The status page and internal monitoring channels reflected status changes and resolution in a timely manner.

Action Items

Conduct a deeper analysis of the deployment process to ensure earlier detection of error-causing changes.
Add automated smoke tests for AppService deployments targeting HTTP error rates and service health indicators (see the sketch after this list).
Consider adjusting alerting thresholds or aggregation windows to avoid false positives and improve signal fidelity.
Review the internal escalation workflow for further communication optimization during Sev2 incidents.
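
As a starting point for the smoke-test action item above, the sketch below shows one possible shape for such a check, assuming a hypothetical post-deploy step: probe a few endpoints and fail (so the pipeline can roll back) if the observed 5xx rate exceeds a conservative threshold. The base URL, endpoint paths, sample counts, and threshold are placeholders, not values taken from this incident.

```python
import urllib.request
import urllib.error

# Hypothetical post-deploy smoke test: probe a few endpoints and fail the
# deployment if the observed 5xx rate exceeds a conservative threshold.
BASE_URL = "https://example.internal"          # placeholder, not the real service
ENDPOINTS = ["/healthz", "/api/v1/status"]     # placeholder paths
SAMPLES_PER_ENDPOINT = 20
MAX_5XX_RATE = 0.01                            # 1%, stricter than the 2% alert

def smoke_test() -> bool:
    total, errors_5xx = 0, 0
    for path in ENDPOINTS:
        for _ in range(SAMPLES_PER_ENDPOINT):
            total += 1
            try:
                with urllib.request.urlopen(BASE_URL + path, timeout=5) as resp:
                    if resp.status >= 500:
                        errors_5xx += 1
            except urllib.error.HTTPError as exc:
                if exc.code >= 500:
                    errors_5xx += 1
            except urllib.error.URLError:
                errors_5xx += 1   # treat an unreachable service as a failure
    rate = errors_5xx / total
    print(f"5xx rate during smoke test: {rate:.1%}")
    return rate <= MAX_5XX_RATE

if __name__ == "__main__":
    # A CI/CD pipeline could run this right after a deploy and trigger an
    # automatic rollback when it exits with a non-zero status.
    raise SystemExit(0 if smoke_test() else 1)
```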

This post mortem reflects the steps taken, impact, and future actions based on incident records and monitoring platform data.

Updated
Dec 23 at 18:57 (UTC)

Production Endpoints Health recovered.

Updated
Dec 23 at 18:56 (UTC)

Overall Services Health and Database Health recovered.

Updated
Dec 23 at 18:22 (UTC)

Overall Services Health went down.

Updated
Dec 23 at 18:14 (UTC)

Production Endpoints Health went down.

Updated
Dec 23 at 18:13 (UTC)

Database Health is degraded.

Updated
Dec 23 at 16:07 (UTC)

Production Endpoints Health recovered.

Updated
Dec 23 at 16:07 (UTC)

Production Endpoints Health went down.

Updated
Dec 23 at 15:03 (UTC)

Production Endpoints Health recovered.

Updated
Dec 23 at 15:02 (UTC)

Production Endpoints Health went down.

Updated
Dec 23 at 15:01 (UTC)

Production Endpoints Health recovered.

Created
Dec 23 at 15:01 (UTC)

Production Endpoints Health went down.