Incident summary for 🟣 App Service 5xx Errors: Platform (Oct 27 at 8:17pm CET)
Cause: AppService-5xxErrorSpike
Started at: Oct 27 at 8:17pm CET
Length: 8 minutes
Acknowledged by: Sofia Petrov
Slack channels: #better-stack-monitoring, #status-page-updates
View incident in Better Stack ↗
AI post-mortem
A deployment caused a spike in HTTP 5xx errors in the Platform service in region nbg-2, pushing the 5xx rate above the alert threshold. The responding engineer identified the faulty deploy and reverted it, the error rate fell back below the threshold, and monitoring then auto-resolved the incident. The post-mortem recommends tighter deployment checks and automated smoke tests to prevent recurrence.
Metadata
Condition #1 - Metric value: 3.4%
Condition #1 - Time aggregation: Average
Condition type: SingleResourceMultipleMetricCriteria
Description: HTTP 5xx errors exceeded 2% over the past 10 minutes (see the threshold sketch below).
Monitoring service: Platform
Region: nbg-2
Severity: Sev2 (critical)
Signal type: Metric
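For reference, the alert condition above amounts to a simple check: average the per-minute 5xx error rate over a 10-minute window and compare it to the 2% threshold. The sketch below is illustrative only, not the actual Better Stack or Azure Monitor rule; the sample values are hypothetical and merely chosen so their average matches the reported 3.4%.

```python
from statistics import mean

# Mirror of the alert condition in the metadata above:
# average HTTP 5xx error rate over the past 10 minutes, compared to 2%.
THRESHOLD_PERCENT = 2.0   # "HTTP 5xx errors exceeded 2%"
WINDOW_MINUTES = 10       # "over the past 10 minutes"


def should_alert(per_minute_rates: list[float]) -> tuple[bool, float]:
    """per_minute_rates: 5xx error rate (in %) for each of the last minutes."""
    window = per_minute_rates[-WINDOW_MINUTES:]
    avg_rate = mean(window)               # "Time aggregation: Average"
    return avg_rate > THRESHOLD_PERCENT, avg_rate


# Hypothetical per-minute samples, chosen so the 10-minute average is 3.4%.
samples = [0.4, 0.5, 0.6, 1.2, 2.8, 4.1, 5.0, 5.6, 6.3, 7.5]
fired, avg = should_alert(samples)
print(f"alert={fired}, 10-minute average={avg:.1f}%")
```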
Timeline
Incident started. · Oct 27, 2025 at 8:17pm CET
Incident update posted to #status-page-updates. Slack · Oct 27, 2025 at 8:17pm CET
Sent an email to Clara Mendes at clara.mendes@shadowfusion.cloud. · Oct 27, 2025 at 8:17pm CET
Incident update posted to #better-stack-monitoring. Slack · Oct 27, 2025 at 8:17pm CET
Filip Hoang commented · Oct 27, 2025 at 8:18pm CET
Can you please investigate ASAP, Sofia Petrov? Please feel free to escalate if not resolved soon.
Sofia Petrov commented · Oct 27, 2025 at 8:19pm CET
Already on it, Filip Hoang.
Sent an email to Adam Novak at adam.novak@shadowfusion.cloud. · Oct 27, 2025 at 8:20pm CET
Sent an email to Sofia Petrov at sofia.petrov@shadowfusion.cloud. · Oct 27, 2025 at 8:20pm CET
Sent an email to Priya Nair at priya.nair@shadowfusion.cloud. · Oct 27, 2025 at 8:20pm CET
Sent an email to Filip Hoang at filip@betterstack.com. · Oct 27, 2025 at 8:20pm CET
Sent an email to Erik Johansson at erik.johansson@shadowfusion.cloud. · Oct 27, 2025 at 8:20pm CET
Sent an email to Liam Walker at liam.walker@shadowfusion.cloud. · Oct 27, 2025 at 8:20pm CET
Sofia Petrov commented · Oct 27, 2025 at 8:20pm CET
Seems it's related to Liam Walker's deploy, which is causing the increased error rate. Reverting NOW.
Sofia Petrov commented · Oct 27, 2025 at 8:22pm CET
Revert still in progress; the error rate still seems to be going up, though.
Filip Hoang commented · Oct 27, 2025 at 8:23pm CET
If you're on it, please ACK the incident, Sofia Petrov.
Sofia Petrov commented · Oct 27, 2025 at 8:23pm CET
Sorry my bad!
Incident acknowledged by Sofia Petrov. · Oct 27, 2025 at 8:23pm CET
Incident update posted to #status-page-updates. Slack · Oct 27, 2025 at 8:23pm CET
Incident update posted to #better-stack-monitoring. Slack · Oct 27, 2025 at 8:23pm CET
Sofia Petrov commented · Oct 27, 2025 at 8:24pm CET
Revert completed.
Incident resolved automatically. · Oct 27, 2025 at 8:25pm CET
Incident update posted to #better-stack-monitoring. Slack · Oct 27, 2025 at 8:25pm CET
Filip Hoang commented · Oct 27, 2025 at 8:26pm CET
Please write a post-mortem for this one since it was caused by a deploy. Sofia Petrov Liam Walker
Sofia Petrov commented · Oct 27, 2025 at 8:46pm CET
Post Mortem: App Service 5xx Errors (Platform)
Overview
Elevated HTTP 5xx error rates were detected on the Platform service (Microsoft Azure) in region nbg-2 on October 27, 2025, between 8:17pm and 8:25pm CET. Error rates exceeded the automated threshold and triggered a Sev2 (critical) alert, affecting platform stability and user experience during that period.
Timeline
Time (CET) - Event Description
8:17pm - Incident started: HTTP 5xx errors exceeded 2% (measured at 3.4% avg over the last 10 min)
8:17pm - Notifications posted to #better-stack-monitoring and #status-page-updates (Slack)
8:17pm-8:20pm - On-call Clara Mendes notified; escalation policy triggered for the entire team
8:18pm - Investigation requested by Filip Hoang; Sofia Petrov assigned
8:19pm - Initial troubleshooting by Sofia Petrov underway
8:20pm - Error rate spikes to 6.8% avg over the last 10 min; Liam Walker's deploy identified as the likely culprit
8:20pm - Revert operation initiated; relevant stakeholders notified via email
8:22pm - Revert still in progress; updates shared on Slack status channels
8:23pm - Incident acknowledged by Sofia Petrov; updates posted to monitoring and status Slack channels
8:24pm - Revert completed; platform error rate starts to decline
8:25pm - Error rate normalized (1.2% avg over the last 10 min); incident automatically resolved
Impact
Users experienced increased 5xx errors in the affected region, potentially resulting in failed API calls and degraded application functionality for several minutes.
Monitoring flagged the incident as Sev2 (critical); visibility extended to engineering, ops, and affected team leads.
Root Cause
A recent deployment by Liam Walker inadvertently caused an error spike, increasing the HTTP 5xx error rate past threshold levels (up to 6.8% avg).
The impact was limited to Platform services in region nbg-2.
Resolution
Rapid identification of the problematic deployment led to an immediate revert, initiated and completed by Sofia Petrov (one possible shape of this step is sketched below).
The error rate returned to acceptable levels (1.2% avg) and the incident was resolved automatically by the monitoring system.
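For context on the resolution step, the sketch below shows one common way such a revert is performed: revert the offending commit on the release branch and let the existing CI/CD pipeline redeploy. This is a hypothetical illustration; the commit hash, branch name, and pipeline behaviour are assumptions, not details from the incident record.

```python
import subprocess

# Hypothetical revert helper: undo the offending commit on the release branch
# and push, letting the existing CI/CD pipeline redeploy the previous build.
# BAD_COMMIT and BRANCH are placeholders, not values from the incident record.
BAD_COMMIT = "<sha-of-faulty-deploy>"
BRANCH = "main"


def revert_and_push(commit: str, branch: str) -> None:
    # Each step raises CalledProcessError (and stops) if the git command fails.
    subprocess.run(["git", "checkout", branch], check=True)
    subprocess.run(["git", "pull", "--ff-only"], check=True)
    subprocess.run(["git", "revert", "--no-edit", commit], check=True)
    subprocess.run(["git", "push", "origin", branch], check=True)


if __name__ == "__main__":
    revert_and_push(BAD_COMMIT, BRANCH)
```

In practice the revert may instead be a rollback to the previous release artifact or deployment slot; the git-revert flow above is just one common variant.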
Communication and Escalation
The simple team-wide escalation policy was activated; all relevant team members were updated via Slack and email throughout the incident lifecycle.
The status page and internal monitoring channels reflected timely status changes and resolution.
Action Items
Conduct deeper analysis of deployment process to ensure early detection of error-causing changes.
Add automated smoke tests for AppService deployments targeting HTTP error rates and service health indicators (a minimal sketch follows at the end of this post mortem).
Consider adjusting alerting thresholds or aggregation windows to avoid false positives and improve signal fidelity.
Review internal escalation workflow for further communication optimization during Sev2 incidents.
This post mortem reflects the steps taken, impact, and future actions based on incident records and monitoring platform data.
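As a starting point for the smoke-test action item above, here is one possible shape of a post-deploy check: request a few key endpoints and fail the deployment if any of them return an HTTP 5xx. The base URL and endpoint paths are placeholders, not the team's actual configuration.

```python
import sys
import urllib.error
import urllib.request

# Hypothetical post-deploy smoke test: request a few key endpoints and fail
# the pipeline (non-zero exit code) if any of them return an HTTP 5xx.
BASE_URL = "https://platform.example.com"  # placeholder, not the real host
ENDPOINTS = ["/healthz", "/api/status"]    # placeholder paths


def smoke_test() -> bool:
    ok = True
    for path in ENDPOINTS:
        url = BASE_URL + path
        try:
            with urllib.request.urlopen(url, timeout=5) as resp:
                status = resp.status
        except urllib.error.HTTPError as exc:
            status = exc.code
        except urllib.error.URLError as exc:
            print(f"FAIL {url}: {exc.reason}")
            ok = False
            continue
        if status >= 500:
            print(f"FAIL {url}: HTTP {status}")
            ok = False
        else:
            print(f"OK   {url}: HTTP {status}")
    return ok


if __name__ == "__main__":
    sys.exit(0 if smoke_test() else 1)
```

Run as a required pipeline step right after deployment, a check like this fails fast on server errors; it could be extended to also sample the 5xx-rate metric used in the alert condition.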