Stalled Event Processing

Postmortem Jul 19, 2018 8:23 PM MDT

Status: Complete; action items done

Summary: A subset of customers did not receive notifications for approximately eight (8) hours because of a silent error in the official database driver that affected persistent tailing cursor connections after an automated database failover.

Impact: An estimated 43 event notification broadcasts were delayed during off-peak hours.

Root Cause: The MongoDB Ruby driver did not break established persistent tailing cursor connections, as it was expected to, upon a database topology change. As a result, Hund’s event handling software did not encounter any errors, yet it was no longer ingesting new events.

Trigger: An automatic database software update triggered a necessary database topology change by electing a new primary, which ended transmission of new events via existing tailing cursor connections.

Resolution: Upon notification, Hund immediately restarted the affected event handling application. After an investigation, the root cause was addressed by having Hund’s event handling software respond to database topology changes (which the database driver is responsible for handling, but did not) by restarting tailing cursors.
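
The fix described above could look roughly like the following Ruby sketch. It is not Hund's actual code: the RestartableTailer class, its methods, and the FakeCursor stand-in are all hypothetical, and how a topology-change notification reaches topology_changed (for example, through the driver's monitoring hooks) is an assumption that this postmortem does not describe.

```ruby
# Sketch only: every name in this block is hypothetical.
# The tailer owns one cursor and replaces it whenever it is told that the
# database topology changed, instead of trusting the driver to notice.
class RestartableTailer
  attr_reader :restarts

  # open_cursor: a block that returns an object responding to #close.
  def initialize(&open_cursor)
    @open_cursor = open_cursor
    @lock = Mutex.new
    @cursor = nil
    @restarts = 0
  end

  # Open the initial cursor.
  def start
    @lock.synchronize { @cursor = @open_cursor.call }
    self
  end

  # Call this from whatever topology-change hook is available (assumed).
  def topology_changed(_event = nil)
    @lock.synchronize do
      @cursor&.close               # drop the cursor that may be silently dead
      @cursor = @open_cursor.call  # open a fresh one against the new primary
      @restarts += 1
    end
  end
end

# Stand-in for a real tailing cursor, used only for this demonstration.
FakeCursor = Struct.new(:id, :closed) do
  def close
    self.closed = true
  end
end

opened = []
tailer = RestartableTailer.new do
  cursor = FakeCursor.new(opened.size + 1, false)
  opened << cursor
  cursor
end

tailer.start
tailer.topology_changed  # simulate a primary election
```

In a real service the block passed to RestartableTailer.new would presumably reopen the cursor from the last processed position, so that events written during the gap are still delivered once the cursor is restarted.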

Detection: Customers informed us via support channels that they were not receiving notifications.

Action Items:

(Mitigation) Restart affected service applications.
(Prevention) Have the event handling software restart tailing cursor connections when necessary, including when the database driver fails to do so.
(Prevention) Establish internal alarms for similar operating conditions.
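
One way an internal alarm for this kind of silent stall might work is a watchdog that tracks how long it has been since the event handler last ingested anything. The Ruby sketch below is illustrative only: the IngestionWatchdog class, its method names, and the 15-minute threshold are assumptions, not a description of the alarms Hund implemented.

```ruby
# Sketch only: class name, method names, and threshold are hypothetical.
# The watchdog reports a stall when no event has been recorded for longer
# than max_silence_seconds. The clock is injectable so it can be tested.
class IngestionWatchdog
  def initialize(max_silence_seconds:,
                 clock: -> { Process.clock_gettime(Process::CLOCK_MONOTONIC) })
    @max_silence = max_silence_seconds
    @clock = clock
    @last_event_at = @clock.call
  end

  # Call this each time the event handler ingests an event.
  def record_event
    @last_event_at = @clock.call
  end

  # Seconds elapsed since the most recent ingested event.
  def silence_seconds
    @clock.call - @last_event_at
  end

  # True when ingestion has been silent for longer than the threshold.
  def stalled?
    silence_seconds > @max_silence
  end
end

# Demonstration with a fake clock measured in seconds.
now = 0.0
watchdog = IngestionWatchdog.new(max_silence_seconds: 900, clock: -> { now })

now = 600.0
early = watchdog.stalled?      # 10 minutes of silence, under the threshold

now = 1200.0
late = watchdog.stalled?       # 20 minutes of silence, over the threshold

watchdog.record_event
recovered = watchdog.stalled?  # an event arrived, so the silence is reset
```

Silence alone can be legitimate during off-peak hours, so a check like this is most useful when paired with a periodic synthetic event that is expected to arrive on a known schedule.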

Lessons Learned

What went well

Fast mitigation upon becoming aware of the issue (~10 minutes).
Quick discovery of the root cause (~20 minutes) by a team member.
Same-day resolution of the root cause.
All other parts of our service recovered as is typical for such a database topology change.
When delayed events were processed, notifications that no longer matched the current issue status were marked as out-of-date.

What went wrong

Customers had to inform us of the issue during off-work periods.
Partial service unavailability where redundancy did not help.
Silent failure of software due to an issue in the official database driver.
The official MongoDB Ruby driver did not behave as expected. (Hund also uses MongoDB database drivers for other languages, which likewise consume tailable cursors, and these responded appropriately under the same conditions.)

Where we got lucky

Customers informed us of the issue.
Limited customer impact during off-peak notification hours.

Timeline

2018-07-19 (all times UTC)

06:00 A database re-election was completed as a result of automatic maintenance/updates.
06:00 PARTIAL OUTAGE BEGINS — A Hund event handler service had its tailing cursors silently fail without the TCP connections being broken by the official database driver used by the application.
13:59 INCIDENT BEGINS — Customer escalation of issue.
14:01 Investigation begins.
14:12 PARTIAL OUTAGE ENDS — Restarted affected service applications. Delayed events are processed. Root cause analysis begins.
14:19 Root cause suggested by a team member.
14:35 Root cause determined.
16:26 Primary prevention measure proposed.
17:30 Root cause replicated.
17:57 Secondary prevention alarms implemented.
22:00 INCIDENT ENDS — Production services patched with primary prevention fix for the root cause.

Resolved Jul 19, 2018 4:00 PM MDT

Opened Jul 19, 2018 7:59 AM MDT

This issue was opened retrospectively.