There have been no regressions on our platform. We will schedule maintenance to reintroduce the previously faulty node into the database cluster when required. For now, we are closing this incident.
Jan 26, 2018 - 16:23 CET
We continue to monitor the situation. While we have finished what we considered to be the emergency maintenance on the faulty database node, we are still performing lower-profile tests on it to validate that our steps have indeed resolved the problem.
Jan 26, 2018 - 12:13 CET
The cluster rebalancing has been completed and all services are fully operational. Emergency maintenance on the faulty database node is ongoing.
Jan 26, 2018 - 10:43 CET
Some of the necessary steps to prevent future regressions require significant downtime for one of our database nodes. We will rebalance our database cluster to take the problematic node out of service. This may cause short periods of unavailability on the Portal and Customer API.
Jan 26, 2018 - 10:31 CET
All systems have remained stable over the past hour. We are currently still investigating the root cause and taking measures to prevent this from occurring again in the future.
Jan 26, 2018 - 10:06 CET
All services are back online. We are closely monitoring the situation for regressions, as well as addressing the problem that caused this outage in the first place.
Jan 26, 2018 - 09:15 CET
The Portal and Customer API are back online. We are currently still working on the data processing backend.
Jan 26, 2018 - 09:10 CET
Due to issues with our database cluster, we are seeing intermittent failures across various services. The cause has been identified and we are currently working on a solution.