Impending urgent security updates and system reboots

Incident Report for Patchman

Resolved

The first series of maintenance operations has been completed. Further maintenance will be needed to cover some remaining issues, but since that will take more time to prepare, we are closing this incident now and will open a new one once the new window is set.

If you have any questions about the past or future maintenance, please don't hesitate to contact us at support@patchman.co.

Posted Jan 19, 2018 - 12:41 CET

Update

This maintenance batch has completed, and all affected services are back online.

Posted Jan 17, 2018 - 05:15 CET

Update

The Portal, Customer API and Data Processing Backend have been temporarily disabled pending maintenance on multiple clusters that these systems depend on.

Posted Jan 17, 2018 - 05:01 CET

Identified

Our next batch will run tonight at 5:00 AM CET and will include multiple components of multiple clusters. To minimize the risk of unexpected side effects, we will temporarily shut down the Portal web interface and Customer API altogether while maintenance is running. We expect the downtime to last roughly 15 minutes.

Posted Jan 16, 2018 - 09:43 CET

Monitoring

Rebalancing has completed.

We will next update this incident once the next maintenance batch has been scheduled.

Posted Jan 15, 2018 - 10:43 CET

Update

After successful security maintenance last night, we will now be rebalancing our clusters to bring everything back to the original and stable state. This may once again cause some Agent API connections to reset.

Posted Jan 15, 2018 - 10:01 CET

Update

All preparations for tonight's maintenance have been completed. None of the maintenance should have any noticeable service impact for our customers.

Posted Jan 14, 2018 - 22:44 CET

Update

A batch of servers will be rebooted around 3 AM UTC tonight, including machines in the message queue and Agent API clusters. To minimize the impact, we will preventively take certain nodes out of their respective clusters for the duration of the maintenance and rebalance some clusters to handle the modifications. As a result of these changes, you may see Agent API connections being cycled, and the Data Processing backends will be running at lowered capacity for the night.

Posted Jan 14, 2018 - 22:10 CET

Identified

Our infrastructure provider has notified us that they have started maintenance on our systems that involves full machine reboots. Due to the severity of the situation, all of this will occur at very short notice. We will do our best to keep this post updated as maintenance progresses to inform you of which systems may be temporarily unavailable as a result.

Posted Jan 09, 2018 - 23:48 CET

Monitoring

All systems are fully operational again and the backlog on the Agent API has been processed.

We do expect to take more steps in the coming days, so we will keep this incident open. For now, we will be monitoring our systems to verify that our maintenance has no unintended side effects.

Posted Jan 05, 2018 - 14:02 CET

Update

Our final step of maintenance will involve a short downtime on the Portal and Customer API of roughly 2 minutes.

Posted Jan 05, 2018 - 13:56 CET

Update

All connections to the Agent API are now re-establishing.

Posted Jan 05, 2018 - 12:54 CET

Update

The networking problems have been resolved. We are continuing with a maintenance step that will reset all manager connections one final time.

Posted Jan 05, 2018 - 12:50 CET

Update

We are currently seeing networking issues on the Agent API that appear unrelated to the current maintenance. We will be investigating this before continuing with the maintenance.

Posted Jan 05, 2018 - 12:32 CET

Update

The Agent API is fully operational again. Note that due to the significant downtime, our data processing backend is currently dealing with a small backlog, which will be processed over the next few hours.

Posted Jan 05, 2018 - 12:07 CET

Update

The Agent API is resuming normal operation. It may take some time for all connections to be restored.

Posted Jan 05, 2018 - 11:45 CET

Update

Some of the maintenance is unfortunately taking longer than initially estimated. The Agent API is currently still unavailable. We are working hard to make sure it resumes service as soon as possible.

Posted Jan 05, 2018 - 11:40 CET

Update

We are now temporarily disabling our entire Agent API cluster to give ourselves the freedom to perform all the necessary steps in rapid succession without requiring complete intermediate connection recovery. We expect this downtime to total 5 to 10 minutes.

Posted Jan 05, 2018 - 11:17 CET

Update

We have completed the initial batch of updates that can be applied without service interruption. Our next step is to perform selective updates to our Agent API platform. Since this cluster allows connections to fail over from one node to another, we expect little interruption, but you may see temporary connection loss reported by your agent.

Posted Jan 05, 2018 - 10:26 CET

Investigating

We would like to make our customers aware that we, in conjunction with our infrastructure provider, will be performing high-priority security updates and system reboots at very short notice to address recently discovered vulnerabilities in the CPU architecture of our infrastructure, published and covered under the names Meltdown (CVE-2017-5754) and Spectre (CVE-2017-5753, CVE-2017-5715).

We will be updating this post with more information as we progress and will inform you of any service unavailability resulting from this emergency maintenance.

Posted Jan 04, 2018 - 10:30 CET