Resolved: Outage DC3HAM

Dear Customer,

A minor change on our external firewall, followed by the (normal) subsequent sync to the secondary device, caused both firewalls to go into a "disabled" state and stop forwarding packets.

Deactivating and reactivating the High Availability feature solved the problem.

The connection interruption lasted from 11:08 to 11:41 CEST.

We are in contact with the developers to find the root cause of this issue.

We apologize for any inconvenience.

Resolved: DC3HAM Storage Outage

RESOLVED

UPDATE 2021-08-28 02:59PM CEST
On Friday, 27 August, at 19:30 a redundant storage cluster in our colocation DC3.HAM failed during normal operations.
After on-site analysis we found that the storage cluster had stopped all services due to a suspected split-brain error.
As a result of the storage cluster failure, virtual servers running on VMware could not operate properly.

Repair of the storage cluster was started immediately after the analysis and finished around 5 a.m. on 28 August.
After the storage recovery, all running virtual servers were restarted and checked; all production systems were up and running again by 09:45 a.m. on 28 August.

We are continuing to analyze the root cause.

UPDATE 2021-08-28 10:25AM CEST
Most systems are back; we are working to fix the remaining problems, mainly on the QA system.

UPDATE 2021-08-27 07:30PM CEST
The VM storage cluster is currently not working as expected.
As a result, sites and services are not available right now.
We are working with the highest priority to resolve this issue as fast as possible.

Resolved: DC3HAM Network Outage

Update 04:26 PM CEST:

RESOLVED

Commercial power was restored, bringing services back to a stable state.

We will provide details about the exact root cause once we have received them from the provider Lumen.

Update 04:12 PM CEST:

Transport NOC reports the main power breakers have been reset and commercial power is restored to the location. The team is working to turn individual breakers on one at a time to restore equipment. Services will begin to restore as each breaker is energized.

Update 02:59 PM CEST:

Field Operations have arrived on site and determined a commercial power failure to be the cause of impact to services. The local power provider has been engaged to assist with restoral efforts.

Update 02:05 PM CEST:

Lumen is still working on the issue.

There is a major network event in Frankfurt, Germany, that affects our services on a global scale. Depending on the routing, our services in our colocation in Hamburg might not be reachable. The provider Lumen is working on the issue; we will update this notice as soon as we know more.

Resolved: Connectivity issues in AWS AZ in Region Frankfurt

https://status.aws.amazon.com/

7:24 AM PDT Starting at 5:07 AM PDT we experienced increased connectivity issues for some instances, degraded performance for some EBS volumes and increased error rates and latencies for the EC2 APIs in a single Availability Zone (euc1-az3) in the EU-CENTRAL-1 Region. By 6:03 AM PDT, API error rates had returned to normal levels, but some Auto Scaling workflows continued to see delays until 6:35 AM PDT. By 6:10 AM PDT, the vast majority of EBS volumes with degraded performance had been resolved as well, and by 7:05 AM PDT, the vast majority of affected instances had been recovered, some of which may have experienced a power cycle. A small number of remaining instances are hosted on hardware which was adversely affected by this event and require additional attention. We continue to work to recover all affected instances and have opened notifications for the remaining impacted customers via the Personal Health Dashboard. For immediate recovery, we recommend replacing any remaining affected instances if possible.

6:29 AM PDT We continue to make progress in resolving the connectivity issues affecting some instances in a single Availability Zone (euc1-az3) in the EU-CENTRAL-1 Region. The increased error rates and latencies for the RunInstances and CreateSnapshot APIs have been resolved, as well as the degraded performance for some EBS volumes within the affected Availability Zone. We continue to work on the remaining EC2 instances that are still impaired as a result of this event, some of which may have experienced a power cycle. While we do not expect any further impact at this stage, we would recommend continuing to utilize other Availability Zones in the EU-CENTRAL-1 region until this issue has been resolved.
6:05 AM PDT We are seeing increased error rates and latencies for the RunInstances and CreateSnapshot APIs, and increased connectivity issues for some instances in a single Availability Zone (euc1-az3) in the EU-CENTRAL-1 Region. We have resolved the networking issues that affected the majority of instances within the affected Availability Zone, but continue to work on some instances that are experiencing degraded performance for some EBS volumes. Other Availability Zones are not affected by this issue. We would recommend failing away from the affected Availability Zone until this issue has been resolved.
5:29 AM PDT We are investigating increased error rates and latencies for the EC2 APIs and connectivity issues for some instances in a single Availability Zone in the EU-CENTRAL-1 Region.
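
For orientation, below is a minimal sketch of how one might act on the AWS recommendation above and identify which of one's own instances sit in the affected Availability Zone. This snippet is our own illustration, not part of the AWS notice; it assumes Python with boto3 and credentials configured for eu-central-1. Note that the AZ ID euc1-az3 maps to a different zone name (eu-central-1a/b/c) in every AWS account, so it has to be resolved first.

    # Illustrative sketch (assumption: Python 3 + boto3, credentials for eu-central-1).
    # Resolves the AZ ID from the AWS status message (euc1-az3) to this account's
    # zone name, then lists instances placed in that zone so they can be failed
    # over or replaced.
    import boto3

    ec2 = boto3.client("ec2", region_name="eu-central-1")

    # AZ IDs (euc1-az3) are stable across accounts; zone names are not.
    zones = ec2.describe_availability_zones(ZoneIds=["euc1-az3"])
    zone_name = zones["AvailabilityZones"][0]["ZoneName"]

    # Find all instances placed in the affected zone.
    paginator = ec2.get_paginator("describe_instances")
    for page in paginator.paginate(
        Filters=[{"Name": "availability-zone", "Values": [zone_name]}]
    ):
        for reservation in page["Reservations"]:
            for instance in reservation["Instances"]:
                print(instance["InstanceId"], instance["State"]["Name"])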

While performing regular updates, we encountered an issue on AWS (see below).
We are publishing this because it might have an impact on further proceedings during our regular patching.

https://status.aws.amazon.com/
1:24 PM PDT We are investigating connectivity issues for some EC2 instances in a single Availability Zone (euc1-az1) in the EU-CENTRAL-1 Region.

https://status.aws.amazon.com/
6:54 PM PDT RESOLVED: Connectivity Issues & API Errors. The connectivity issues for some EC2 instances in a single Availability Zone (euc1-az1) in the EU-CENTRAL-1 Region have been resolved and the service is operating normally.


DC3.HAM incident

We have been encountering a problem in our colocation DC3.HAM since around 11:45; some workloads are not reachable. We are analyzing the problem and will update you as soon as possible.

Update: Recovery started at 11:53 and was finished by 12:00 at the latest; all systems are back. Several VMware hosts lost their storage connection, and virtual machines were migrated to other hosts automatically.

Update: The interruption of the storage connection may have been caused by a bug in the switch firmware; a related patch will be implemented during the regular patching window in the night from Thursday to Friday.

DC3.HAM connectivity interruptions due to Lumen maintenance

Dear Customer,

Lumen, the supplier of our colocation DC3.HAM, has been carrying out internal maintenance within its network. This maintenance was designated as ESSENTIAL. The nature of the work was to perform cable diversion works: rerouting of the Lumen cable in Leverkusen due to the construction of a new gas pipeline.

Unfortunately, we experienced connectivity interruptions to our network between 11:00 pm on 21 April and 00:44 am on 22 April caused by this maintenance.