DC3HAM shared infrastructure services Issue on 29th May 2024

UPDATE 2024-05-30 13:50 CEST

We have verified that all systems are currently working as designed. Availability of the affected services should be back to normal. We will continue to closely monitor the situation.


UPDATE 2024-05-29 17:41 CEST

Mitigation is still ongoing, as large data volumes have to be moved. We continue to work 24×7.


Dear Customer,

We are experiencing temporary downtimes of shared infrastructure services such as GitLab, Harbor, and Rancher. Customer environments are not directly affected; in some cases, however, there are side effects, e.g. some deployments are not possible. Investigation has been difficult, but we are working with the highest priority on mitigating the problem.
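For customers who want to track availability themselves, a minimal TCP reachability probe can confirm when the shared services respond again. This is a generic sketch; the hostnames below are placeholders, not the actual service endpoints:

```python
import socket

def is_reachable(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers DNS failures, refused connections, and timeouts.
        return False

if __name__ == "__main__":
    # Placeholder endpoints -- substitute the actual shared-service hosts.
    services = {
        "GitLab": ("gitlab.example.com", 443),
        "Harbor": ("harbor.example.com", 443),
        "Rancher": ("rancher.example.com", 443),
    }
    for name, (host, port) in services.items():
        print(f"{name}: {'up' if is_reachable(host, port) else 'DOWN'}")
```

Note that a successful TCP connect only shows the endpoint is reachable, not that the service behind it is healthy.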

Best regards
Your CONVOTIS Munich Managed Service Team

DC3HAM Firewall Issue on 9th May 2023

Dear Customer,

A standard change on the internal firewall in our DC3HAM colocation triggered a high-availability problem because of a years-old, previously unnoticed configuration error. This error resulted in the intrusion prevention system blocking hosts that should not have been blocked, including DNS. These cascading errors unfortunately took a while to clean up.

A complete review of the firewall rules is already underway, performed independently by two engineers.

Best regards
Your MCON Managed Service Team

Azure Networking – Multiple regions – Mitigated

Update 2023-01-25 12:26 CET

Summary of Impact: Between 07:05 UTC and 09:45 UTC on 25 January 2023, customers experienced issues with networking connectivity, manifesting as network latency and/or timeouts when attempting to connect to Azure resources in Public Azure regions, as well as other Microsoft services including M365 and PowerBI.

Preliminary Root Cause: We determined that a change made to the Microsoft Wide Area Network (WAN) impacted connectivity between clients on the internet to Azure, connectivity between services within regions, as well as ExpressRoute connections.

Mitigation: We identified a recent change to the WAN as the underlying cause and have rolled back this change. Networking telemetry shows recovery from 09:00 UTC onwards across all regions and services, with the final networking equipment recovering at 09:35 UTC. Most impacted Microsoft services automatically recovered once network connectivity was restored, and we worked to recover the remaining impacted services.

Next Steps: We will follow up in 3 days with a preliminary Post Incident Report (PIR), which will cover the initial root cause and repair items. We’ll follow that up 14 days later with a final PIR where we will share a deep dive into the incident.
You can stay informed about Azure service issues, maintenance events, or advisories by creating custom service health alerts (https://aka.ms/ash-videos for video tutorials and https://aka.ms/ash-alerts for how-to documentation) and you will be notified via your preferred communication channel(s).


***************************************************

Update 2023-01-25 11:48 CET , see also https://azure.status.microsoft/de-de/status

Between 07:05 UTC and 09:45 UTC on 25 January 2023, customers may have experienced issues with networking connectivity, manifesting as network latency and/or timeouts when attempting to connect to Azure resources in Public Azure regions, as well as other Microsoft services including M365 and PowerBI.

We’ve determined the network connectivity issue was occurring with devices across the Microsoft Wide Area Network (WAN). This impacted connectivity between clients on the internet and Azure, connectivity between services in datacenters, and ExpressRoute connections.

Current Status:
We have identified a recent change to the WAN as the underlying cause and have taken steps to roll back this change. Our telemetry shows consistent signs of recovery from 09:45 UTC onwards across multiple regions and services. Most customers should now see full recovery, as WAN networking has recovered fully.

We are working to monitor and ensure full recovery for services that were impacted.

The next update will be in 30 minutes or as soon as we have further information.

This message was last updated at 10:28 UTC on 25 January 2023


***************************************************

Update 2023-01-25 10:37 CET

Starting at 07:05 UTC on 25 January 2023, customers may experience issues with networking connectivity, manifesting as network latency and/or timeouts when attempting to connect to Azure resources in Public Azure regions, as well as other Microsoft services including M365 and PowerBI.

We’ve determined the network connectivity issue is occurring with devices across the Microsoft Wide Area Network (WAN). This impacts connectivity between clients on the internet and Azure, connectivity between services in datacenters, and ExpressRoute connections. The issue is causing impact in waves, peaking approximately every 30 minutes.

We have identified a recent WAN update as the likely underlying cause, and have taken steps to roll back this update. Our latest telemetry shows signs of recovery across multiple regions and services, and we are continuing to actively monitor the situation.

This message was last updated at 09:36 UTC on 25 January 2023


***************************************************
Dear Customer,

please be informed about the following current announcement on azure.status.microsoft:
“Starting at 07:05 UTC on 25 January 2023, customers may experience issues with networking connectivity, manifesting as network latency and/or timeouts when attempting to connect to Azure resources in multiple regions, as well as other Microsoft services.

We are actively investigating and will share updates as soon as more is known.

This message was last updated at 08:53 UTC on 25 January 2023 ”

Best regards
Your MCON Managed Service Team 


Resolved: DC3HAM Network Issue on 19th July 2022

Dear customers,

We are experiencing an outage in our colocation DC3.HAM. We are currently investigating the root cause and will update you as soon as possible.

We apologize for any inconvenience.

Update 21:50: Lumen confirmed a problem in their network and started fixing it. We escalated to Lumen management.

Update Lumen 22:42: As this network fault is impacting multiple clients, the event has increased visibility with Lumen leadership. As such, client trouble tickets associated with this fault have been automatically escalated to higher priority.

Update Lumen 00:20: Further troubleshooting has isolated the trouble to a local provider's network. The local provider has dispatched a field team. Work is underway to obtain an estimated time of arrival.

Update 03:30: Lumen restored connectivity; all systems are reachable again.

DC3HAM: Lumen Network Issue on 12th April 2022

UPDATE 2022-04-12 15:15 CEST

Lumen finally succeeded in reconnecting their datacenter in Hamburg, which hosts our colocation DC3.HAM.
We have since checked and verified all systems.
Systems including our ticket system are back and available.

UPDATE 2022-04-12 14:35 CEST

We are checking and verifying all systems and monitoring.


UPDATE 2022-04-12 14:30 CEST

***************************************************

[CUSTOMER UPDATE] EMEA Service Desk (ScLo)

[SUMMARY OF WORK]

The Lumen NOC advises some services have begun to clear and the local provider continues to repair the remaining damaged fiber cable.

Checking on local network connections.

***************************************************


UPDATE 2022-04-12 11:48 CEST

***************************************************

[CUSTOMER UPDATE] EMEA Service Desk (ScLo)

[SUMMARY OF WORK]

The cause of the service interruption was identified as force majeure in Dortmund, Germany. Fibre maintainers are on site and work is ongoing.

We continue to push for an ETR (estimated time to restore).

***************************************************


UPDATE 2022-04-12 09:56 CEST

Our colocation DC3HAM is still not available. The carrier Lumen is working on the problem.
Unfortunately our ticket system is also affected; we remain reachable by e-mail.
We continue to push for an ETR.


Dear customers,

We are experiencing an outage in our colocation DC3.HAM, apparently caused by our provider Lumen. We are in escalation contact with Lumen about this and will update you as soon as possible.

We apologize for any inconvenience.

DC3HAM Network Outage

Lumen finally succeeded around 01:30 on Saturday in reconnecting their datacenter in Hamburg, which hosts our colocation DC3.HAM.
We have since checked and verified all systems.
Systems including our ticket system are back and available.
We will follow up with Lumen on an incident report.

——————————————————————————————-

Colocation DC3 seems to be back online; we are checking related systems on the MCON side.

——————————————————————————————-

*** CASCADED EXTERNAL NOTES 2022-03-12 00:05:37 GMT From CASE: 23317190 – SM Parent
***************************************************
[CUSTOMER UPDATE] EMEA Service Desk ()

[SUMMARY OF WORK]
Good Morning

We are now seeing the services restored; we have asked the vendor to provide a full RFO (reason for outage).

We will keep you updated with all progress.

Kind Regards
Lumen

[PLAN OF ACTION]
Investigating RFO

[TIME – NOW] 2022-03-12 00:05 (UTC)
***************************************************

UPDATE 2022-03-11 23:56 CET

***************************************************

[CUSTOMER UPDATE] EMEA Service Desk (ScLo)

[SUMMARY OF WORK]

Good Afternoon

The Colt partner has confirmed the completion of splicing; however, Colt customers' services are still down. The partner has been requested to recheck the splicing. We will keep you updated on our progress via this email address. Thank you.

We will continue to push for an ETR.

[PLAN OF ACTION]

[TIME – NOW] 2022-03-11 20:58 (UTC)

[UPDATE ETA]

***************************************************


UPDATE 2022-03-11 19:06 CET

***************************************************

[CUSTOMER UPDATE] EMEA Service Desk (ScLo)

[SUMMARY OF WORK]

Good Afternoon

Please be advised the local provider's last-mile field engineers are continuing their repair efforts.

We continue to push for ETR.

[PLAN OF ACTION]

[TIME – NOW] 2022-03-11 17:58 (UTC)

[UPDATE ETA] 2022-03-11 19:15 (UTC)

***************************************************


UPDATE 2022-03-11 18:10 CET

***************************************************

[CUSTOMER UPDATE] EMEA Service Desk (ScLo)

[SUMMARY OF WORK]

Good Afternoon

Please be advised we are pushing for an ETR from the local provider. They have stated cable repair preparation is ongoing and will update further.

[PLAN OF ACTION]

[TIME – NOW] 2022-03-11 16:40 (UTC)

[UPDATE ETA] 2022-03-11 17:40 (UTC)

***************************************************


UPDATE 2022-03-11 15:59 CET

***************************************************

[CUSTOMER UPDATE] EMEA Service Desk (ScLo)

[SUMMARY OF WORK]

Good Afternoon

Please be advised our local provider's last-mile partner confirmed that the situation is complex, as the damage location is occupied by heavy construction machinery which needs to be cleared before digging work can begin. Civil work is ongoing and an expected time of restoration is awaited.

[PLAN OF ACTION]

We will follow up with further update in next 2 hours.

[TIME – NOW] 2022-03-11 14:42 (UTC)

[UPDATE ETA] 2022-03-11 16:42 (UTC)

***************************************************

Next update by: 2022-03-11 16:45 GMT


UPDATE 2022-03-11 14:54 CET

***************************************************

[CUSTOMER UPDATE] EMEA Service Desk (ScLo)

[SUMMARY OF WORK]

Good Afternoon

Please be advised that we continue to work with the local provider on updates and progress in relation to this case. We have requested an urgent update and confirmation of when service will be restored, as the original ETR provided has now passed. We will aim to provide a further update in the next 60 minutes.

[PLAN OF ACTION]

Await local provider update and update customer once feedback received.

[TIME – NOW] 2022-03-11 13:43 (UTC)

[UPDATE ETA] 2022-03-11 14:43 (UTC)

***************************************************

Next update by: 2022-03-11 14:45 GMT


UPDATE 2022-03-11 14:20 CET

***************************************************

[CUSTOMER UPDATE] EMEA Service Desk (ScLo)

[SUMMARY OF WORK]

Good Afternoon

Please be advised that we can confirm engineers continue to work to restore service. As advised previously, we have been given an ETR of 13:00 GMT, and this still stands at this time; however, damage to the fibre was extensive, so this may be pushed back. We will aim to provide a further update in the next 60 minutes.

[PLAN OF ACTION]

Await local provider update and update customer once feedback received.

[TIME – NOW] 2022-03-11 12:20 (UTC)

[UPDATE ETA] 2022-03-11 13:20 (UTC)

***************************************************

Next update by: 2022-03-11 13:20 GMT


UPDATE 2022-03-11 08:18 CET

***************************************************

[CUSTOMER UPDATE] EMEA Service Desk (ScLo)

[SUMMARY OF WORK]

Good Morning

Please be advised that we can confirm that engineers are onsite and working to restore service. The fibre break is located in Wuppertal, Germany, and the local provider has confirmed that the ETR for completion of the work is 13:00 GMT.

We will aim to provide a further update around 12:00-12:30 GMT to confirm whether we are still on target for the ETR; once we have this confirmation, we will forward it to you.

[PLAN OF ACTION]

Chase local provider around 12:30 to confirm we are still on target for the ETR provided; once confirmed, update customer.

[TIME – NOW] 2022-03-11 09:42 (UTC)

***************************************************

Next update by: 2022-03-11 12:15 GMT


UPDATE 2022-03-11 08:18 CET

***************************************************

[CUSTOMER UPDATE] EMEA Service Desk ()

[SUMMARY OF WORK]

Good Morning

We are still awaiting testing from the vendor. We will keep you updated on all progress.

Kind Regards

Lumen

[PLAN OF ACTION]

Investigating

[TIME – NOW] 2022-03-11 07:15 (UTC)

***************************************************


UPDATE 2022-03-11 07:19 CET

***************************************************

[CUSTOMER UPDATE] EMEA Service Desk (AsHa)

[SUMMARY OF WORK]

Good Afternoon,

Field engineers are actively repairing the fault and we shall update you as soon as information is available.

Kind Regards,

Lumen

[PLAN OF ACTION]

Engage Local Carrier

[TIME – NOW] 2022-03-11 06:17 (UTC)

***************************************************


UPDATE 2022-03-11 03:37 CET

***************************************************

[CUSTOMER UPDATE] EMEA Service Desk (AsHa)

[SUMMARY OF WORK]

Good Afternoon,

Our local carrier has informed us their engineers are expected to arrive at the affected location at 04:30 GMT. We will update you on their findings at that time.

Kind Regards

Lumen

[PLAN OF ACTION]

Engage Local Carrier

[TIME – NOW] 2022-03-11 02:23 (UTC)

***************************************************


UPDATE 2022-03-11 02:39 CET


[CUSTOMER UPDATE] EMEA Service Desk (AsHa)

[SUMMARY OF WORK]

Good Afternoon,

There is an issue in our partner's network on a link between Dortmund and Düsseldorf, Germany. We will inform you accordingly of any information as it becomes available.

Kind Regards

Lumen

[PLAN OF ACTION]

Engage Local Carrier

[TIME – NOW] 2022-03-11 01:36 (UTC)

***************************************************

Next update by: 2022-03-11 02:40 GMT


UPDATE 2022-03-11 02:22 CET
No update from the support of datacenter carrier Lumen.
The escalation level has been raised.


UPDATE 2022-03-11 00:56 CET
A ticket has been raised with the support of datacenter carrier Lumen.

***************************************************

[CUSTOMER UPDATE] EMEA Service Desk (AsHa)

[SUMMARY OF WORK]

Good Afternoon,

Your services are affected by a major outage in our local carrier’s network. We are engaging them and will update you accordingly.

Kind Regards,

Lumen

[PLAN OF ACTION]

Engage Local Carrier

[TIME – NOW] 2022-03-10 23:57 (UTC)

**********************************************

Next update by: 2022-03-11 01:00 GMT


UPDATE 2022-03-10 23:48 CET
We have just identified that the uplinks of our datacenter carrier Lumen are down right now.
We are currently working with their support to find a quick solution.


UPDATE 2022-03-10 23:00 CET
The network is currently not working as expected;
sites and services are therefore not available right now.
We are working with the highest priority to resolve this issue as fast as possible.

Resolved: Outage DC3HAM

Dear Customer,

A trivial change on our external firewall, with a (normal) subsequent sync to the secondary device, caused both firewalls to go into a “disabled”
state and stop forwarding packets.

Deactivating and reactivating the High Availability pairing solved the problem.

The connection interruption lasted from 11:08 to 11:41 CEST.

We are in contact with the developers to find the root cause of this issue.

We apologize for any inconvenience.

Resolved: DC3HAM Storage Outage

RESOLVED

UPDATE 2021-08-28 14:59 CEST
On Friday, 27th August, at 19:30, a redundant storage cluster in our colocation DC3.HAM failed during normal operations.
After onsite analysis we found that the storage had stopped all services due to a suspected split-brain error.
As a result of the storage failure, virtual servers running on VMware could not operate properly.

The repair of the storage cluster was started immediately after analysis and finished around 5 a.m. on 28th August.
After storage recovery, all running virtual servers were restarted and checked; all production systems were up and running again by 09:45 a.m. on 28th August.

We are continuing to analyze the root cause.

UPDATE 2021-08-28 10:25 CEST
Most systems are back; we are working to fix remaining problems, mainly on the QA systems.

UPDATE 2021-08-27 19:30 CEST
The VM storage cluster is currently not working as expected;
sites and services are therefore not available right now.
We are working with the highest priority to resolve this issue as fast as possible.

Resolved: DC3HAM Network Outage

Update 16:26 CEST:

RESOLVED

Commercial power was restored, thus restoring services to a stable state.

We will provide details about the exact root cause once we receive them from the provider Lumen.

Update 16:12 CEST:

Transport NOC reports the main power breakers have been reset and commercial power is restored to the location. The team is working to turn individual breakers on one at a time to restore equipment. Services will begin to restore as each breaker is energized.

Update 14:59 CEST:

Field Operations have arrived on site and determined a commercial power failure to be the cause of impact to services. The local power provider has been engaged to assist with restoration efforts.

Update 14:05 CEST:

Lumen is still working on the issue.

There is a major network event in Frankfurt, Germany, that affects our services on a global scale; depending on the routing, our services in our colocation in Hamburg might not be reachable. The provider Lumen is working on that issue, and we will update this as soon as we know more.