OpenSRS: Reseller Friendly since 1999
 

Closing notes on the Cluster A Email Service Interruption

First off, I’d like to apologize again for the disruption caused by last week’s problems on Cluster A of our email service. Email is a mission-critical service, and we know how awful it is to have your personal and business communications disrupted. We are deeply sorry for any problems that resulted from this interruption.

After working around the clock last week to restore full service to the affected resellers and their end-users on Cluster A, our team took some time today to review what happened during the service degradation.

Last Tuesday, a shelf controller hardware failure meant that 14 disks required a rebuild, which degraded multiple storage volumes. The failure affected 50% of customer mailboxes on OpenSRS Email Service – Cluster A. To resolve the issue, we replaced the shelf controller and rebuilt the 14 disks. The affected devices had to be restored one after another, so the process took a number of days to complete. During the service interruption, we made temporary mail stores available to customers. On Friday, once restoration was complete, all mail content (messages and folders) was merged from the temporary volumes into each user’s original mailbox.

As with any service problem of this magnitude, it is essential that we take steps to make sure it does not happen again. Before the end of the month, we are making storage architecture changes to Cluster A to eliminate the chance of a similar storage event occurring in the future.

Again, let me say that we are incredibly sorry about the impact this undoubtedly had on you and many of your customers.

  • http://pastbedti.me Mathias Stjernström

    Hi!

    I wonder if you can describe those architecture changes you are going to make?

    Cheers!

  • Ken Schafer

    @Mathias – We have had a NetApp cluster failover hardware solution called “head multipath” installed on Cluster B since it was launched several months ago, to protect against hardware failures such as head failure. We were happy with the technology and planned on adding it to Cluster A to enhance the level of redundancy there. We were preparing for the hardware upgrade when one of the heads in Cluster A DID in fact fail. Talk about bad timing!

    Now that the dust has settled on the disruption caused by the head failure, our Operations team is planning to roll out that head multipath solution later this week.

    We’re always doing upgrades and enhancements to the system architecture but this one is most relevant to the service interruption I was talking about.

    By the way, I see you’re using a .me name! That’s the first time I’ve seen one “in the wild”. Well done, sir!
