Closing notes on the Cluster A Email Service Interruption
First off, I’d like to apologize again for the problems that resulted from last week’s interruption on Cluster A of our email service. Email is a mission-critical service, and we know how awful it is to have your personal and business communications disrupted. We are deeply sorry.
After working around the clock last week to restore full service to our impacted resellers and their end-users on Cluster A, our team took some time today to review what happened during the service degradation.
Last Tuesday, a shelf controller hardware failure meant that 14 disks required a rebuild. This degraded multiple storage volumes and affected 50% of customer mailboxes on OpenSRS Email Service – Cluster A. To resolve the issue, we replaced the shelf controller and rebuilt the 14 disks. The affected devices had to be restored one after another, so the restoration took a number of days to complete. During the service interruption, we made temporary mail stores available to customers. On Friday, once restoration was complete, all mail content (messages and folders) was merged from the temporary volumes into each user’s original mailbox.
As with any service problem of this magnitude, it is essential that we take steps to make sure it does not happen again. Before the end of the month, we are making storage architecture changes to Cluster A to eliminate the chance of a similar storage event occurring in the future.
Again, let me say that we are incredibly sorry about the impact this undoubtedly had on you and many of your customers.
Ken Schafer




