I’d like to provide you with an update on an email outage we are experiencing on Cluster A of our Email Service. Resellers on Cluster B are not impacted by this.
First off, let me say that we are incredibly sorry about the impact this is undoubtedly having on our resellers and many of their customers. This shouldn’t have happened, and while our focus right now is on letting the team get the system back online, we will be looking very closely at how this happened to ensure it doesn’t happen again.
Here’s what we know at this time:
About eight hours ago we suffered a major hardware failure in a NetApp file storage system that is an integral part of our Email Service. The system is, of course, built with extensive redundancy, but this failure came at the worst possible time. A hardware issue affecting redundancy had been identified earlier and was corrected on Cluster B last week; the fix for Cluster A was awaiting the arrival of the needed hardware. Today, however, a separate hardware issue arose when a disk shelf controller in one of our NetApps failed. It is the combination of these two problems that put us in this situation.
We have now replaced the faulty hardware, and we are close to having the NetApp back online and in service.
Unfortunately, about half of the mailboxes on Cluster A will remain largely offline while we rebuild the parts of the system that were impacted. Without drilling down too far into the overall system architecture: each cluster contains a number of “volumes” that store and manage mailboxes. Three of those volumes now need to be rebuilt, in sequence. As each volume is fully restored, we will bring it back online.
The entire Operations Team and our Network Operations Center team are working with developers throughout the evening and will continue to work on this until it is fully resolved. Our CEO Elliot and I are in regular contact with all of these teams, as well as with our communications and support teams, but, to be honest, we’re trying to stay out of their hair and let them do their jobs.
The big unknown right now is how long it will take to restore each of these volumes within Cluster A. Until the rebuilding process has run for another 6 to 8 hours, we are not confident that any estimate we give will be meaningful.
Right now our focus (in order) is a) rebuilding the volumes and putting them online as soon as we can, b) looking for alternative workarounds that may reduce either the scope or length of the outage, and c) creating and refining our estimates on when full service will be restored.
It’s worth noting that the distribution of mailboxes across the volumes is largely random, so most resellers on Cluster A will find that some, but not all, of their customers are impacted.
Once we have the service fully restored we will start the process of investigating root causes and determining how we can make sure this doesn’t happen again.
Once again, on behalf of everyone here, I want to say how deeply sorry we are that this has happened, and we truly appreciate your patience and understanding as we work to resolve the issue at hand.
The best place to get updates on our progress is the OpenSRS Status page. I’ll post further updates here if we need to share more than we normally do in a Status Update.
VP, Product Management & Marketing