Posts Tagged ‘Email Service’

Letter to OpenSRS Email Resellers

Dear Customers,

On behalf of all of the members of the OpenSRS team, please accept our sincere and deepest apologies for the service disruption on Cluster A this past weekend.

Many of you have asked, “How could we have let this happen again?” We were initially led to believe that we had a software problem. We have now determined that the string of service problems on Cluster A is related to a hardware problem inside one of our NetApp devices.

Below is a letter of explanation I received from Jeff Goldstein, General Manager at NetApp Canada.

We are not without fault in this situation. Network-attached storage is complex and we trusted our vendor to provide us with accurate advice related to our problems. In hindsight, we should have pressed earlier for replacement hardware.

Please rest assured that we are dedicated to providing a reliable email service and will be working tirelessly to restore your confidence in us. An incident report is available at OpenSRS Status.

Sincerely,
Elliot Noss,
President and CEO, Tucows

Dear Elliot Noss,

I am writing today regarding the recent outage that occurred this past weekend with Cluster A of the OpenSRS Email Service.

As you are aware, Cluster A of the OpenSRS Email Service has experienced a number of service degradations related to issues with our NetApp storage device. Our engineers here at NetApp worked closely with the technical operations and development teams at OpenSRS to trouble-shoot and resolve these issues. In each of the cases, we believed a software fault was the cause.

The intermittent problem turned out to be due to the hardware shelf controller as well as firmware in one of our NetApp storage devices, which caused the issues on Cluster A.

We are deeply sorry for the inconvenience that resulted from these hardware and email service issues.

One of the promises we make to our customers is that our solutions provide highly available data management and in this case we let you down.

To begin to resolve this issue, we’re taking immediate action to replace the hardware and firmware in Cluster A at our expense. Our engineers will then test and evaluate the components involved to determine what specifically went wrong and apply those findings back into our own quality control teams.

Our two companies have been working together for the past nine years. We value our relationship and will work hard to restore your confidence in NetApp and our solutions.

Again, please accept our sincere apologies.

Regards,

Jeff Goldstein
Canadian General Manager
NetApp Canada

The new Spam Settings page for OpenSRS Email Service

As mentioned in the previous posting, we’re in the midst of rolling out a new release of OpenSRS Email Service. The most visible of the changes that will be promoted to the live service next week is the end-user spam management settings page. To help you out, I’ve prepared a short screencast to show you what the end-user experience will be.

You can test it out for yourself in the Production Test Environment (PTE). The new release was promoted to PTE earlier today.

The approach to OpenSRS Email Releases

Earlier today, we promoted the latest version of code for our email service to the Production Test Environment (PTE). If you’re one of our email resellers, you should have already received an email from us letting you know about the release, so you can familiarize yourself with the changes before your users get them a week from now. Complete information about what’s changing in this release can be found on the release notes page.

In addition to feature releases, we’re constantly working to improve performance and reliability. Those releases where there is no end-user or reseller impact, beyond an improved overall experience, usually happen ‘behind the scenes’ and fairly frequently.

While we’re talking about releases, I wanted to take a minute to explain our approach. In general, we have two main goals in our releases:

  1. Address or remove bugs wherever and whenever we can.
  2. Add new features that provide the widest possible benefit across both the end-user and reseller user bases.

For example, in this release:

  • To help both end users and resellers, we added a new settings page that gives users the ability to change how their spam is handled. In particular, POP users can now choose to have their spam tagged and delivered to their Inbox. Then their spam will get downloaded with the rest of their mail, and they’ll no longer have to use Webmail to check the contents of their Spam folder. We have a screencast showing this functionality in another blog post.
  • To help end users, we added the ability to export contacts. Users could previously import contacts from Outlook-format address book files, and now they can export them as well.
  • To help resellers, we’ve added a way to mitigate the effects of phishing attacks targeting their user bases.

Our mission with OpenSRS Email Service is to provide an easy-to-use experience that gives the ‘power user’ enough functionality to keep them happy, while not overwhelming the average user with gizmos and whiz-bang that do nothing to help them read their mail quickly and easily.

Technical Debrief on October Cluster A Email Service Issue

Aj Mirani is Manager of Technical Operations for OpenSRS and is responsible for coordinating the strategic aspects of technical issues, long-term capacity planning and resource allocation for technical projects. He leads the team of Unix administrators and network administrators directly responsible for running the servers, network and storage devices across all platforms. Below he’s answered questions raised by resellers during last week’s service disruption.

Definitions

Linux: a free Unix-type operating system originally created by Linus Torvalds with the assistance of developers around the world. Developed under the GNU General Public License, the source code for Linux is freely available to everyone. OpenSRS uses the Linux operating system on all of our services.

Dovecot: an open source IMAP and POP3 server for Linux/Unix-like operating systems. OpenSRS uses the Dovecot mail server software across our Email platform.

NetApp: a storage and data management provider used by OpenSRS to manage our email service. This technology is used by many major service providers.

Bugs

Can you describe the Linux kernel bug that was found?

In 2.6.19, a patch was introduced to the Linux kernel by developer Neil Brown which added an “optimization” whereby if you have only TCP NFS mounts, the Linux kernel on the client will not listen for UDP NFS lock callback messages, as it was believed that an NFS server would always send lock messages over TCP if all mounts were also TCP.

However well-intentioned, this optimization does not hold true in all cases. For example, NetApp filers will still perform UDP NFS lock callback messages with their clients, even if they are using TCP NFS for all volume mounts.

This is not a bug on the part of NetApp, since there is no specification in the NFS RFCs that TCP mounts necessitate 100% TCP lock messages. Rather, it’s up to the NFS client to be available and listening to all NFS lock messages, whether TCP or UDP.

Moreover, NetApp explicitly chose to use UDP for some lock callback messages as the overhead on short messages of this type is significantly less in UDP than it is in TCP. In short, it scales better to do so for these messages.

The end result is that the NFS clients (IMAP servers in this case) were not able to perform full NFS locking all the time, and that resulted in clobbered writes, leading to corrupted indexes which necessitated full Dovecot index rebuilds.
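To illustrate how a missed lock turns into a clobbered write, here is a minimal Python sketch. It is not kernel or Dovecot code: `SharedIndex`, `update_unlocked` and `update_locked` are hypothetical names, and the in-memory list only stands in for an index file on an NFS volume.

```python
class SharedIndex:
    """Stands in for an index file shared by several NFS clients (hypothetical)."""

    def __init__(self):
        self.entries = []


def update_unlocked(index, new_entries):
    """Each client reads the index before any other client has written."""
    # Without effective locking, all clients see the same starting snapshot.
    snapshots = [list(index.entries) for _ in new_entries]
    for snapshot, entry in zip(snapshots, new_entries):
        snapshot.append(entry)
        # A whole-file write replaces whatever the previous client wrote.
        index.entries = snapshot


def update_locked(index, new_entries):
    """Each client reads only after the previous client has finished writing."""
    for entry in new_entries:
        snapshot = list(index.entries)
        snapshot.append(entry)
        index.entries = snapshot


broken = SharedIndex()
update_unlocked(broken, ["msg-1", "msg-2"])
print(broken.entries)   # ['msg-2']: the first client's entry was lost

healthy = SharedIndex()
update_locked(healthy, ["msg-1", "msg-2"])
print(healthy.entries)  # ['msg-1', 'msg-2']
```

In the unlocked case the index no longer matches the mailbox contents, which is the kind of inconsistency that forces a full rebuild.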

Can you describe the bug that was found in Dovecot?

There were in fact two bugs found in Dovecot, both of which ultimately led to the same end result, that being a bad index file.

First, if a user logs in to their mailbox and Dovecot detects a problem with their index, it will attempt to reindex their messages. Should that user’s connection be closed for any reason, Dovecot will detect this but still continue with the reindexing if it has not yet completed it. This is good because, in terms of resource utilization, the reindex is a very expensive operation and Dovecot doesn’t want to have to do this more than necessary. The bug, ironically, is that once Dovecot went through the effort of reading every message in the user’s mailbox and it came time to actually write the index, a subroutine would detect that the user was no longer connected and abort the final write operation.
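The control flow of this first bug can be sketched in a few lines of Python. This is an illustration of the behaviour described above, not Dovecot's actual code; `rebuild_index` and its parameters are hypothetical.

```python
def rebuild_index(messages, client_connected, abort_if_disconnected):
    """Return the rebuilt index, or None if the final write was abandoned.

    messages: dict mapping a message UID to its raw bytes (hypothetical layout).
    abort_if_disconnected: True models the buggy behaviour, False the patched one.
    """
    # Expensive part: every message in the mailbox is read.
    index = {uid: len(body) for uid, body in messages.items()}

    if abort_if_disconnected and not client_connected:
        # Bug: the scan is complete, but the write is skipped because the
        # client has gone away, so the next login repeats the whole scan.
        return None

    # Patched behaviour: the completed index is kept regardless.
    return index


mailbox = {1: b"hello", 2: b"hi"}
print(rebuild_index(mailbox, client_connected=False, abort_if_disconnected=True))   # None
print(rebuild_index(mailbox, client_connected=False, abort_if_disconnected=False))  # {1: 5, 2: 2}
```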

Second, the virtual size is the size of the message within the POP3 protocol, which can differ from the size of the message on disk. Under some conditions, Dovecot would do the work of rebuilding the table of virtual sizes, but would not ever write it out. The POP3 session would function normally from the user’s point of view, as the virtual size table was now in RAM, but the next time the user logged in, the virtual sizes would have to be reconstructed again, which caused a rescan of all mail in the mailbox.
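One common reason the two sizes differ is line endings: POP3 transmits lines terminated by CRLF, while a message stored on a Unix filesystem often uses a bare LF. The Python sketch below assumes that line endings are the only difference; `virtual_size` is a hypothetical helper, not Dovecot's implementation.

```python
def virtual_size(raw: bytes) -> int:
    """Size of a message once every bare LF is sent as CRLF (assumed model)."""
    # Count LF bytes that are not already preceded by CR.
    bare_lf = raw.count(b"\n") - raw.count(b"\r\n")
    # Each bare LF gains one CR byte on the wire.
    return len(raw) + bare_lf


on_disk = b"Subject: hi\n\nbody\n"
print(len(on_disk))           # 18 bytes on disk
print(virtual_size(on_disk))  # 21 bytes as seen over POP3
```

Computing this value means reading the whole message, which is why losing the cached table forces a rescan of the mailbox.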

Linux Questions

What distro and version of Linux are you using?

We have standardized on the Debian ‘stable’ release as a base with security patches. It has a proven track record with us and we like the rigorous prerequisites which need to be met before packages can be considered for this release. The following article describes the life-cycle of packages within Debian distributions prior to being considered for the ‘stable’ release:

Debian Package Life Cycle (Wikipedia)

During our initial architectural design of the mail platform, there was a lot of heated debate as to our selection of standard OS and hardware. In the end, after looking at all the other options, including Solaris and FreeBSD, we decided on Linux as the best candidate. We still believe this was the best choice not only from a reliability standpoint, but also from a performance perspective.

It sounds like you upgraded your servers to a newer Linux kernel without sufficient testing prior to production deployment. Do you have load-testing capabilities to test changes prior to launch? If so, how/why did this get past that stage? (Greg Youngblood)

We are very cautious around any changes made to the production environment. Even small changes are made with a healthy amount of paranoia. Something as major as a kernel upgrade is not taken lightly and goes through a lot of scrutiny before reaching our production servers.

Even when we consider an upgrade fit for production, we start with limited deployments on non-customer-impacting servers. As the upgrade proves itself reliable, we begin rolling it out to other environments. To give you an idea of how long this can take, this particular kernel spent about 45 days running problem-free across approximately 400 servers in our production environment prior to its final rollout to the Dovecot servers. This was after it spent significant time first in our development and subsequently in our QA environments. We do use NFS extensively across all of these environments. We specifically chose this release because of the major enhancements for NFS that it included.

Unfortunately, even with rigorous testing, the reality is that in any environment, bugs do make it into production sometimes. We have many layers of protection in place to mitigate these. In this particular case, the kernel problem combined with the Dovecot bug quickly pushed our load to literally ten times what we normally experience. We do make sure our systems have plenty of spare capacity and we can handle a lot of extra load. Unfortunately, we cannot handle ten times our regular load, which is what we experienced last week.

Have you considered switching from Linux to Solaris? Even though I’ve been primarily using Linux for 14+ years, I’ve seen places where Solaris’ NFSv4 works better than Linux’s. If you’re a heavy NFS shop, perhaps you should consider it, or at least evaluate it and see how it works out. (Greg Youngblood)

This may be the case for NFSv4 at the moment. We don’t believe that NFSv4 is proven enough for us to use in production for now. From NetApp’s perspective, they strongly recommend using the ‘General Availability’ (GA) version of Ontap (the platform OS) if we are going to implement NFSv4. We are not comfortable using anything other than the ‘General Deployment’ (GD) releases, however, as these are the production-proven versions. Ontap GA is the equivalent of beta for NetApp and only has a very limited production deployment across their customer base. The GD releases are the most widely deployed, most stable versions. The short answer is that we won’t feel NFSv4 is suitable for production for at least another six to twelve months.

Dovecot Questions

Was the Dovecot bug that was fixed related to writing to the index file after the timeout? Or is that issue still there? (Greg Youngblood)

We worked with Timo Sirainen, author and primary maintainer of Dovecot, to patch the Dovecot source. These patches are now deployed across our production environment. Future versions of Dovecot will include these changes as standard.

If you implement bounces (preferably timeouts during delivery so messages stay in queues and don’t get lost assuming you can deliver them within 5-7 days), can you make it adjustable (if possible) with a setting in MAC or reseller interface? I can certainly understand why some would want bounce notifications, but probably not everyone will. I am on several lists that auto unsubscribe you on bounces, so for myself personally I prefer not to have them bounced. (Greg Youngblood)

Once we accept mail into our system, it will not bounce. Our architecture is tiered such that even if the underlying mail servers are unavailable, mail is queued internally until such time as it is delivered. Mail that is rejected bounces only at the perimeter.

What can be done by OpenSRS about secondary MX records? Whilst emails are being queued by the primary MX, I presume that nothing would go to a secondary MX.

Currently the only way mail would go to the secondary MX is if the primary MX does not accept mail. Even through the worst spam attacks we have had sufficient capacity to still accept valid mail, so we have yet to experience a situation in which the primary MX could not accept it.
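The fallback rule a sending server applies can be sketched as follows. This is a simplified Python illustration of standard MX preference ordering, not OpenSRS code; the hostnames and the `pick_mx` helper are hypothetical.

```python
def pick_mx(records, accepts_mail):
    """Return the first host that accepts mail, trying lowest preference first.

    records: iterable of (preference, hostname) pairs.
    accepts_mail: callable returning True if the host accepts the connection.
    """
    for _preference, host in sorted(records):
        if accepts_mail(host):
            return host
    # No host accepted: the sender keeps the message queued and retries later.
    return None


# Hypothetical records for illustration only.
records = [(20, "mx2.example.com"), (10, "mx1.example.com")]

print(pick_mx(records, lambda host: True))                      # mx1.example.com
print(pick_mx(records, lambda host: host != "mx1.example.com")) # mx2.example.com
```

As long as the primary keeps accepting connections, the secondary is never reached, even if delivery to the mailbox behind the primary is delayed.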

System Architecture/Miscellaneous Questions

What type of monitoring system are you using? Why wasn’t this abnormal system behavior caught by your monitoring system in the early stages?

We are using a number of systems to trend and monitor. Primarily we use Nagios for monitoring with the help of numerous custom plugins we developed to provide a more robust testing suite. In Nagios alone we have in excess of 1500 monitoring points across the Email platform. Furthermore, we use Munin to trend additional metrics for long term planning and visibility into scalability trends.

On a micro-scale of, say, a few thousand users, these bugs were individually negligible (and extremely difficult to detect). The user experience would have been normal during this period. It took many days of digging by an assembled team of our top engineers and system administrators working in shifts 24/7, in conjunction with some of the best developers in the Open Source world and Tier 3 NetApp engineers, in order to nail this down.

The cumulative nature of the kernel locking problem combined with the Dovecot reindexing bug, both compounded by millions of users accessing the system, was the confluence of issues that caused the outage.

As a result of this incident we have added a number of additional monitoring points that will add insight and provide an early warning in the future. While it would be almost impossible to detect the specific failure, we’re in a very good position to detect and track things like locks and other NFS/NetApp interactions.
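As a rough idea of what a custom monitoring point for locks could look like, here is a Python sketch that follows the standard Nagios plugin convention of returning 0, 1, 2 or 3 for OK, WARNING, CRITICAL and UNKNOWN. The `check_nfs_locks` function, the thresholds and the way the lock count is obtained are all assumptions for illustration; they are not the plugins OpenSRS developed.

```python
# Standard Nagios plugin return codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def check_nfs_locks(lock_count, warn, crit):
    """Map a lock count to a Nagios status code and a one-line status message.

    lock_count: number of NFS locks currently held, or None if it could not
    be read (how it is collected is left out; that part is hypothetical).
    """
    if lock_count is None:
        return UNKNOWN, "NFS_LOCKS UNKNOWN - could not read lock count"

    # Performance data after the pipe lets a trending tool graph the value.
    perfdata = f"locks={lock_count};{warn};{crit}"
    if lock_count >= crit:
        return CRITICAL, f"NFS_LOCKS CRITICAL - {lock_count} locks held | {perfdata}"
    if lock_count >= warn:
        return WARNING, f"NFS_LOCKS WARNING - {lock_count} locks held | {perfdata}"
    return OK, f"NFS_LOCKS OK - {lock_count} locks held | {perfdata}"


status, message = check_nfs_locks(120, warn=500, crit=1000)
print(status, message)
```

A real plugin would print the message and exit with the status code so that Nagios can act on it.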

Why wasn’t Cluster B affected? (Greg Youngblood)

Cluster B, much like the half of Cluster A which remained online, was in fact affected. Both clusters use identical hardware and software. We were able to resolve the issue before we crossed the cascade threshold on those environments.

Why didn’t you move our mailboxes to Cluster B? (Edward Gore)

The short answer to this is that it would take well over a week to migrate every user from Cluster A to Cluster B. We are not able to disclose the exact amount of space being used by mailboxes, but it is measured in many terabytes. Cluster B resides in a geographically separate data center, so we would be limited to Internet transfer speeds between these two data center providers in order to conduct such a migration.
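To show why a bulk copy between data centers takes so long, here is a back-of-the-envelope calculation in Python. The figures are purely illustrative assumptions (100 TB, a 1 Gbit/s link, 80% usable throughput); the real mailbox volume and link speed are not disclosed.

```python
def transfer_days(terabytes, link_gbit_per_s, efficiency):
    """Days needed to copy a data set over a network link.

    terabytes: data volume in decimal terabytes (assumed figure).
    link_gbit_per_s: nominal link speed in gigabits per second (assumed figure).
    efficiency: fraction of the nominal speed actually achieved, 0 to 1.
    """
    bits_to_move = terabytes * 1e12 * 8
    usable_bits_per_second = link_gbit_per_s * 1e9 * efficiency
    seconds = bits_to_move / usable_bits_per_second
    return seconds / 86_400  # seconds per day


# Illustrative numbers only, not OpenSRS figures.
print(round(transfer_days(100, 1.0, 0.8), 1))  # 11.6
```

The estimate covers raw transfer only; verifying and cutting over each mailbox would add to it.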

Saying ‘40% of clients of Cluster A’ is all well and good, but who is that? Can you not provide information on exactly WHICH mailboxes are affected – via an API would be useful, we could then work with our clients who ARE affected! (Paul O’Hanlon)

Our hashing algorithm that distributes users across the cluster is highly efficient. We have found that it distributes users very evenly on a large scale. It is possible for us to provide lists of users on affected mailstores, although this is unlikely to be implemented as an API feature. In the case of last week’s incident, producing those reports would have pulled our engineers away from resolving the root cause and restoring mailboxes.
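The general idea of hash-based placement, and of listing the users on one affected mailstore, can be sketched in Python as below. The actual OpenSRS algorithm is not described here, so the use of SHA-256, the modulo step, and the names `mailstore_for` and `users_on_store` are all assumptions made for illustration.

```python
import hashlib


def mailstore_for(user, store_count):
    """Map a mailbox name to a mailstore number in a stable, even way (assumed scheme)."""
    digest = hashlib.sha256(user.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % store_count


def users_on_store(users, store, store_count):
    """List the users whose mailbox hashes to the given mailstore."""
    return [u for u in users if mailstore_for(u, store_count) == store]


# Hypothetical user list for illustration only.
users = [f"user{i}@example.com" for i in range(1000)]
per_store = [len(users_on_store(users, s, 10)) for s in range(10)]
print(per_store)       # roughly 100 users per store
print(sum(per_store))  # 1000: every user lands on exactly one store
```

Because placement depends only on the hash, the users on an affected mailstore are spread across all resellers rather than grouped by customer.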

After an outage, I think users are ready to understand they cannot access their old mail for a while (< 48 hours) but they expect to be able to send and receive new mail within 4 hours on a backup system where empty mailboxes would have already been created in advance. Have you considered this idea? (Augustin L)

We have considered this and are still investigating options. There are some considerations which need to be carefully weighed, especially surrounding clients such as Outlook/Thunderbird/MacMail coming out of message UID sync with the backend when all of a sudden historical mail is not available. One possibility is to offer a webmail-only emergency solution, though this is also not entirely ideal. We’ll keep you posted.

Is this reindexing you did the same or similar to what you did in August?

No, this was a file-level reindex of the user’s mailbox. In August we had multiple hardware failures that resulted in hard drives requiring a RAID-level rebuild from parity.

Shouldn’t the system reindex itself?

Yes, it should, and once we put the Dovecot software patches in place the reindexing was successfully completed by the system itself. Cluster B and half of Cluster A were left to reindex naturally, but we chose to take part of Cluster A offline entirely in order to perform a global reindex because of the sheer number of mailboxes which were affected in that specific portion of the cluster. This reduced the time to restore service.

While mailboxes are evenly distributed, user access fluctuates and is not entirely predictable on a large scale. While access load does balance itself out in the grand scheme, there are times when some portions of the cluster are more used than others.

Why wouldn’t you have redundancies in place to avoid this?

There are many layers of redundancy currently in place on both the hardware and software fronts. Redundancy was not the solution in this particular case. We would have needed to have an order of magnitude more capacity/redundancy to be able to ‘weather’ this, and even then it likely would not have been enough. The best way to avoid this type of situation in the future is through early detection and I strongly believe we’re in a very good position to do that now.

Might splitting your architecture into smaller but more manageable systems be an option? (Augustin L)

Our current architecture evolved from that type of environment so we know firsthand the caveats of splitting the system into smaller sub-systems. Splitting up the Clusters further would, in fact, make them less manageable and less able to distribute load. During our time running the old platform we saw a lot of this and it translated to a much more unpredictable user experience. In short, smaller sub-systems do not scale well.

However, we have engineered in the benefits of smaller manageable systems, as evidenced by only a portion of the cluster being affected.

Summary of Recent Cluster A Email Service Issues

I’d like to provide further details about the poor service we provided to many of our resellers on Cluster A of our Email Service last week.

As promised, we’re conducting a detailed post-mortem but I wanted to kick things off by providing you with some high-level analysis of what happened and what action we took.

We have prepared Incident Report #2993 – October 14, 2008 (260K PDF) as the first part of our analysis.

In the coming days, we’ll be addressing some of the deeper issues brought to light by this incident through an even more technical FAQ that is currently in the works.

As our CEO Elliot Noss expressed in his Open Letter, we’re very sorry this happened in the first place and we’re determined to do everything we can to make sure it doesn’t happen again. We want to thank you for the many words of constructive advice you have provided and we can assure you that we’ll be considering every suggestion.

In Elliot’s letter, he mentioned that in addition to dedicating ourselves to reliability, we are committed to taking other elements of our email service to a new level including: monitoring, change management, emergency protocols and procedures. In the coming weeks, we’ll be posting more about our plans. As always, we welcome your feedback.

Comparison of this incident with the August service interruption

Many of you have been asking us why we have had two outages on the same cluster within a period of three months. We wanted to clarify that this was NOT a recurrence of the same issue that caused the service interruption in August. I have published the incident report for the August incident below to allow you to compare, but to summarize briefly:

  1. The August outage was the result of a shelf controller hardware failure. After replacing the defective hardware, we had to rebuild the RAID groups. This process had to be completed sequentially, meaning that we could only bring mailstores back online one volume at a time. After that incident, we made architecture changes to prevent a similar hardware failure from triggering a rebuild. (Incident Report #1991 – August 18, 2008 (344k PDF))
  2. Last week’s degradation in service was caused by two separate issues (one in the underlying Linux kernel and one in the Dovecot mail server software) which caused corruption in the mail server indexes. This led to an abnormally high server load as users trying to connect received timeout messages and then tried to reconnect. The resulting logjam as all login slots were filled led to more timeouts and degraded service for about 40% of users on Cluster A (or about 20% of all Email Service users). It took us longer to diagnose because we had to rule out a hardware problem first. Once hardware had been ruled out, further investigation had to proceed while we were moving mailboxes to new hardware in an attempt to alleviate the high server loads. Once the problems were diagnosed, we were able to work with some of the top contributors from the Linux kernel and the Dovecot mail server open source communities to develop and apply patches as quickly as possible. Unfortunately, the second bug wasn’t discovered until we had completed reindexing the mailboxes after patching the first problem, leading to a longer than anticipated service disruption.

Once again, I’d like to personally apologize for the inconvenience to you, our resellers, and to your customers.

We’ll include more posts on this issue and our efforts to make sure it doesn’t happen again in the upcoming days and weeks.
