Technical Debrief on October Cluster A Email Service Issue
Aj Mirani is Manager of Technical Operations for OpenSRS and is responsible for coordinating the strategic aspects of technical issues, long-term capacity planning and resource allocation for technical projects. He leads the team of Unix administrators and network administrators directly responsible for running the servers, network and storage devices across all platforms. Below he’s answered questions raised by resellers during last week’s service disruption.
Definitions
Linux: a free Unix-type operating system originally created by Linus Torvalds with the assistance of developers around the world. Developed under the GNU General Public License, the source code for Linux is freely available to everyone. OpenSRS uses the Linux operating system on all of our services.
Dovecot: an open source IMAP and POP3 server for Linux/Unix-like operating systems. OpenSRS uses the Dovecot mail server software across our Email platform.
NetApp: a storage and data management provider used by OpenSRS to manage our email service. This technology is used by many major service providers.
Bugs
Can you describe the Linux kernel bug that was found?
In kernel 2.6.19, developer Neil Brown introduced a patch that added an “optimization”: if a client has only TCP NFS mounts, its Linux kernel will not listen for UDP NFS lock callback messages. The assumption was that an NFS server would always send lock messages over TCP if all mounts were also TCP.
However well-intentioned, this optimization does not hold true in all cases. For example, NetApp filers will still send UDP NFS lock callback messages to their clients, even if those clients are using TCP NFS for all volume mounts.
This is not a bug on the part of NetApp, since nothing in the NFS RFCs specifies that TCP mounts require 100% TCP lock messages. Rather, it’s up to the NFS client to be available and listening for all NFS lock messages, whether TCP or UDP.
Moreover, NetApp explicitly chose to use UDP for some lock callback messages because the overhead on short messages of this type is significantly lower in UDP than in TCP. In short, it scales better to do so for these messages.
The end result is that the NFS clients (IMAP servers in this case) were not able to perform full NFS locking all the time, and that resulted in clobbered writes, leading to corrupted indexes which necessitated full Dovecot index rebuilds.
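The failure mode can be illustrated with a toy model. This is only a sketch: the class and function names (`NfsClient`, `server_grants_lock`) and the lock name are invented for illustration and do not correspond to kernel or NetApp code. It shows a server sending a lock callback over UDP to a client that listens only on TCP, so the callback is never seen.

```python
from dataclasses import dataclass, field


@dataclass
class NfsClient:
    """Toy NFS client: it only sees callbacks on transports it listens to."""
    listening: set                      # e.g. {"tcp"} or {"tcp", "udp"}
    granted: list = field(default_factory=list)

    def receive_callback(self, transport: str, lock_id: str) -> bool:
        if transport not in self.listening:
            return False                # nobody is listening; the message is lost
        self.granted.append(lock_id)
        return True


def server_grants_lock(client: NfsClient, lock_id: str, transport: str) -> bool:
    """Toy file server telling a client that its lock request was granted."""
    return client.receive_callback(transport, lock_id)


# Behaviour described above: TCP-only mounts, so no UDP listener.
optimized = NfsClient(listening={"tcp"})
# A client that listens on both transports regardless of mount type.
tolerant = NfsClient(listening={"tcp", "udp"})

for client in (optimized, tolerant):
    server_grants_lock(client, "index.lock", transport="udp")

print(optimized.granted)  # []
print(tolerant.granted)   # ['index.lock']
```

In this model the first client never learns that it holds the lock, which corresponds to the incomplete locking described above.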
Can you describe the bug that was found in Dovecot?
There were in fact two bugs found in Dovecot, both of which ultimately led to the same end result, that being a bad index file.
First, if a user logs in to their mailbox and Dovecot detects a problem with their index, it will attempt to reindex their messages. Should that user’s connection be closed for any reason, Dovecot will detect this but still continue with the reindexing if it has not yet completed it. This is good because, in terms of resource utilization, the reindex is a very expensive operation and Dovecot doesn’t want to have to do this more than necessary. The bug, ironically, is that once Dovecot went through the effort of reading every message in the user’s mailbox and it came time to actually write the index, a subroutine would detect that the user was no longer connected and abort the final write operation.
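A minimal sketch of this first bug follows. It is written in Python purely for illustration and is not Dovecot’s code: `rebuild_index`, `client_connected` and the `write_requires_connection` flag are hypothetical names. The expensive scan always runs, and only the final write is gated on the connection state.

```python
def rebuild_index(messages, client_connected, write_requires_connection):
    """Return the index that ends up on disk, or None if nothing was written."""
    # Expensive part: every message is read whether or not the user is still connected.
    index = {uid: len(body) for uid, body in messages.items()}

    # Final step: persist the result.
    if write_requires_connection and not client_connected():
        return None   # buggy path: all of the work above is discarded
    return index      # fixed path: the write no longer depends on the connection


messages = {1: b"hello", 2: b"world!!"}


def disconnected():
    return False      # the user dropped the connection mid-rebuild


buggy = rebuild_index(messages, disconnected, write_requires_connection=True)
fixed = rebuild_index(messages, disconnected, write_requires_connection=False)

print(buggy)  # None
print(fixed)  # {1: 5, 2: 7}
```

With the buggy path, the next login finds the same bad index and repeats the whole scan.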
The second bug concerned virtual sizes. The virtual size is the size of a message as seen within the POP3 protocol, which can differ from the size of the message on disk. Under some conditions, Dovecot would do the work of rebuilding the table of virtual sizes but would never write it out. The POP3 session would function normally from the user’s point of view, as the virtual size table was now in RAM, but the next time the user logged in, the virtual sizes would have to be reconstructed again, which caused a rescan of all mail in the mailbox.
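To make the difference between the two sizes concrete, here is a small sketch. It assumes the difference comes from line endings (LF on disk, CRLF on the wire), which the text above does not state; the function names are illustrative only.

```python
def physical_size(raw: bytes) -> int:
    """Size of the message as stored on disk."""
    return len(raw)


def virtual_size(raw: bytes) -> int:
    """Size of the message if every line ending were sent as CRLF."""
    bare_lf = sum(
        1
        for i, byte in enumerate(raw)
        if byte == 0x0A and (i == 0 or raw[i - 1] != 0x0D)
    )
    # Each bare LF gains one CR byte on the wire.
    return len(raw) + bare_lf


message = b"a\nb\r\nc\n"
print(physical_size(message))  # 7
print(virtual_size(message))   # 9
```

Computing this requires reading the whole message, which is why losing the cached table forces a rescan of the mailbox.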
Linux Questions
What distro and version of Linux are you using?
We have standardized on the Debian ‘stable’ release as a base with security patches. It has a proven track record with us and we like the rigorous prerequisites which need to be met before packages can be considered for this release. The following article describes the life-cycle of packages within Debian distributions prior to being considered for the ‘stable’ release:
Debian Package Life Cycle (Wikipedia)
During our initial architectural design of the mail platform, there was a lot of heated debate as to our selection of standard OS and hardware. In the end, after looking at all the other options, including Solaris and FreeBSD, we decided on Linux as the best candidate. We still believe this was the best choice not only from a reliability standpoint, but also from a performance perspective.
It sounds like you upgraded your servers to a newer Linux kernel without sufficient testing prior to production deployment. Do you have load-testing capabilities to test changes prior to launch? If so, how/why did this get past that stage? (Greg Youngblood)
We are very cautious about any change made to the production environment. Even small changes are made with a healthy amount of paranoia. Something as major as a kernel upgrade is not taken lightly and goes through a lot of scrutiny before reaching our production servers. Even when we consider an upgrade fit for production, we start with limited deployments on non-customer-impacting servers. As the upgrade proves itself reliable, we begin rolling it out to other environments.
To give you an idea of how long this can take, this particular kernel spent about 45 days running problem-free across approximately 400 servers in our production environment prior to its final rollout to the Dovecot servers. This was after it spent significant time first in our development and subsequently in our QA environments. We use NFS extensively across all of these environments, and we specifically chose this release because of the major NFS enhancements it included.
Unfortunately, even with rigorous testing, the reality is that in any environment, bugs do sometimes make it into production. We have many layers of protection in place to mitigate these. In this particular case, the kernel problem combined with the Dovecot bug quickly pushed our load to literally ten times what we normally experience. We make sure our systems have plenty of spare capacity and can handle a lot of extra load, but not the tenfold increase we experienced last week.
Have you considered switching from Linux to Solaris? Even though I’ve been primarily using Linux for 14+ years, I’ve seen places where Solaris’ NFSv4 works better than Linux’s. If you’re a heavy NFS shop, perhaps you should consider it, or at least evaluate it and see how it works out. (Greg Youngblood)
This may be the case for NFSv4 at the moment, but we don’t believe that NFSv4 is proven enough for us to use in production for now. NetApp strongly recommends using the ‘General Availability’ (GA) version of ONTAP (the platform OS) if we are going to implement NFSv4. However, we are not comfortable using anything other than the ‘General Deployment’ (GD) releases, as these are the production-proven versions. ONTAP GA is the equivalent of beta for NetApp and has only a very limited production deployment across their customer base. The GD releases are the most widely deployed, most stable versions. The short answer is that we won’t feel NFSv4 is suitable for production for at least another six to twelve months.
Dovecot Questions
Was the Dovecot bug that was fixed related to writing to the index file after the timeout? Or is that issue still there? (Greg Youngblood)
We worked with Timo Sirainen, author and primary maintainer of Dovecot, to patch the Dovecot source. These patches are currently in production across our production environment. Future versions of Dovecot will include these changes as standard.
If you implement bounces (preferably timeouts during delivery so messages stay in queues and don’t get lost assuming you can deliver them within 5-7 days), can you make it adjustable (if possible) with a setting in MAC or reseller interface? I can certainly understand why some would want bounce notifications, but probably not everyone will. I am on several lists that auto unsubscribe you on bounces, so for myself personally I prefer not to have them bounced. (Greg Youngblood)
Once we accept mail into our system, it will not bounce. Our architecture is tiered such that even if the underlying mail servers are unavailable, mail is queued internally until it can be delivered. Mail that is rejected bounces only at the perimeter.
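The accept-then-queue behaviour can be sketched roughly as follows. This is an illustrative model only, not our actual implementation: the `Perimeter` class, its callbacks and the sample addresses are all hypothetical.

```python
from collections import deque


class Perimeter:
    """Toy perimeter tier: reject up front, otherwise queue until delivered."""

    def __init__(self, deliver, is_acceptable):
        self.deliver = deliver              # callable; returns True on successful delivery
        self.is_acceptable = is_acceptable  # callable; perimeter policy check
        self.queue = deque()

    def receive(self, message) -> str:
        if not self.is_acceptable(message):
            return "rejected"               # refused at the perimeter; never enters the system
        self.queue.append(message)
        return "accepted"

    def flush(self) -> int:
        """Try to deliver queued mail in order; stop at the first failure."""
        delivered = 0
        while self.queue:
            if not self.deliver(self.queue[0]):
                break                       # backend unavailable; keep the mail for a retry
            self.queue.popleft()
            delivered += 1
        return delivered


backend = {"up": False}
mx = Perimeter(
    deliver=lambda msg: backend["up"],
    is_acceptable=lambda msg: "@" in msg["to"],
)

statuses = [mx.receive({"to": "user@example.com"}), mx.receive({"to": "bogus"})]
stuck = mx.flush()        # backend down: nothing delivered, nothing lost
backend["up"] = True
delivered = mx.flush()    # backend back: the queued message goes through

print(statuses, stuck, delivered)  # ['accepted', 'rejected'] 0 1
```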
What can be done by OpenSRS about secondary MX records? Whilst emails are being queued by the primary MX, I presume that nothing would go to a secondary MX.
Currently the only way mail would go to the secondary MX is if the primary MX does not accept mail. Even through the worst spam attacks, we have had sufficient capacity to keep accepting valid mail, so we have yet to experience a situation in which the primary MX stopped accepting it.
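For readers less familiar with MX fallback, the sending side’s logic looks roughly like this. The sketch reflects standard SMTP behaviour (lowest preference value first); the host names and the `accepts_mail` callback are placeholders, not our real records.

```python
def choose_mx(mx_records, accepts_mail):
    """Pick the first MX host, in preference order, that accepts mail.

    mx_records is a list of (preference, host) pairs; lower values are tried first.
    """
    for _, host in sorted(mx_records):
        if accepts_mail(host):
            return host
    return None   # no MX accepted; the sender keeps the message and retries later


records = [(20, "mx2.example.com"), (10, "mx1.example.com")]

print(choose_mx(records, lambda host: True))                       # mx1.example.com
print(choose_mx(records, lambda host: host != "mx1.example.com"))  # mx2.example.com
```

As long as the primary keeps accepting connections, even while mail is queued internally, the secondary is never consulted.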
System Architecture/Miscellaneous Questions
What type of monitoring system are you using? Why wasn’t this abnormal system behavior caught by your monitoring system in the early stages?
We are using a number of systems to trend and monitor. Primarily we use Nagios for monitoring with the help of numerous custom plugins we developed to provide a more robust testing suite. In Nagios alone we have in excess of 1500 monitoring points across the Email platform. Furthermore, we use Munin to trend additional metrics for long term planning and visibility into scalability trends.
On a micro-scale of, say, a few thousand users, these bugs were individually negligible (and extremely difficult to detect). The user experience would have been normal during this period. It took many days of digging by an assembled team of our top engineers and system administrators working in shifts 24/7, in conjunction with some of the best developers in the Open Source world and Tier 3 NetApp engineers, to nail this down.
The cumulative nature of the kernel locking problem combined with the Dovecot reindexing bug, both compounded by millions of users accessing the system, was the confluence of issues that caused the outage.
As a result of this incident we have added a number of monitoring points that will give us more insight and provide an early warning in the future. While it would be almost impossible to detect the specific failure, we’re in a very good position to detect and track things like locks and other NFS/NetApp interactions.
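As an illustration of what such a monitoring point might look like, here is a sketch of the threshold logic behind a Nagios-style check. The return codes (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN) follow the standard Nagios plugin convention; the metric name `NFS_LOCK_WAITS` and the thresholds are hypothetical and are not our actual checks.

```python
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3
LABELS = {OK: "OK", WARNING: "WARNING", CRITICAL: "CRITICAL", UNKNOWN: "UNKNOWN"}


def check_threshold(value, warn, crit):
    """Map a sampled metric to a Nagios plugin return code."""
    if value is None:
        return UNKNOWN        # the metric could not be collected
    if value >= crit:
        return CRITICAL
    if value >= warn:
        return WARNING
    return OK


def plugin_output(name, value, warn, crit):
    """Return (exit_code, status_line) as a plugin would report them."""
    status = check_threshold(value, warn, crit)
    return status, f"{name} {LABELS[status]} - value={value}"


print(plugin_output("NFS_LOCK_WAITS", 5, warn=10, crit=50))
# (0, 'NFS_LOCK_WAITS OK - value=5')
print(plugin_output("NFS_LOCK_WAITS", 75, warn=10, crit=50))
# (2, 'NFS_LOCK_WAITS CRITICAL - value=75')
```

A real plugin would print the status line and exit with the returned code; how the metric itself is sampled is left out here.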
Why wasn’t Cluster B affected? (Greg Youngblood)
Cluster B, much like the half of Cluster A which remained online, was in fact affected. Both Clusters use identical hardware and software. We were able to resolve the issue before we crossed the cascade threshold in those environments.
Why didn’t you move our mailboxes to Cluster B? (Edward Gore)
The short answer to this is that it would take well over a week to migrate every user from Cluster A to Cluster B. We are not able to disclose the exact amount of space being used by mailboxes, but it is measured in many terabytes. Cluster B resides in a separate data center in a different geographic location, so such a migration would be limited to Internet transfer speeds between the two data center providers.
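The arithmetic behind an estimate like this is simple. The sketch below uses purely hypothetical inputs (100 TB, a 1 Gbit/s link, 70% usable throughput); they are not our real figures, which we are not disclosing.

```python
def transfer_days(terabytes: float, link_gbps: float, efficiency: float = 1.0) -> float:
    """Days needed to move `terabytes` (10**12 bytes each) over a link.

    `efficiency` is the fraction of the nominal link speed actually achieved.
    """
    bits = terabytes * 1e12 * 8
    seconds = bits / (link_gbps * 1e9 * efficiency)
    return seconds / 86_400


# Hypothetical example values, for illustration only.
print(round(transfer_days(100, 1.0, 0.7), 1))  # 13.2
```

This counts raw transfer time only; verifying the copied mailboxes and cutting users over would add to it.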
Saying ’40% of clients of Cluster A’ is all well and good, but who is that? Can you not provide information on exactly WHICH mailboxes are affected – via an API would be useful, we could then work with our clients who ARE affected! (Paul O’Hanlon)
Our hashing algorithm that distributes users across the cluster is highly efficient. We have found that it very evenly distributes users on a large scale. It is possible for us to provide lists of users on affected mailstores although it is unlikely it will be a feature that will be implemented into the API. In the case of last week’s incident, producing those reports would have pulled our engineers away from resolving the root cause and restoring mailboxes.
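To show the general idea of hash-based placement and how a list of affected users could be derived from it, here is a generic sketch. It is not our actual algorithm: the use of SHA-256, the store count of 10 and the function names are all assumptions made for illustration.

```python
import hashlib
from collections import Counter


def mailstore_for(user: str, store_count: int) -> int:
    """Map a user to a mailstore number using a stable hash of the address."""
    digest = hashlib.sha256(user.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % store_count


def users_on_stores(users, affected, store_count):
    """Return the users whose mailstore is in the `affected` set."""
    return [u for u in users if mailstore_for(u, store_count) in affected]


users = [f"user{i}@example.com" for i in range(10_000)]
counts = Counter(mailstore_for(u, 10) for u in users)

# Each of the 10 stores receives roughly a tenth of the users.
print(sorted(counts.items()))

# Users on four hypothetical affected stores.
affected_users = users_on_stores(users, {0, 1, 2, 3}, 10)
print(len(affected_users))
```

Because placement depends only on the hash, the users on an affected store are spread across every reseller rather than grouped by account.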
After an outage, I think users are ready to understand they cannot access their old mail for a while (< 48 hours) but they expect to be able to send and receive new mail within 4 hours on a backup system where empty mailboxes would have already been created in advance. Have you considered this idea? (Augustin L)
We have considered this and are still investigating options. There are some considerations which need to be carefully weighed, especially around clients such as Outlook/Thunderbird/MacMail falling out of message UID sync with the backend when, all of a sudden, historical mail is not available. One possibility is to offer a webmail-only emergency solution, though this is also not entirely ideal. We’ll keep you posted.
Is this reindexing you did the same or similar to what you did in August?
No, this was a file-level reindex of the user’s mailbox. In August we had multiple hardware failures that resulted in hard drives requiring a RAID-level rebuild from parity.
Shouldn’t the system reindex itself?
Yes, it should, and once we put the Dovecot software patches in place, the system successfully completed the reindexing itself. Cluster B and half of Cluster A were left to reindex naturally, but we chose to take part of Cluster A offline entirely to perform a global reindex because of the sheer number of mailboxes affected in that portion of the Cluster. This reduced the time to restore service.
While mailboxes are evenly distributed, user access fluctuates and is not entirely predictable on a large scale. While access load does balance itself out in the grand scheme, there are times when some portions of the cluster are more used than others.
Why wouldn’t you have redundancies in place to avoid this?
There are many layers of redundancy currently in place on both the hardware and software fronts. Redundancy was not the solution in this particular case. We would have needed to have an order of magnitude more capacity/redundancy to be able to ‘weather’ this, and even then it likely would not have been enough. The best way to avoid this type of situation in the future is through early detection and I strongly believe we’re in a very good position to do that now.
Might splitting your architecture into smaller but more manageable systems be an option? (Augustin L)
Our current architecture evolved from that type of environment so we know firsthand the caveats of splitting the system into smaller sub-systems. Splitting up the Clusters further would, in fact, make them less manageable and less able to distribute load. During our time running the old platform we saw a lot of this and it translated to a much more unpredictable user experience. In short, smaller sub-systems do not scale well.
However, we have engineered in the benefits of smaller manageable systems, as evidenced by only a portion of the cluster being affected.
