OpenSRS: Reseller Friendly since 1999
 

Archive for: October, 2008

OpenSRS and Open Source

Rick Yazwinski is Principal Engineer at OpenSRS. In this role, he is in charge of all technology decisions and strategy for the company.

Here at Tucows we’re strong supporters of the open source movement and community. We have been standardized on Debian Linux for years. We develop in the LAMP stack (Linux, Apache, MySQL, Perl), with some Ruby, Python, and Java thrown in for color. We use open source components in our email platform along with homegrown components.

Why?

Experience has brought us here. Let me explain.

We’ve been down the proprietary software path. Our experience, like that of many others, has been mixed: nothing hurts more than having a vendor shrug off a critical problem saying they can’t reproduce it, point a finger at “the network” or some other component, or tell you that before they’ll help you, you have to upgrade your server farm to the newest minor release because “it may fix the problem”.

Open source projects that have been around for a while just plain work, and work well. Further, when they don’t work well, the depth of understanding “out there” is huge AND when required you can get into the code and find any issues and fix or extend the code base. Unlike proprietary software packages, there are many courses of action you can take when you have a problem.

We aren’t alone in our belief in open source. So much of the Internet is built on open source software that it’s hard to contemplate running an Internet property or service without relying on it. Facebook, Google, Yahoo and Flickr, to name just a few, all support open source communities and believe in open source products.

In light of our recent issue, we remain committed to open source. It has served us well for many years and will continue to do so. We are now actually closer to many of the best sources for some parts of the stack, putting us in a better position for the future in terms of getting information and recommendations from some of the most knowledgeable people in the world.

Technical Debrief on October Cluster A Email Service Issue

Aj Mirani is Manager of Technical Operations for OpenSRS and is responsible for coordinating the strategic aspects of technical issues, long-term capacity planning and resource allocation for technical projects. He leads the team of Unix administrators and network administrators directly responsible for running the servers, network and storage devices across all platforms. Below he’s answered questions raised by resellers during last week’s service disruption.

Definitions

Linux: a free Unix-type operating system originally created by Linus Torvalds with the assistance of developers around the world. Developed under the GNU General Public License, the source code for Linux is freely available to everyone. OpenSRS uses the Linux operating system on all of our services.

Dovecot: an open source IMAP and POP3 server for Linux/Unix-like operating systems. OpenSRS uses the Dovecot mail server software across our Email platform.

NetApp: a storage and data management provider used by OpenSRS to manage our email service. This technology is used by many major service providers.

Bugs

Can you describe the Linux kernel bug that was found?

In 2.6.19, a patch was introduced to the Linux kernel by developer Neil Brown which added an “optimization” whereby if you have only TCP NFS mounts, the Linux kernel on the client will not listen for UDP NFS lock callback messages, as it was believed that an NFS server would always send lock messages over TCP if all mounts were also TCP.

However well-intentioned, this optimization does not hold true in all cases. For example, NetApp filers will still perform UDP NFS lock callback messages with their clients, even if they are using TCP NFS for all volume mounts.

This is not a bug on the part of NetApp, since there is no specification in the NFS RFCs that TCP mounts necessitate 100% TCP lock messages. Rather, it’s up to the NFS client to be available and listening to all NFS lock messages, whether TCP or UDP.

Moreover, NetApp explicitly chose to use UDP for some lock callback messages as the overhead on short messages of this type is significantly less in UDP than it is in TCP. In short, it scales better to do so for these messages.

The end result is that the NFS clients (IMAP servers in this case) were not able to perform full NFS locking all the time, and that resulted in clobbered writes, leading to corrupted indexes which necessitated full Dovecot index rebuilds.
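To make "clobbered writes" concrete, here is a minimal Python sketch of the simplest form of the problem: two writers that each read a shared index, add an entry, and write it back. It is not Dovecot, kernel, or NFS code; the function names and the list-based "index" are invented for illustration, and the interleaving is written out by hand so the result is deterministic.

```python
# Illustrative sketch only: models a read-modify-write race on a shared
# index. Nothing here is Dovecot or NFS code; all names are hypothetical.

def append_without_lock(shared, entries):
    """Two writers read the same snapshot, then both write back."""
    snapshot_a = list(shared["index"])   # writer A reads
    snapshot_b = list(shared["index"])   # writer B reads before A writes
    snapshot_a.append(entries[0])
    snapshot_b.append(entries[1])
    shared["index"] = snapshot_a         # A writes
    shared["index"] = snapshot_b         # B overwrites A's update
    return shared["index"]

def append_with_lock(shared, entries):
    """With a working lock, each writer's read-modify-write is exclusive."""
    for entry in entries:
        current = list(shared["index"])  # read while holding the lock
        current.append(entry)
        shared["index"] = current        # write before releasing the lock
    return shared["index"]

unlocked = append_without_lock({"index": ["msg1"]}, ["msg2", "msg3"])
locked = append_with_lock({"index": ["msg1"]}, ["msg2", "msg3"])
print(unlocked)  # ['msg1', 'msg3']  (msg2 was clobbered)
print(locked)    # ['msg1', 'msg2', 'msg3']
```

A real index file is a binary structure rather than a list, so an overwritten update can leave it inconsistent rather than merely short one entry, which is why a full rebuild was needed.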

Can you describe the bug that was found in Dovecot?

There were in fact two bugs found in Dovecot, both of which ultimately led to the same end result, that being a bad index file.

First, if a user logs in to their mailbox and Dovecot detects a problem with their index, it will attempt to reindex their messages. Should that user’s connection be closed for any reason, Dovecot will detect this but still continue with the reindexing if it has not yet completed it. This is good because, in terms of resource utilization, the reindex is a very expensive operation and Dovecot doesn’t want to have to do this more than necessary. The bug, ironically, is that once Dovecot went through the effort of reading every message in the user’s mailbox and it came time to actually write the index, a subroutine would detect that the user was no longer connected and abort the final write operation.
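The shape of this first bug can be sketched in a few lines of Python. This is a hypothetical model, not Dovecot's actual code: `rebuild_index`, its parameters, and the dictionary used as an "index" are all assumptions made for illustration. The point is that the expensive scan always runs, but in the buggy path the result is thrown away.

```python
# Hypothetical model of the first bug; not Dovecot's actual code.

def rebuild_index(messages, client_connected, abort_write_if_disconnected):
    """Scan every message, then persist the index.

    Returns (index_on_disk, messages_scanned).
    """
    in_memory = {}
    scanned = 0
    for uid, body in messages.items():   # the expensive part: read all mail
        in_memory[uid] = len(body)
        scanned += 1
    if abort_write_if_disconnected and not client_connected:
        return None, scanned             # buggy path: work done, nothing saved
    return in_memory, scanned            # fixed path: index is written

mailbox = {1: b"hello", 2: b"world!!"}
buggy = rebuild_index(mailbox, client_connected=False,
                      abort_write_if_disconnected=True)
fixed = rebuild_index(mailbox, client_connected=False,
                      abort_write_if_disconnected=False)
print(buggy)  # (None, 2)
print(fixed)  # ({1: 5, 2: 7}, 2)
```

In both cases every message is scanned; only the fixed path leaves an index behind, so only the fixed path avoids repeating the scan at the next login.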

Second, the virtual size is the size of the message within the POP3 protocol, which can differ from the size of the message on disk. Under some conditions, Dovecot would do the work of rebuilding the table of virtual sizes, but would not ever write it out. The POP3 session would function normally from the user’s point of view, as the virtual size table was now in RAM, but the next time the user logged in, the virtual sizes would have to be reconstructed again, which caused a rescan of all mail in the mailbox.
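One common reason the two sizes differ is line endings: mail is sent over POP3 with CRLF line endings, while Unix mail stores often keep messages with bare LF. The Python sketch below computes a virtual size under that assumption. The function name is hypothetical and this is not Dovecot's implementation.

```python
def virtual_size(stored: bytes) -> int:
    """Size of a message as sent over POP3, where every line ends in CRLF.

    Assumes the only difference from the on-disk form is line endings:
    each bare LF gains one byte (the added CR) on the wire.
    """
    bare_lf = stored.count(b"\n") - stored.count(b"\r\n")
    return len(stored) + bare_lf

on_disk = b"Subject: hi\n\nline one\nline two\n"
print(len(on_disk), virtual_size(on_disk))  # 31 35
```

Because computing this means reading the whole message, the result is worth caching; the second bug meant the cached table was rebuilt in memory but never saved.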

Linux Questions

What distro and version of Linux are you using?

We have standardized on the Debian ‘stable’ release as a base with security patches. It has a proven track record with us and we like the rigorous prerequisites which need to be met before packages can be considered for this release. The following article describes the life-cycle of packages within Debian distributions prior to being considered for the ‘stable’ release:

Debian Package Life Cycle (Wikipedia)

During our initial architectural design of the mail platform, there was a lot of heated debate as to our selection of standard OS and hardware. In the end, after looking at all the other options, including Solaris and FreeBSD, we decided on Linux as the best candidate. We still believe this was the best choice not only from a reliability standpoint, but also from a performance perspective.

It sounds like you upgraded your servers to a newer Linux kernel without sufficient testing prior to production deployment. Do you have load-testing capabilities to test changes prior to launch? If so, how/why did this get past that stage? (Greg Youngblood)

We are very cautious about any change made to the production environment. Even small changes are made with a healthy amount of paranoia. Something as major as a kernel upgrade is not taken lightly and goes through a lot of scrutiny before reaching our production servers. Even when we consider an upgrade fit for production, we start with limited deployments on non-customer-impacting servers. As the upgrade proves itself reliable, we begin rolling it out to other environments.

To give you an idea of how long this can take, this particular kernel spent about 45 days running problem-free across approximately 400 servers in our production environment prior to its final rollout to the Dovecot servers. This was after it spent significant time first in our development and subsequently in our QA environments. We do use NFS extensively across all of these environments. We specifically chose this release because of the major enhancements for NFS that it included.

Unfortunately, even with rigorous testing, the reality is that in any environment, bugs do sometimes make it into production. We have many layers of protection in place to mitigate these. In this particular case, the kernel problem combined with the Dovecot bug quickly pushed our load to literally ten times what we normally experience. We do make sure our systems have plenty of spare capacity and can handle a lot of extra load. Unfortunately, that spare capacity does not stretch to the ten times regular load we experienced last week.

Have you considered switching from Linux to Solaris? Even though I’ve been primarily using Linux for 14+ years, I’ve seen places where Solaris’ NFSv4 works better than Linux’s. If you’re a heavy NFS shop, perhaps you should consider it, or at least evaluate it and see how it works out. (Greg Youngblood)

This may be the case for NFSv4 at the moment. We don’t believe that NFSv4 is proven enough for us to use in production for now. NetApp strongly recommends using the ‘General Availability’ (GA) version of ONTAP (the platform OS) if we are going to implement NFSv4. However, we are not comfortable using anything other than the ‘General Deployment’ (GD) releases, as these are the production-proven versions. ONTAP GA is the equivalent of beta for NetApp and has only a very limited production deployment across their customer base. The GD releases are the most widely deployed, most stable versions. The short answer is that we won’t feel NFSv4 is suitable for production for at least another six to twelve months.

Dovecot Questions

Was the Dovecot bug that was fixed related to writing to the index file after the timeout? Or is that issue still there? (Greg Youngblood)

We worked with Timo Sirainen, author and primary maintainer of Dovecot, to patch the Dovecot source. These patches are currently in production across our production environment. Future versions of Dovecot will include these changes as standard.

If you implement bounces (preferably timeouts during delivery so messages stay in queues and don’t get lost assuming you can deliver them within 5-7 days), can you make it adjustable (if possible) with a setting in MAC or reseller interface? I can certainly understand why some would want bounce notifications, but probably not everyone will. I am on several lists that auto unsubscribe you on bounces, so for myself personally I prefer not to have them bounced. (Greg Youngblood)

Once we accept mail into our system, it will not bounce. Our architecture is tiered such that even if the underlying mail servers are unavailable, mail is queued internally until such time as it is delivered. Mail that is rejected does bounce at the perimeter only.

What can be done by OpenSRS about secondary MX records? Whilst emails are being queued by the primary MX, I presume that nothing would go to a secondary MX.

Currently the only way mail would go to the secondary MX is if the primary MX does not accept mail, and we have yet to experience that situation. Even through the worst spam attacks we have had sufficient capacity to accept valid mail.
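For readers less familiar with MX fallback, the following Python sketch shows the general rule a sending mail server follows: try hosts in ascending preference order and move on only when one does not accept the connection. The host names and the `accepts` callback are hypothetical stand-ins; this is not a description of OpenSRS's own configuration.

```python
# Sketch of how a sending mail server might pick among MX hosts.
# Host names and the `accepts` callback are hypothetical.

def choose_mx(mx_records, accepts):
    """Try MX hosts in ascending preference order; return the first that accepts."""
    for preference, host in sorted(mx_records):
        if accepts(host):
            return host
    return None  # nobody accepted; the sender keeps the message queued and retries

records = [(20, "mx2.example.net"), (10, "mx1.example.net")]
print(choose_mx(records, lambda host: True))                       # mx1.example.net
print(choose_mx(records, lambda host: host != "mx1.example.net"))  # mx2.example.net
```

As long as the primary keeps accepting and queueing mail, the second branch is never taken.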

System Architecture/Miscellaneous Questions

What type of monitoring system are you using? Why wasn’t this abnormal system behavior caught by your monitoring system in the early stages?

We are using a number of systems to trend and monitor. Primarily we use Nagios for monitoring with the help of numerous custom plugins we developed to provide a more robust testing suite. In Nagios alone we have in excess of 1500 monitoring points across the Email platform. Furthermore, we use Munin to trend additional metrics for long term planning and visibility into scalability trends.

On a micro-scale of, say, a few thousand users, these bugs were individually negligible (and extremely difficult to detect). The user experience would have been normal during this period. It took many days of digging by an assembled team of our top engineers and system administrators working in shifts 24/7, in conjunction with some of the best developers in the Open Source world and Tier 3 NetApp engineers, to nail this down.

The cumulative nature of the kernel locking problem combined with the Dovecot reindexing bug, both compounded by millions of users accessing the system, was the confluence of issues that caused the outage.

As a result of this incident we have added a number of additional monitoring points that will add insight and provide an early warning in the future. While it would be almost impossible to detect the specific failure, we’re in a very good position to detect and track things like locks and other NFS/NetApp interactions.
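As a rough idea of what a custom Nagios check for this kind of signal might look like, here is a minimal Python sketch. The metric (a count of pending NFS lock requests), the thresholds, and the function name are hypothetical and not taken from our plugins; only the exit-code convention (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN) is the standard one Nagios expects.

```python
# Minimal Nagios-style check. The metric (pending NFS lock requests) and the
# thresholds are hypothetical; only the exit-code convention is standard.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3

def check_pending_locks(pending, warn=100, crit=500):
    """Return (exit_code, status_line) for a pending-lock count."""
    if pending is None or pending < 0:
        return UNKNOWN, "UNKNOWN - could not read pending lock count"
    if pending >= crit:
        return CRITICAL, f"CRITICAL - {pending} pending locks | pending={pending}"
    if pending >= warn:
        return WARNING, f"WARNING - {pending} pending locks | pending={pending}"
    return OK, f"OK - {pending} pending locks | pending={pending}"

code, line = check_pending_locks(120)
print(code, line)  # 1 WARNING - 120 pending locks | pending=120
```

A real plugin would read the count from the system, print the status line, and call `sys.exit(code)` so that Nagios picks up the state.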

Why wasn’t Cluster B affected? (Greg Youngblood)

Cluster B, much like the half of Cluster A which remained online, was in fact affected. Both Clusters use identical hardware and software. We were able to resolve the issue before we crossed the cascade threshold on those environments.

Why didn’t you move our mailboxes to Cluster B? (Edward Gore)

The short answer to this is that it would take well over a week to migrate every user from Cluster A to Cluster B. We are not able to disclose the exact amount of space being used by mailboxes, but it is measured in many terabytes. Cluster B resides in a separate, geographically distant data center, so we would be limited to Internet transfer speeds between these two data center providers in order to conduct such a migration.
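The arithmetic behind "well over a week" is easy to sketch. The Python snippet below uses purely hypothetical figures (50 TB of mail and a sustained 500 Mbit/s link between data centers); the real volume and bandwidth are not disclosed here, and the function name is an assumption.

```python
def transfer_days(terabytes, megabits_per_second):
    """Days needed to move `terabytes` (decimal TB) at a sustained link rate."""
    bits = terabytes * 1e12 * 8
    seconds = bits / (megabits_per_second * 1e6)
    return seconds / 86400

# Hypothetical figures: 50 TB over a sustained 500 Mbit/s link.
print(round(transfer_days(50, 500), 1))  # 9.3
```

This counts raw transfer time only; verification, protocol overhead and cut-over would add to it.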

Saying ’40% of clients of Cluster A’ is all well and good, but who is that? Can you not provide information on exactly WHICH mailboxes are affected – via an API would be useful, we could then work with our clients who ARE affected! (Paul O’Hanlon)

Our hashing algorithm that distributes users across the cluster is highly efficient. We have found that it very evenly distributes users on a large scale. It is possible for us to provide lists of users on affected mailstores although it is unlikely it will be a feature that will be implemented into the API. In the case of last week’s incident, producing those reports would have pulled our engineers away from resolving the root cause and restoring mailboxes.
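To show the general idea of hash-based placement, here is a small Python sketch. It is not our actual algorithm: the use of SHA-256, the store count of 10, and the function name `mailstore_for` are all assumptions chosen for illustration.

```python
import hashlib
from collections import Counter

def mailstore_for(user, store_count):
    """Map a mailbox name to a store number using a stable hash."""
    digest = hashlib.sha256(user.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % store_count

# 10,000 made-up mailboxes spread over 10 hypothetical stores.
users = [f"user{i}@example.com" for i in range(10000)]
counts = Counter(mailstore_for(u, 10) for u in users)
print(sorted(counts.values()))  # each store ends up with roughly 1,000 users
```

With a scheme like this, the users on an affected mailstore are simply those whose hash maps to that store, which is why a list can be produced but cuts across every reseller rather than following account boundaries.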

After an outage, I think users are ready to understand they cannot access their old mail for a while (< 48 hours) but they expect to be able to send and receive new mail within 4 hours on a backup system where empty mailboxes would have already been created in advance. Have you considered this idea? (Augustin L)

We have considered this and are still investigating options. There are some considerations which need to be carefully weighed, especially surrounding clients such as Outlook/Thunderbird/MacMail coming out of message UID sync with the backend when all of a sudden historical mail is not available. One possibility is to offer a webmail-only emergency solution, though this is also not entirely ideal. We’ll keep you posted.
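The UID sync concern can be illustrated with the rule IMAP clients follow: cached messages are only valid while the mailbox's UIDVALIDITY value is unchanged, and a UID missing from the server is treated as deleted. The Python sketch below models that rule; the cache layout and the function name `reconcile` are hypothetical, and real clients differ in the details.

```python
# Sketch of the client-side rule from the IMAP specification: cached UIDs are
# only meaningful while the mailbox's UIDVALIDITY value stays the same.
# The cache layout and function name are hypothetical.

def reconcile(cache, server_uidvalidity, server_uids):
    """Return the cache a client should hold after selecting the mailbox."""
    if cache["uidvalidity"] != server_uidvalidity:
        # Different mailbox incarnation: every cached message is discarded.
        return {"uidvalidity": server_uidvalidity, "messages": {}}
    # Same incarnation: keep only messages the server still lists.
    kept = {uid: msg for uid, msg in cache["messages"].items()
            if uid in server_uids}
    return {"uidvalidity": server_uidvalidity, "messages": kept}

cache = {"uidvalidity": 1001, "messages": {1: "old mail", 2: "older mail"}}
print(reconcile(cache, 2002, set()))  # {'uidvalidity': 2002, 'messages': {}}
print(reconcile(cache, 1001, set()))  # {'uidvalidity': 1001, 'messages': {}}
```

Either way, pointing a client at a fresh empty mailbox can cause it to drop its local copies of historical mail, which is the risk that has to be weighed.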

Is this reindexing you did the same or similar to what you did in August?

No, this was a file-level reindex of the user’s mailbox. In August we had multiple hardware failures that resulted in hard drives requiring a RAID-level rebuild from parity.

Shouldn’t the system reindex itself?

Yes, it should, and once we put the Dovecot software patches in place the reindexing was successfully completed by the system itself. Cluster B and half of Cluster A were left to naturally reindex but we chose to take part of Cluster A offline entirely in order to perform a global reindex because of the sheer number of mailboxes which were affected in that specific portion of the Cluster. This reduced the time to restore service.

While mailboxes are evenly distributed, user access fluctuates and is not entirely predictable on a large scale. While access load does balance itself out in the grand scheme, there are times when some portions of the cluster are more used than others.

Why wouldn’t you have redundancies in place to avoid this?

There are many layers of redundancy currently in place on both the hardware and software fronts. Redundancy was not the solution in this particular case. We would have needed to have an order of magnitude more capacity/redundancy to be able to ‘weather’ this, and even then it likely would not have been enough. The best way to avoid this type of situation in the future is through early detection and I strongly believe we’re in a very good position to do that now.

Might splitting your architecture into smaller but more manageable systems be an option? (Augustin L)

Our current architecture evolved from that type of environment so we know firsthand the caveats of splitting the system into smaller sub-systems. Splitting up the Clusters further would, in fact, make them less manageable and less able to distribute load. During our time running the old platform we saw a lot of this and it translated to a much more unpredictable user experience. In short, smaller sub-systems do not scale well.

However, we have engineered in the benefits of smaller manageable systems, as evidenced by only a portion of the cluster being affected.

Fall is a Busy Time: New Domains, Prices and More

When I sat down to put together another post for our blog, I quickly realized there was far too much going on both at OpenSRS and the industry to focus on one topic. As a result, here’s a recap of what’s taking place both at home and abroad.

Here at OpenSRS, it’s been a busy few weeks:

  • Our UK service is now running on EPP. What is EPP, you ask? It’s the protocol we use to communicate with most major registries, as it simplifies domain name provisioning and management. Until recently, UK did not offer an EPP solution; by moving to EPP, we expect our UK service to become faster and even more reliable.
  • .ME domain names can now be registered for one year.
  • We have several new promotions and price reductions: The registry fee for .ASIA domains has been cut in half, new .AT registrations are available for only $10, .TV names are only $25 and our .INFO promotion has been extended until the end of the year. You can view all of our domain name discounts here.
  • .INFO and .ORG registry fees are set to increase on November 1 and November 9, respectively. The .INFO registry fee will increase by $0.60 to $6.75; .ORG increases by $0.60 to $6.75. Note these fee increases are coming from the Registries themselves, and not OpenSRS.

A fair bit has also been happening outside of OpenSRS:

  • The .NAME registry has been acquired by VeriSign, the operator of many major registries including .COM, .NET and .TV. This is an interesting move because it affirms VeriSign’s belief in the personal web, something we’ve believed in for years. We expect little to change on a day-to-day basis, but it is worth noting that .NAME domain names have become increasingly popular among our resellers over the past 18 months.
  • The .TEL registry is preparing for launch. This is the newest gTLD to be introduced, another interesting take on how businesses and individuals might use domain names to manage their contact information and disparate websites. Look for more information on our plans to support the .TEL launch shortly.

2009 is fast approaching, but there’s still plenty of news and developments to come. ICANN’s 33rd International Public meeting takes place in a couple of weeks in Cairo, Egypt, and I expect we’ll see some more discount announcements before the year is over. As always, we’ll keep you informed as news develops.

Thanks to Flickr user CTD 2005 for the fall foliage photo and for releasing it under Creative Commons.

Summary of Recent Cluster A Email Service Issues

I’d like to provide further details about poor service we provided to many of our resellers on Cluster A of our Email Service last week.

As promised, we’re conducting a detailed post-mortem but I wanted to kick things off by providing you with some high-level analysis of what happened and what action we took.

We have prepared Incident Report #2993 – October 14, 2008 (260K PDF) as the first part of our analysis.

In the coming days, we’ll be addressing some of the deeper issues brought to light by this incident through an even more technical FAQ that is currently in the works.

As our CEO Elliot Noss expressed in his Open Letter, we’re very sorry this happened in the first place and we’re determined to do everything we can to make sure it doesn’t happen again. We want to thank you for the many words of constructive advice you have provided and we can assure you that we’ll be considering every suggestion.

In Elliot’s letter, he mentioned that in addition to dedicating ourselves to reliability, we are committed to taking other elements of our email service to a new level including: monitoring, change management, emergency protocols and procedures. In the coming weeks, we’ll be posting more about our plans. As always, we welcome your feedback.

Comparison of this incident with the August service interruption

Many of you have been asking us why we have had two outages on the same cluster within a period of three months. We wanted to clarify that this was NOT a reoccurrence of the same issue that caused the service interruption in August. I have published the incident report for the August incident below to allow you to compare, but to summarize briefly:

  1. The August outage was the result of a shelf controller hardware failure. After replacing the defective hardware, we had to rebuild the RAID groups. This process had to be completed sequentially, meaning that we could only bring mailstores back online one volume at a time. After that incident, we made architecture changes to prevent a similar hardware failure from triggering a rebuild. (Incident Report #1991 – August 18, 2008 (344k PDF))
  2. Last week’s degradation in service was caused by two separate issues (one in the underlying Linux kernel and one in the Dovecot mail server software) which caused corruption in the mail server indexes. This led to an abnormally high server load as users trying to connect received timeout messages and then tried to reconnect. The resulting logjam as all login slots were filled led to more timeouts and degraded service for about 40% of users on Cluster A (or about 20% of all Email Service users). It took us longer to diagnose because we had to rule out a hardware problem first. After that was confirmed, further investigations had to be completed at the same time as we were moving mailboxes to new hardware in an attempt to alleviate the high server loads. Once the problems were diagnosed, we were able to work with some of the top contributors from the Linux kernel and the Dovecot mail server open source communities to develop and apply patches as quickly as possible. Unfortunately, the second bug wasn’t discovered until we had completed reindexing the mailboxes after patching the first problem, leading to a longer than anticipated service disruption.

Once again I’d like to personally apologize for the inconvenience to you our Resellers and to your customers.

We’ll include more posts on this issue and our efforts to make sure it doesn’t happen again in the upcoming days and weeks.

Open Letter To Our Email Service Resellers

Dear Resellers,

I am writing today to speak to you directly about what happened this week with Cluster A of our Email Service. This will not refer to specific elements of the outage; there are other venues for that. The things I most want to communicate are my deep sorrow, why it won’t happen again and what we will do for you.

More than anything, one thought keeps going through my mind as I think about this: the future determines the past. I will return to this thought.

First, and most importantly, we are sorry. I am sorry. I have been in this business a long time and do not know if I have ever been more sad about what we have done to you, to your customers and to how people think about us. An email outage in 1995 was different from one in 2000 and even more different from one in 2008. I know what this does to your reputations, to your customers and to your staff – and I and so many people here are just sad about that.

While it seems trite right now, we really define ourselves by how we make it easier for you in your businesses and with your customers and in our deep understanding of those relationships. That means the pain here is that much greater and believe me I know our pain here does not matter, yours does. Just know we are grieving.

Second, what will we do about it and why will this never happen again? I know for some of you that doesn’t matter, you are done with us, but I want to express this for the rest of you. Let me start here with things that were not the problem: old equipment, people, capacity or redundancy. The equipment is new, the people are great, we have plenty of capacity and redundancy. What this will mean for us is clearly the need to take the other elements of the service to a completely new level. Here I mean monitoring, change management, emergency protocols and procedures and operating efficiencies.

We had decided long before this that the most important part of email was reliability, not features, not groupware, not web 2.0 integration but reliability and deliverability. I have been at this a long time and really believe that these people and this service can be the best in the world, better than Google, Yahoo or Microsoft and most importantly the best partner for service providers. We owe you this and will deliver it.

Lastly, what will we do for you as a result of this? Let me start here by saying two things: we will certainly be doing something, and there is nothing we can do that will make up for your loss of reputation in your customers’ eyes. We know that. The people who will participate in that decision are fried right now, as I know even in your anger you can well imagine. I will ask your indulgence that you give us this week to make our plan in this regard.

There is one thing that I can offer now. I would like to make myself personally available to any of you who would like me to either reach out to your customers, or to any specific customer, with a letter, an email or a phone call. I know this will not often matter but perhaps in a few cases it might. My message here would be simple, this was our fault not yours and while you are responsible for the suppliers you pick, you had good reason to pick us and it was us who let you down. This offer stands whether you are leaving or staying.

In closing, the future determines the past. If we move forward and run the most reliable, service-provider focused email service the world has ever seen, this will be remembered as the few days that turned it around, as being a very important event in forging our mutual future. If we have no change in reliability or in service levels this will barely be remembered. It will just be a point on a mediocre line. I will do everything in my power to make it the former not the latter.

Regards,

Elliot Noss
