<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
		>
<channel>
	<title>Comments on: Technical Debrief on October Cluster A Email Service Issue</title>
	<atom:link href="http://www.opensrs.com/blog/2008/10/technical-debrief-on-october-cluster-a-email-service-issue/feed/" rel="self" type="application/rss+xml" />
	<link>http://www.opensrs.com/blog/2008/10/technical-debrief-on-october-cluster-a-email-service-issue/?utm_source=rss&#038;utm_medium=rss&#038;utm_campaign=technical-debrief-on-october-cluster-a-email-service-issue</link>
	<description>Happenings at OpenSRS. Talk of Domain Names, Email and SSL</description>
	<lastBuildDate>Sat, 04 Feb 2012 11:59:00 +0000</lastBuildDate>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.3.1</generator>
	<item>
		<title>By: Ken Schafer</title>
		<link>http://www.opensrs.com/blog/2008/10/technical-debrief-on-october-cluster-a-email-service-issue/#comment-755</link>
		<dc:creator>Ken Schafer</dc:creator>
		<pubDate>Sun, 19 Oct 2008 18:58:05 +0000</pubDate>
		<guid isPermaLink="false">http://opensrs.com/index.php?option=com_wordpress&#038;p=554&#038;Itemid=149#comment-755</guid>
		<description>@gsyoungblood @3 - Not sure we can answer all your questions at this point, but thanks for raising them. I&#039;ll make sure the tech guys see them.

As to your final question, we&#039;ll be speaking directly to all affected Resellers about compensation - and very soon.</description>
		<content:encoded><![CDATA[<p>@gsyoungblood @3 &#8211; Not sure we can answer all your questions at this point, but thanks for raising them. I&#8217;ll make sure the tech guys see them.</p>
<p>As to your final question, we&#8217;ll be speaking directly to all affected Resellers about compensation &#8211; and very soon.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: gsyoungblood</title>
		<link>http://www.opensrs.com/blog/2008/10/technical-debrief-on-october-cluster-a-email-service-issue/#comment-754</link>
		<dc:creator>gsyoungblood</dc:creator>
		<pubDate>Sun, 19 Oct 2008 18:34:01 +0000</pubDate>
		<guid isPermaLink="false">http://opensrs.com/index.php?option=com_wordpress&#038;p=554&#038;Itemid=149#comment-754</guid>
		<description>I certainly understand the likelihood that Dovecot had that particular bug for a while. That&#039;s not what I was talking about.

At some point, in one of the responses (or the video; I don&#039;t remember which at the moment), it was mentioned that someone had listed open NFS locks and found 23-25 thousand open locks (now that I think about it, it probably was the video). Finding those helped point to where the real problem was.

In my reading of your post, it seems to me that it was a combination of the Linux upgrade with the Dovecot bug that caused the problem. That Dovecot had the bug isn&#039;t the root of my questions/comments above.

So, the root cause (to me) is the Linux upgrade to 2.6.19 which has a problem working with NetApp filers using TCP NFS. [I won&#039;t comment on whether this is a Linux bug or NetApp bug.]  The Dovecot bug just exercised the underlying issue and raised the Dovecot problem to a higher level of visibility.

Also, at one point there was a mention of another Dovecot issue -- that it would write to index files after a timeout, thereby corrupting the file. I don&#039;t see that mentioned above (unless I missed it).

Regardless of the Dovecot issues, let me return to the Linux/NetApp interaction. How much of that &quot;45 days&quot; of testing was done with Linux as an NFS client to NetApp filers simulating the workload that the mail servers generate?

You mention early detection would have been key to avoiding this issue and that now you are in a position to watch for this. Every failure is a learning opportunity and a chance to improve monitoring points and early warning signals for possible problems. My fear is that we have to keep experiencing outages for these data points to be found. I realize it&#039;s a virtually impossible task to watch for everything and that we don&#039;t know what we don&#039;t know, but at the same time, given the description of the problems, I keep coming back to the idea that there should have been early warning signs -- things like IO load increase on NetApps due to rebuilding indexes before there were so many as to cripple the box, the NFS locks mentioned, number of reindexes/rebuilds necessary, response time to certain requests, etc. All of these would seem to have been an early indicator that something was going on. Were these not being monitored before? If so, what&#039;s changed so that &quot;[you]&#039;re in a very good position to [have early detection] now&quot; and why wasn&#039;t it there before?

For example, have you made the number of NFS locks a data point in your monitoring? Do you now measure response times? The number of logins that mysteriously go away?

And, what kind of monitoring system do you have in place? Hopefully one that provides early warnings based on thresholds and not one that reports that a failure has occurred after the fact.

The last question I have is not one you can answer yet (I wouldn&#039;t think), though I think you should be able to answer soon. I would like to see the new and improved disaster recovery plan, and what it will take to trigger it, should there be a future outage that takes users offline. This disaster recovery plan should include 5 sections:
1. What will trigger implementation of the plan or phases of the plan.
2. What will be done to provide immediate access to new incoming/outgoing messages if archives are not available.
3. What will be done to prevent NetApp volume or massive Dovecot rebuilding/reindexing from delaying restoration of user services (i.e. if it takes 3+ days to rebuild volumes or reindex lots of mailboxes, that should NOT be 3+ additional days of outages for users)
4. If users are moved as part of an immediate restoration of service plan, how will archived email be integrated back into the folders (i.e. user accounts would be returned to the original cluster once operational and mail on the temporary cluster copied back to their original account, or whatever your solution involves).
5. Assuming complete cluster failure (but no data loss due to the existence of backups, etc.), how long should it take to restore service (first, immediate access to send and receive new messages; second, complete restoration of archived messages too). Presumably, partial cluster failure will take the same amount of time or less, and that should be indicated as well.

Finally, other than enoss&#039; willingness to put a face to the issue with our customers, what else will be done to compensate for the damages suffered?

Thanks again for the detailed answers here. While it doesn&#039;t lessen the sting caused by the outage, it does make up for some of the frustration about the lack of information experienced during the outage and the apparent lack of specificity in the August outage.
–
OpenSRS Reseller/User group on LinkedIn.
http://www.linkedin.com/e/gis/1012737</description>
		<content:encoded><![CDATA[<p>I certainly understand the likelihood that Dovecot had that particular bug for a while. That&#8217;s not what I was talking about.</p>
<p>At some point, in one of the responses (or the video; I don&#8217;t remember which at the moment), it was mentioned that someone had listed open NFS locks and found 23-25 thousand open locks (now that I think about it, it probably was the video). Finding those helped point to where the real problem was.</p>
<p>In my reading of your post, it seems to me that it was a combination of the Linux upgrade with the Dovecot bug that caused the problem. That Dovecot had the bug isn&#8217;t the root of my questions/comments above.</p>
<p>So, the root cause (to me) is the Linux upgrade to 2.6.19 which has a problem working with NetApp filers using TCP NFS. [I won't comment on whether this is a Linux bug or NetApp bug.]  The Dovecot bug just exercised the underlying issue and raised the Dovecot problem to a higher level of visibility.</p>
<p>Also, at one point there was a mention of another Dovecot issue &#8212; that it would write to index files after a timeout, thereby corrupting the file. I don&#8217;t see that mentioned above (unless I missed it).</p>
<p>Regardless of the Dovecot issues, let me return to the Linux/NetApp interaction. How much of that &#8220;45 days&#8221; of testing was done with Linux as an NFS client to NetApp filers simulating the workload that the mail servers generate?</p>
<p>You mention early detection would have been key to avoiding this issue and that now you are in a position to watch for this. Every failure is a learning opportunity and a chance to improve monitoring points and early warning signals for possible problems. My fear is that we have to keep experiencing outages for these data points to be found. I realize it&#8217;s a virtually impossible task to watch for everything and that we don&#8217;t know what we don&#8217;t know, but at the same time, given the description of the problems, I keep coming back to the idea that there should have been early warning signs &#8212; things like IO load increase on NetApps due to rebuilding indexes before there were so many as to cripple the box, the NFS locks mentioned, number of reindexes/rebuilds necessary, response time to certain requests, etc. All of these would seem to have been an early indicator that something was going on. Were these not being monitored before? If so, what&#8217;s changed so that &#8220;[you]&#8217;re in a very good position to [have early detection] now&#8221; and why wasn&#8217;t it there before?</p>
<p>For example, have you made the number of NFS locks a data point in your monitoring? Do you now measure response times? The number of logins that mysteriously go away?</p>
<p>And, what kind of monitoring system do you have in place? Hopefully one that provides early warnings based on thresholds and not one that reports that a failure has occurred after the fact.</p>
<p>The last question I have is not one you can answer yet (I wouldn&#8217;t think), though I think you should be able to answer soon. I would like to see the new and improved disaster recovery plan, and what it will take to trigger it, should there be a future outage that takes users offline. This disaster recovery plan should include 5 sections:<br />
1. What will trigger implementation of the plan or phases of the plan.<br />
2. What will be done to provide immediate access to new incoming/outgoing messages if archives are not available.<br />
3. What will be done to prevent NetApp volume or massive Dovecot rebuilding/reindexing from delaying restoration of user services (i.e. if it takes 3+ days to rebuild volumes or reindex lots of mailboxes, that should NOT be 3+ additional days of outages for users)<br />
4. If users are moved as part of an immediate restoration of service plan, how will archived email be integrated back into the folders (i.e. user accounts would be returned to the original cluster once operational and mail on the temporary cluster copied back to their original account, or whatever your solution involves).<br />
5. Assuming complete cluster failure (but no data loss due to the existence of backups, etc.), how long should it take to restore service (first, immediate access to send and receive new messages; second, complete restoration of archived messages too). Presumably, partial cluster failure will take the same amount of time or less, and that should be indicated as well.</p>
<p>Finally, other than enoss&#8217; willingness to put a face to the issue with our customers, what else will be done to compensate for the damages suffered?</p>
<p>Thanks again for the detailed answers here. While it doesn&#8217;t lessen the sting caused by the outage, it does make up for some of the frustration about the lack of information experienced during the outage and the apparent lack of specificity in the August outage.<br />
–<br />
OpenSRS Reseller/User group on LinkedIn.<br />
<a href="http://www.linkedin.com/e/gis/1012737" rel="nofollow">http://www.linkedin.com/e/gis/1012737</a></p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Ken Schafer</title>
		<link>http://www.opensrs.com/blog/2008/10/technical-debrief-on-october-cluster-a-email-service-issue/#comment-753</link>
		<dc:creator>Ken Schafer</dc:creator>
		<pubDate>Sun, 19 Oct 2008 05:42:19 +0000</pubDate>
		<guid isPermaLink="false">http://opensrs.com/index.php?option=com_wordpress&#038;p=554&#038;Itemid=149#comment-753</guid>
		<description>@gsyoungblood - Thanks for the comment. Let me tackle a couple of things you mentioned that I&#039;ve got some background on.

It&#039;s quite likely that Dovecot has ALWAYS had this problem. It wasn&#039;t related to a specific upgrade and seems to have only caused problems when combined with the specific issues the NetApp/Linux interaction caused.

We have indeed made considerable cost reductions in our operations in the last year. These did not come from skimping on the Email Service; rather, they came from SHUTTING DOWN two older versions of the Email Service that some customers were running on until this summer. Essentially we cleaned up huge inefficiencies and then reinvested a bunch into the new system, but we were still left with big savings.

Cheers,

Ken Schafer
VP, Product Management OpenSRS</description>
		<content:encoded><![CDATA[<p>@gsyoungblood &#8211; Thanks for the comment. Let me tackle a couple of things you mentioned that I&#8217;ve got some background on.</p>
<p>It&#8217;s quite likely that Dovecot has ALWAYS had this problem. It wasn&#8217;t related to a specific upgrade and seems to have only caused problems when combined with the specific issues the NetApp/Linux interaction caused.</p>
<p>We have indeed made considerable cost reductions in our operations in the last year. These did not come from skimping on the Email Service; rather, they came from SHUTTING DOWN two older versions of the Email Service that some customers were running on until this summer. Essentially we cleaned up huge inefficiencies and then reinvested a bunch into the new system, but we were still left with big savings.</p>
<p>Cheers,</p>
<p>Ken Schafer<br />
VP, Product Management OpenSRS</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: gsyoungblood</title>
		<link>http://www.opensrs.com/blog/2008/10/technical-debrief-on-october-cluster-a-email-service-issue/#comment-752</link>
		<dc:creator>gsyoungblood</dc:creator>
		<pubDate>Sat, 18 Oct 2008 22:52:17 +0000</pubDate>
		<guid isPermaLink="false">http://opensrs.com/index.php?option=com_wordpress&#038;p=554&#038;Itemid=149#comment-752</guid>
		<description>Thank you very much for going through these questions and answering them directly.

Now that you have a fairly unique and seemingly well defined failure scenario for Linux, NFS, and NetApp, will you be creating a test environment to simulate these kinds of work loads before making future upgrades?

I don&#039;t know if I&#039;d be that willing to let NetApp off the hook completely. 2.6.19 is a fairly old kernel. It seems that this type of interaction should be pretty well known by now for NetApp users working with Linux. That said, it&#039;s been a long time since I&#039;ve known exactly what changes were going into my kernel upgrades too, so I&#039;m not holding the fact that this slipped by against you.

I still think there is a weakness in your validation before upgrading the dovecot servers. It seems like they have unique work load properties that were not tested for before the upgrade. Plus, the number of NFS locks was so large, it seems like those too could have been checked/caught/raised suspicion at some point before critical/catastrophic failure.

Scaling out as large as these systems are presents a unique challenge -- I know and fully understand that. That combined with the quality of service promised and expected also requires a fair amount of expense to have realistic lab environment(s) for testing/validating changes.

There have been comments about how you have managed to cut costs necessary to provide this service. I&#039;ve seen several places that look at cutting lab/testing environment budgets early on as those areas seem to be more overhead than revenue generating.

It seems like some of the warning indicators should have been seen in lab/testing before this was rolled into prod. If it was an honest mistake that these were missed, I can accept that. I&#039;m more concerned that it seems your testing of workloads similar to your mail environment was not adequate.

–
OpenSRS Reseller/User group on LinkedIn.
http://www.linkedin.com/e/gis/1012737</description>
		<content:encoded><![CDATA[<p>Thank you very much for going through these questions and answering them directly.</p>
<p>Now that you have a fairly unique and seemingly well defined failure scenario for Linux, NFS, and NetApp, will you be creating a test environment to simulate these kinds of work loads before making future upgrades?</p>
<p>I don&#8217;t know if I&#8217;d be that willing to let NetApp off the hook completely. 2.6.19 is a fairly old kernel. It seems that this type of interaction should be pretty well known by now for NetApp users working with Linux. That said, it&#8217;s been a long time since I&#8217;ve known exactly what changes were going into my kernel upgrades too, so I&#8217;m not holding the fact that this slipped by against you.</p>
<p>I still think there is a weakness in your validation before upgrading the dovecot servers. It seems like they have unique work load properties that were not tested for before the upgrade. Plus, the number of NFS locks was so large, it seems like those too could have been checked/caught/raised suspicion at some point before critical/catastrophic failure.</p>
<p>Scaling out as large as these systems are presents a unique challenge &#8212; I know and fully understand that. That combined with the quality of service promised and expected also requires a fair amount of expense to have realistic lab environment(s) for testing/validating changes.</p>
<p>There have been comments about how you have managed to cut costs necessary to provide this service. I&#8217;ve seen several places that look at cutting lab/testing environment budgets early on as those areas seem to be more overhead than revenue generating.</p>
<p>It seems like some of the warning indicators should have been seen in lab/testing before this was rolled into prod. If it was an honest mistake that these were missed, I can accept that. I&#8217;m more concerned that it seems your testing of workloads similar to your mail environment was not adequate.</p>
<p>–<br />
OpenSRS Reseller/User group on LinkedIn.<br />
<a href="http://www.linkedin.com/e/gis/1012737" rel="nofollow">http://www.linkedin.com/e/gis/1012737</a></p>
]]></content:encoded>
	</item>
</channel>
</rss>

