Status Update on the Community

JohnG · 11-13-2009, 12:17 AM

Our apologies for the unexpected downtime today. What follows is a technical service update for those of you who are interested.

At approximately 3:00pm ET, we were notified by a log service on one of the database servers that one of the drives was experiencing sudden, unexpected failures. While we were attempting to recover the drive, it failed at 4:30pm ET.

It was not a catastrophic failure, however, so we were able to make an additional backup of the most recent data from all forums (otherwise the data would've come from a 4-day old backup). This took about an hour.

We then shut down the server and replaced the faulty drive. A full OS reinstall took about 2 hours, and the restore of all backup data and cleaning of the database tables took about another hour and a half.

As you know, we take our uptime here very seriously and worked immediately on resolving this issue as soon as the drive started failing. Given the way it failed, we were fortunate to not suffer from any data loss.

We are back up and running 100%. Please let me know if you notice any problems or unexpected errors on the forums. Thank you.

Again, we apologize for the unexpected downtime and appreciate your patience.

Best,
JohnG

gbrandwood · 11-13-2009, 08:53 AM

I didn't notice the downtime. The fact that you responded so promptly is good news for all of us and I really appreciate you keeping my favourite chat site on-line. Thanks! And thanks for letting us all know the details.

But, since the rebuild, I note that my reputation has dropped down to only three bars. I'm sure I had at least a dozen prior to the failure....

Well done to all involved.

Joushou · 11-13-2009, 10:29 AM

Quote:

Originally Posted by gbrandwood

I didn't notice the downtime. The fact that you responded so promptly is good news for all of us and I really appreciate you keeping my favourite chat site on-line. Thanks! And thanks for letting us all know the details.

But, since the rebuild, I note that my reputation has dropped down to only three bars. I'm sure I had at least a dozen prior to the failure....

Well done to all involved.

Good, so I'm not the only one suddenly, ahem, "losing" points!

emrnyc · 11-13-2009, 11:12 PM

Good Work.... and Thank You

Quote:

Originally Posted by JohnG

Our apologies for the unexpected downtime today. What follows is a technical service update for those of you who are interested.

At approximately 3:00pm ET, we were notified by a log service on one of the database servers that one of the drives was experiencing sudden, unexpected failures. While we were attempting to recover the drive, it failed at 4:30pm ET.

It was not a catastrophic failure, however, so we were able to make an additional backup of the most recent data from all forums (otherwise the data would've come from a 4-day old backup). This took about an hour.

We then shut down the server and replaced the faulty drive. A full OS reinstall took about 2 hours, and the restore of all backup data and cleaning of the database tables took about another hour and a half.

As you know, we take our uptime here very seriously and worked immediately on resolving this issue as soon as the drive started failing. Given the way it failed, we were fortunate to not suffer from any data loss.

We are back up and running 100%. Please let me know if you notice any problems or unexpected errors on the forums. Thank you.

Again, we apologize for the unexpected downtime and appreciate your patience.

Best,
JohnG

SnickleFritz · 11-14-2009, 11:23 PM

Quote:

Originally Posted by gbrandwood

But, since the rebuild, I note that my reputation has dropped down to only three bars. I'm sure I had at least a dozen prior to the failure....

Feel'n your pain!

Bob.Kerns · 11-15-2009, 06:17 AM

Quote:

Originally Posted by JohnG

Our apologies for the unexpected downtime today. What follows is a technical service update for those of you who are interested.

At approximately 3:00pm ET, we were notified by a log service on one of the database servers that one of the drives was experiencing sudden, unexpected failures. While we were attempting to recover the drive, it failed at 4:30pm ET.

It was not a catastrophic failure, however, so we were able to make an additional backup of the most recent data from all forums (otherwise the data would've come from a 4-day old backup). This took about an hour.

We then shut down the server and replaced the faulty drive. A full OS reinstall took about 2 hours, and the restore of all backup data and cleaning of the database tables took about another hour and a half.

As you know, we take our uptime here very seriously and worked immediately on resolving this issue as soon as the drive started failing. Given the way it failed, we were fortunate to not suffer from any data loss.

We are back up and running 100%. Please let me know if you notice any problems or unexpected errors on the forums. Thank you.

Again, we apologize for the unexpected downtime and appreciate your patience.

Best,
JohnG

John, I'm not one to criticize a volunteer effort. On the contrary, thank you, Frank, and anyone else involved in making all this possible!

However (you knew there had to be one, right?), 4-day-old backups, full OS reinstalls, restores, etc. offer quite a bit less reliability than what is now achievable, given manpower, expertise, and a bit of money. This sort of recovery can be done in minutes for something operating in the Amazon EC2 cloud, for example. Frequent incremental backups can speed up each backup and shorten the interval between backups, reducing data loss in the event of a catastrophe.
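To make the incremental idea concrete, here is a minimal Python sketch, not anything the site actually runs. It picks out only the files modified since the previous backup, which is why an incremental pass is quick enough to run often. The function name `changed_since`, the file names, and the timestamps are all made up for illustration, and a live database would normally be backed up through its own dump or log mechanism rather than by copying raw files.

```python
import os
import tempfile
from pathlib import Path


def changed_since(root: Path, since: float) -> list[Path]:
    """Return files under root whose modification time is later than `since`."""
    return sorted(
        p for p in root.rglob("*")
        if p.is_file() and p.stat().st_mtime > since
    )


# Demo with two hypothetical files and a made-up "last backup" timestamp.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    old_file = root / "old.dat"
    new_file = root / "new.dat"
    old_file.write_text("unchanged since the last backup")
    new_file.write_text("changed after the last backup")

    last_backup = 1_000_000.0  # arbitrary epoch seconds, for illustration only
    os.utime(old_file, (last_backup - 60, last_backup - 60))
    os.utime(new_file, (last_backup + 60, last_backup + 60))

    # Only the file touched after the last backup needs to be copied.
    to_copy = [p.name for p in changed_since(root, last_backup)]
    print(to_copy)
```

Because each pass copies only what changed, it can run far more often than a full backup, and the worst-case data loss shrinks to roughly the interval between passes.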

So my question is, is it worth discussing ways to improve the situation? Reduce the risk of data loss, and/or reduce the load on the volunteer administrators in the event of a problem?

I imagine I'm not the only one here with some expertise in the area.

I don't know the trade-offs here between manpower and money.

Gihgehls · 11-15-2009, 02:10 PM

In my line of work we never allow a single failed drive to take down an entire machine, be it desktop or server. I'd be happy to discuss any ways to improve the reliability of the site.

JohnG · 11-17-2009, 11:14 AM

Always open to suggestions. Generalized ideas about how things could be better run are always nice for a read, but get me to specific strategies you'd recommend that are cost-effective and I'm listening.

Amazon EC2 is not something that I found particularly affordable or easy to implement, and it's not exactly had a stellar track record in terms of downtime so far. We do run Amazon S3 cloud services for static content, but for db operations, I'm not convinced it's there yet. Happy to be shown otherwise.

John

Bob.Kerns · 11-17-2009, 01:56 PM

Quote:

Originally Posted by JohnG

Always open to suggestions. Generalized ideas about how things could be better run are always nice for a read, but get me to specific strategies you'd recommend that are cost-effective and I'm listening.

Amazon EC2 is not something that I found particularly affordable or easy to implement, and it's not exactly had a stellar track record in terms of downtime so far. We do run Amazon S3 cloud services for static content, but for db operations, I'm not convinced it's there yet. Happy to be shown otherwise.

John

I hesitate to throw a lot of specific ideas at you, because I don't know how you have things implemented now, nor your budget, current costs, available hardware, and perhaps most importantly, the relative importance of saving $$$ vs saving routine time, vs uptime, vs risk of losing data, vs risk of long recovery times.

Nor do I know your total traffic, the compute demands of the forum software, nor the load on the DB back-end.

However, let me throw one idea out there. Amazon EC2 isn't the only approach or idea, but let me pick a hybrid owned/EC2 scenario as my example.

Let's say you run your system on your own hardware, two boxes, one the front-end, one the back-end. Let's say you don't want to spend a lot of money, but you'd like to reduce downtime and risk of data loss, while making recovery be a low-stress operation.

So, one scenario would be to turn your OS images into an AWS AMI (Amazon Machine Image). To do this, you'd first separate your live data (including logs) and your OS, application, configuration, etc. Only the non-live stuff would be part of the AMI. The rest would live on another volume. This volume would then be replicated to an Amazon EBS volume.

Initially, your current configuration is master and live.

You also set up an S3 bucket to receive database logs. This gets all the DB changes as they're made, and serves as your hot backup for the data itself.

You modify your DB AMI with a startup script that slurps the logs from the S3 bucket, saves them for recovery purposes, and then goes live.

Then set up to launch your DB EC2 instance periodically, to slurp those logs, and then shut down once it has reached the live state.
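The "slurp the logs, then go live" step might look something like the Python sketch below. It is only an illustration under stated assumptions: the log segments are taken to be already downloaded from S3 into a local spool directory, the `segment-NNNNNN.log` naming scheme is invented, and `apply_segment` is a hypothetical stand-in for whatever actually feeds changes to the database. No real AWS or MySQL calls are shown.

```python
import tempfile
from pathlib import Path


def pending_segments(spool_dir: Path, last_applied: int) -> list[tuple[int, Path]]:
    """Return (sequence, path) pairs newer than last_applied, oldest first."""
    found = []
    for path in spool_dir.glob("segment-*.log"):
        seq = int(path.stem.split("-")[1])  # hypothetical naming: segment-000042.log
        if seq > last_applied:
            found.append((seq, path))
    return sorted(found)


def catch_up(spool_dir: Path, last_applied: int, apply_segment) -> int:
    """Apply segments in order, stopping at the first gap; return the new mark."""
    for seq, path in pending_segments(spool_dir, last_applied):
        if seq != last_applied + 1:
            break  # a segment is missing, so applying later ones would be unsafe
        apply_segment(path.read_text())
        last_applied = seq
    return last_applied


# Demo: segments 1, 2, 3 and 5 have arrived; segment 1 was applied earlier.
with tempfile.TemporaryDirectory() as tmp:
    spool = Path(tmp)
    for seq in (1, 2, 3, 5):
        (spool / f"segment-{seq:06d}.log").write_text(f"change {seq}\n")

    applied = []
    mark = catch_up(spool, last_applied=1, apply_segment=applied.append)
    print(mark, applied)
```

Stopping at a gap is a deliberate choice in this sketch: the periodic launch simply picks up where it left off next time, once the missing segment has arrived.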

You'll spend only a few bucks getting this set up -- you'll probably spend more on caffeine while you do it. (There's a bit of a learning curve).

Now, to recover, you just launch one or both of your EC2 instances. If you just do the back-end, you reconfigure the front-end to talk to the EC2 back-end. You can set it up with a VPN to be able to do the front-end and not the back-end, but I don't recommend it. You'll incur more IO charges.

Switch over your DNS, and you're back on the air. Take a full dump of your DB, and start spooling logs to an S3 bucket, and you're now set with your local installation as the redundant piece. When you've recovered your local hardware, reverse the process, and shut down the EC2 instances.

Costs? Very low in normal operation -- mostly just the S3 storage and IO for the logs, plus a few bucks/month for the AMI and EBS storage. Maybe $1/mo for the periodic boots of the DB server.

That would jump considerably when you go live to the EC2 instances. I don't know how much IO you incur, so I'm going to make a wild guess, maybe $200-$500/mo. Assume you take your time recovering, order a new hard drive, and return to normal operation after a week. The cost of the outage would be $50-$125 plus the cost of the new hard drive, a few minutes of your time for the switchover, and whatever time you'd spend anyway on recovering your system. But you'd be able to do it without the downtime and accompanying pressure.
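The outage-cost figure above is just the monthly guess prorated over a week. A small Python sketch of that arithmetic follows; it assumes a month of four weeks (28 days), which is what makes a week exactly a quarter of the monthly rate, and the function name `prorated_cost` is made up for illustration. The $200-$500/mo inputs are the wild guess from the post, not measured costs.

```python
def prorated_cost(monthly_cost: float, outage_days: float,
                  days_per_month: float = 28.0) -> float:
    """Cost of running for outage_days at a given monthly rate."""
    return monthly_cost * outage_days / days_per_month


# One week of running live on EC2 at the guessed monthly rates.
low = prorated_cost(200, 7)
high = prorated_cost(500, 7)
print(f"${low:.0f}-${high:.0f} for a one-week outage")
```

With a 30-day month instead, the same week comes out slightly lower, so the $50-$125 range is best read as a round-number estimate.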

I spend about $75/mo on my personal setup, which I run 24/7, but I don't have a lot of IO, and only one EC2 instance.

As for reliability -- I don't have enough reliability data to fully address your concern. Anecdotally -- I haven't seen any EC2 outages yet, in several months of operation, nor on their status pages. But the key to EC2 is that if your instance fails, you can just launch a new one. You can snapshot your EBS volumes, and re-slurp your database logs, so even if you lose an EBS volume, you can recover quickly.

The big cost I see is the time and learning up front.

A big benefit is that you're protected against all forms of onsite failure.

There are approaches to improving reliability without going to the cloud. They involve more hardware, so more up-front out-of-pocket costs, but perhaps less setup time and learning, and perhaps less operating costs.

But Amazon has the advantage on operating costs -- they can operate the same hardware for less than you can. And buy it for less, too. You tend to win if you have surplus hardware lying around, and cheap electricity, and already enough bandwidth, rack space, etc.

Amazon also has a new Relational Database Service (RDS), which is basically MySQL that they run for you. I haven't explored it yet, so I don't know if it would be a viable alternative to running your own MySQL instance. But if you run MySQL, you could consider setting up an RDS slave to mirror your data.

Does any of this sound like it might be helpful?

JohnG · 11-17-2009, 05:58 PM

Thanks for specifics -- far more helpful than a generalized comment.

Much there that I can investigate further to examine how it might work in our current setup. Everything is always a cost+time/benefit ratio. Keeping in mind, too, this is a hobbyist website.

Despite the recent incident, we still run at 99.999% reliability and this was the first significant downtime in 4 years with zero data loss. Whenever there's zero data loss, I'm a happy camper.

John
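As a rough cross-check on the uptime figure, an availability percentage can be converted to a downtime budget with a little arithmetic. The Python sketch below makes two assumptions: a 365.25-day year, and an outage of about 4.5 hours, which is simply the sum of the three durations mentioned in the first post (an hour for the backup, 2 hours for the reinstall, an hour and a half for the restore). The actual total downtime was not stated, and the function names are made up for illustration.

```python
MINUTES_PER_YEAR = 365.25 * 24 * 60


def allowed_downtime_minutes(availability_percent: float, years: float) -> float:
    """Downtime budget, in minutes, implied by an availability percentage."""
    return years * MINUTES_PER_YEAR * (1 - availability_percent / 100)


def availability_percent(downtime_minutes: float, years: float) -> float:
    """Availability percentage implied by a total amount of downtime."""
    return 100 * (1 - downtime_minutes / (years * MINUTES_PER_YEAR))


budget = allowed_downtime_minutes(99.999, years=4)
observed = availability_percent(4.5 * 60, years=4)  # assumed 4.5-hour outage
print(f"five-nines budget over 4 years: {budget:.1f} minutes")
print(f"availability with one 4.5-hour outage: {observed:.3f}%")
```

Under those assumptions, five nines over 4 years leaves roughly 21 minutes of downtime, while a single 4.5-hour outage in 4 years works out to roughly 99.987%.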
