KFF Outage Over the weekend and part of today

Administrator · September 24, 2018, 08:10:15 PM

I am sorry for the outage of the board, but the issue came from my host company. I chose them because of their support and protection services. I certainly could get cheaper hosts. But I think they have learned a lot from the issues that they have had to deal with over the last several days. So I think right now is not the time to jump ship.

Here is a copy of one of the letters I received:

Hello Nancy,

While I was hoping to save some of this for the official RFO [Reason For Outage] - enough people are getting tremendously upset over this that I'm going to spell out what I can now - keeping in mind that I will provide more details when I can.

**What happened?**

First and foremost - this failure is not something that we planned on or expected. A server administrator, the most experienced administrator we have, made a big mistake. During some routine maintenance where they were supposed to perform a _file system trim_ they mistakenly performed a _block discard_.

**What does this mean?**

The server administrator essentially told our storage platform to drop all data rather than simply dropping data that had been marked as _deleted_ by our servers.
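
As a rough illustration of the difference described above, here is a toy Python model. It is only a sketch: the dictionary "volume", the `in_use` flag, and both function names are made up for this example, and the letter does not say which tools or commands were actually involved (on Linux, `fstrim` and `blkdiscard` are examples of the two kinds of operation).

```python
# Toy model only: a "volume" maps block number -> {"data", "in_use"}.
# Real trim/discard operations act on actual storage devices, not dicts.

def file_system_trim(volume):
    """Release only the blocks the file system has marked as deleted."""
    return {n: blk for n, blk in volume.items() if blk["in_use"]}

def block_discard(volume):
    """Release every block, whether or not it still holds live data."""
    return {}

volume = {
    0: {"data": "site files", "in_use": True},
    1: {"data": "old upload", "in_use": False},  # deleted earlier
    2: {"data": "database", "in_use": True},
}

after_trim = file_system_trim(volume)
after_discard = block_discard(volume)

print(sorted(after_trim))     # live blocks survive a trim
print(sorted(after_discard))  # nothing survives a full discard
```

In this sketch the trim keeps the two live blocks and drops only the deleted one, while the discard leaves nothing behind.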

**Why is restoration taking so long?**

Initially we believed that only the primary operating system partition of the servers was damaged - so we worked to bring new machines online to connect to our storage to bring accounts back online. Had our initial belief been correct - we'd have been back online in a few hours at most.

As it turns out our local data was corrupted beyond repair - to the point that we could not even mount the file systems to attempt data recovery.

Normally we would rely on snapshots in our storage platform - simply mounting a snapshot from prior to the incident and booting servers back up. It would have taken minutes - maybe an hour at most. However, snapshots were disabled. I wish I could tell you why - and I wish I knew why - but we don't know yet and will have to look into it.

We are working to restore cPanel backups from our off-site backup server in Phoenix, Arizona. While you would think the distance and connectivity were the issue - the real issue is the amount of I/O that backup server has available to it. While it is a robust server with 24 drives - it can only read so much data so fast. As these are high capacity spinning drives - they have limits on speed.

Our disaster recovery server is our **last resort** to restore client data and, as it stands, is the _only_ copy we have remaining of all client data - except that which has already been restored which is back to being stored in triplicate.

**What will you do to prevent this in the future?**

As we've been working on this and running into issues getting things back online quickly, we have been discussing what changes we need to make to ensure both that this doesn't happen again and that we can restore quicker in the future should the need arise. I will go into more detail about this once we are back online.

**We are sorry - we don't want you to be offline any more than you do.**

Personally I'm not going to be getting any sleep until every customer affected by this is back online. I wish I could snap my fingers and have everybody back online or that I could go into the past and make a couple of _minor_ changes that would have prevented this. I do wish, now that this has happened, that there was a quick and easy solution.

I understand you're upset / mad / angry / frustrated. Believe me - I am sitting here listening to each and every one of you about how upset you are - I know you're upset and I am sorry. We're human - and we make mistakes. In this case **thankfully** we do have a last resort disaster recovery that we can pull data from. There are _many_ providers that, having faced this many failures - a perfect storm so to speak - would have simply lost your data entirely.

This is the **first** major outage we've had in over a decade and while this is definitely major - our servers are online and we are actively working as quickly as possible to get all accounts restored and back online. For clarity - the bottleneck here is not a staffing issue. We evaluated numerous options to speed up the process and unfortunately short of copying the data off to faster disks - which we did try - there's nothing we can do to speed this up. The process of copying the data off to faster disks was going to take just as long, if not longer, than the restoration process is taking on its own.

Once everybody is back online - and there are accounts coming online every minute - we will be performing a complete post-mortem on this and will be writing a clear and transparent Reason For Outage [RFO] which we will be making available to all clients.

I hope that you understand that while this restoration process is ongoing there really isn't much to report beyond, "Accounts are still being restored as quickly as possible." I wish there was some interesting update I could provide you like, "Suddenly things have sped up 100x!" but that's not the case.

I am personally doing my best to make sure that clients who have opened tickets are updated as to when their accounts are in the active restoration queue. While we do have thousands of accounts to restore - our disaster recovery system actually transfers data substantially faster with fewer simultaneous transfers. While it sounds counter-intuitive - we're actively watching the restoration processes and balancing the number of accounts being restored at once against the performance of the disaster recovery system to get as many people back online as quickly as possible.
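
The trade-off described above can be sketched with a small Python model. Everything in it is an assumption made for illustration: the function names, the 1000 MB/s array rate, the 100 MB/s per-transfer cap, and the 10% contention penalty per extra transfer are hypothetical and do not come from the letter. The idea is only that spinning drives lose time seeking when many transfers read at once, so total throughput can peak at a moderate number of transfers.

```python
# Hypothetical model, not measured data: all numbers below are made up.

def aggregate_throughput(streams, array_mb_s=1000.0,
                         per_stream_mb_s=100.0, penalty=0.10):
    """Estimated total MB/s for a given number of simultaneous transfers."""
    if streams < 1:
        return 0.0
    # Each transfer is capped on its own, so a few transfers scale linearly.
    demand = streams * per_stream_mb_s
    # Extra simultaneous readers add seek contention on spinning drives.
    supply = array_mb_s / (1.0 + penalty * (streams - 1))
    return min(demand, supply)

def best_stream_count(candidates):
    """Pick the candidate count with the highest estimated throughput."""
    return max(candidates, key=aggregate_throughput)

for n in (1, 4, 7, 16, 32):
    print(n, round(aggregate_throughput(n), 1))

print("best:", best_stream_count(range(1, 33)))
```

With these made-up numbers the estimate peaks at a handful of transfers and falls off as more are added, which matches the behaviour the letter describes without claiming to reproduce the real system.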

Most sites are coming back online after restoration without issues, however, if once your account is restored you are still having issues - we are here to help. While we are quite overwhelmed by tickets like, "WHY IS THIS NOT UP YET!?!?!" "WHY ARE YOU DOWN SO LONG!?!??!!" "FIX THIS NOWWWW!" - we are still trying to wade through all of that to help those that have come back online and are having issues - as few and far between as it has been.

If you have any questions - we will definitely answer them - but please understand that while we're restoring accounts we're really trying to focus on the restoration of services as well as resolving issues for those that are already restored.

Again - I am sorry for the trouble this is causing you - we definitely don't want you offline any more than you do and will have all services restored as quickly as we can.

Sincerely,

Michael Denney
MDDHosting LLC - Professional Web Hosting Solutions
https://www.mddhosting.com/ - Rate us @RateLobby!

MDDHosting, LLC | 5231 E State Road 144, Mooresville, IN 46158 United States

© MDDHosting, LLC. All rights reserved

The one thing I am relieved about is that it happened at the time of the year when the board is slowing down versus April-June or July when it is more active.

In case you are wondering, the bill for the domain name and hosting comes due in March or April. Can't remember exactly. So it wasn't that I didn't pay the bill.

When communicating with members, I usually use the PM feature, but I need to make a list of at least the most active members' email addresses. I was at a loss how to communicate with you about the problem.

Sorry for the inconvenience.
Nancy

getthenet · September 25, 2018, 12:31:11 AM

Not a problem, Nancy. Wasn't your fault! You do a great job keeping this forum going! Thank you!

Fort Wisers · September 25, 2018, 08:10:03 AM

No problem at all, these things happen and as mentioned, out of your control. The board is well run IMHO.
Take care and thanks for all the hard work.

Greg · September 25, 2018, 08:29:47 AM

Don't worry Nancy, outages happen - what is important is that they learn from it and implement tighter controls and processes - from what I read, they seem to have had a perfect storm of 3 or 4 different issues or areas that need better attention.

This letter certainly seems open and transparent, and their promise to deliver a more detailed RFO/post-mortem once they are back online is encouraging.

thx
Greg

Oarin · September 25, 2018, 05:06:00 PM

Nancy, you've got nothing to feel bad about. We're lucky to have you and your work keeping the site here.

T-Bone · September 26, 2018, 12:40:18 PM

No worries. I did notice it was down, but that happens from time to time...nice to see no lasting damage or loss of content.

And like the others stated...thanks for your time and efforts supporting this board...

johnny walleye · October 07, 2018, 04:11:00 PM

THANKS FOR ALL YOUR HARD WORK AS ALWAYS