IT professionals talk about disaster recovery in response to the mega-storm. Plus: 10 lessons learned for the next natural disaster.
Read about disaster recovery during Hurricane Sandy, or skip to this article's concluding section, 10 Disaster Recovery Lessons from Hurricane Sandy
A storm as massive as Hurricane Sandy challenges pretty much everything: business disaster recovery plans; the preparedness of even top IT professionals; the integrity of seawalls; the head-in-the-sand, anti-science beliefs of global-warming deniers . . . and the list goes on and on.
In the last dozen years or so, we’ve seen a number of major disasters that highlight the vital importance of a workable, well-thought-out disaster recovery (DR) plan. Yet after each disaster, it seems that businesses must go back to the drawing board and relearn the lessons they claimed to have mastered the last time around.
If Hurricane Katrina or the earthquake and tsunami behind the Fukushima disaster didn’t drive home the point that IT backup sites must be geographically separate from primary sites, nothing will. Yet after Hurricane Sandy hit, plenty of businesses realized, only after the fact, the cost of having backup facilities less than an hour’s drive from their primary sites.
Doing a post-mortem on Sandy feels like a textbook case of déjà vu all over again, but once you get past the lessons we should have learned by now, new and more subtle ones emerge. Here are some of the disaster recovery (DR) and business continuity (BC) stories we’ve gathered in the aftermath of Hurricane Sandy, as well as the key takeaways businesses have drawn from those experiences.
Huffington Post: Scrambling Before the Election
For the Huffington Post, the 2012 election was a peak event. Hordes of readers turn to HuffPost for political analysis and opinion, and Election Day promised to be one of the site’s biggest days for traffic and page views.
And then along came Sandy, slamming into New York City eight days before Election Day.
Soon after the storm hit, HuffPost’s New York City-based data center near Battery Park flooded, bringing down the site. As HuffPost’s IT team worked frantically to switch over to their backup site in Newark, NJ, they had to cope with even more failures.
HuffPost was seemingly well protected. Between New York and Newark were three separate data circuits, for failover and redundancy – but all three went down.
“We keep learning the same lessons over and over again after these disasters, don’t we?” said John Pavley, CTO of the Huffington Post.
As they attempted to get the site back online in a Washington, D.C. data center, Pavley and his team realized that data replication and the re-synching of data stores always take longer than originally planned. “You look at specs of the machines, the network. You look at cables, and do a little testing, but inevitably you’re way off.”
What HuffPost thought would take a day took a week. “A lot of times we test business continuity plans under very favorable conditions. We test at night when traffic is low, or we test when the key IT people aren’t stressed out. Whether it's human or natural factors, these things add up and recovery takes significantly longer than we planned for,” he said.
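Pavley’s point about recovery estimates can be made concrete with some back-of-envelope arithmetic. The sketch below uses entirely invented numbers, not HuffPost’s actual dataset or circuits, to show how a re-sync that looks like an afternoon’s work on the spec sheet stretches once real-world throughput losses are factored in:

```python
def resync_hours(dataset_tb: float, link_gbps: float, efficiency: float) -> float:
    """Hours to copy dataset_tb over a link_gbps link at a given
    effective efficiency (protocol overhead, contention, retries)."""
    dataset_bits = dataset_tb * 8 * 10**12        # TB to bits (decimal units)
    effective_bps = link_gbps * 10**9 * efficiency
    return dataset_bits / effective_bps / 3600

# On paper: 20 TB over a 10 Gbps circuit at 90% efficiency, about 4.9 hours.
paper = resync_hours(20, 10, 0.90)

# Under load, with verification passes and restarts, 25% effective
# efficiency is common; the same copy now takes about 17.8 hours,
# before counting any application-level re-indexing.
reality = resync_hours(20, 10, 0.25)
```

The gap between those two figures, roughly a factor of four here, is the same kind of gap that can turn a one-day estimate into a week.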
Then when another storm (a Nor’easter) was bearing down on New York shortly after Sandy, and with the election looming, the HuffPost IT team worked day and night to get a full site up and running in their Washington, D.C. data center. They pulled off this Herculean feat, barely, and on the highly trafficked day of the election, HuffPost experienced no problems.
Yet when the IT team had time to catch its breath, they realized that they just barely dodged another bullet. Washington, D.C., after all, felt the effects of Sandy too, albeit to a much lesser extent. Of course, Pavley and his team can’t be blamed for thinking that Washington, D.C. is regionally distinct from New York. That may no longer be the case, but only because we’re entering a new era of extreme weather.
Another lesson that Pavley and his team learned is that our communications infrastructure is not nearly as resilient as it should be.
“We as an industry have become a part of Maslow’s hierarchy of needs,” Pavley said. That is, the Web is now essential. Weather and traffic Web sites, which users accessed to see which evacuation routes were open, proved to be critical lifelines.
“This [online media] is how people communicated into and out of devastated areas and learned about what was going on. This isn’t just entertainment anymore. This is how people live their lives. This is a responsibility, which before now, we didn’t realize we had. Media sites have a responsibility, and carriers do too, and we all have to figure out how to ensure that we’ll do a better job next time,” Pavley said.
Key points: Regional geographic variation is taking on a different meaning now, and backup sites must be moved farther away than we all thought; data recovery will take longer than you expect; critical communications infrastructure is not nearly as resilient as it should be.
BUMI: Submerged Under Seawater
You’d think that if anyone would be ready for a storm like Hurricane Sandy, it would be a disaster recovery firm like BUMI. The BUMI internal DR team held a conference call the day before Sandy hit New York and determined that this storm had the potential to be really, really bad.
“To get through this storm, we realized that we had to become customers of ourselves. We invoked the highest-level DR plan that we offer for our customers,” said Jennifer Walzer, CEO of BUMI. The night before the storm hit, the team decided to move their online presence to data centers in Toronto and Vancouver.
It was a good thing they did. When Sandy blasted New York City, BUMI’s office building was submerged in 35 feet of seawater. They lost power and as of late November still had no access to their office building. But because their servers failed over to a data center in Canada, they were able to remain operational in spite of not having a physical office.
Additionally, BUMI’s servers were housed at a Verizon central office that was completely submerged, destroying all of its copper wiring. BUMI may end up having to relocate both its office and its central data center.
BUMI relied on VoIP phones, so once communications started to fail, the VoIP system failed over to employee cell phones. “That was the good news,” Walzer said. “The bad news is that we didn’t think to have Cisco phones in everyone’s homes.”
Walzer told BUMI’s operations manager to FedEx Cisco VoIP phones to employees who needed them, but for a few days, all of BUMI’s customer support – and this was a peak period for support, obviously – was handled via cell phones.
“That was a big lesson for us,” Walzer said. “It’s not enough to just be able to work from home, you have to be able to work from home as if you were working from the office.”
BUMI was able to keep all of its customers happy, yet working as hard as they did took its toll on staffers.
“You need to be able to lean on your coworkers, and that’s just not something you plan for,” she added. With BUMI lacking physical office space for the moment, employees felt out of the loop. They needed face time with coworkers, so they started meeting up once a week in the city.
This need for employee communication isn’t something you typically plan for ahead of time, but if you don’t realize it as you’re coping with a disaster, employee performance can suffer.
Key points: You need to plan for being out of your office for weeks or months, not just days; being ready to work from home doesn’t mean you have a “home office”; the social connections among coworkers are important and must be supported.
SGFootwear: Choosing a Backup Site
Shoe manufacturer SGFootwear was another organization that learned how dangerous it is to have your backup site in the same geographical region as your primary one.
SGFootwear’s main site is located in Hackensack, NJ, while its backup site is a mere 12 miles away in one of its warehouses in Kearny, NJ. Sandy brought down all the communications lines to the main site, but otherwise there was no damage to the building. However, the backup site was flooded.
“We originally chose the backup site for easy access,” said Gregg Asch, Director of IT at SGFootwear. For critical backups, the company uses the backup solution from Nimble Storage. “We figured that if anything went wrong, it wouldn’t take long to get down to Kearny, grab the Nimble device, bring it back to our main site, and we’d be back up in no time. Obviously, if we would have studied flood patterns, we would have made a different choice.”
Now, SGFootwear plans to move its backup site to a more geographically distinct location, with the eventual goal of having a mirror site across the country in Los Angeles. The other lesson that Asch hopes his organization learns is the value of mobility. Currently, SGFootwear discourages telecommuting and is wary of BYOD – but this needs to be reconsidered. “This is something I’ll be stressing as we do the post-mortem on this event,” Asch said. “We should be more decentralized and less reliant on any one location.”
One final point Asch stressed was the value of virtualization. Their Nimble environment is a virtualized deployment. “If our main site had been trashed and the backup survived, we would have been able to move the Nimble device, spin up the VMs and get going. You don’t need like equipment in both sites. Virtualization makes the whole process easier,” he said.
Key points: Choose sites to be not just geographically but topologically distinct; telecommuting and BYOD should be part of your DR plans; virtualization makes recovery easier.
Dimension Data: Helping Employees
Having employees prepped for a disaster is about a lot more than spinning up new servers in the event of a failure. IT services company Dimension Data’s office in Framingham, MA was affected by the hurricane, though not nearly as badly as many others. The office lost power but rolled over to a generator and was fine.
However, employees were working long hours taking customer-service calls. For some, this meant that Dimension Data rented hotel rooms across the street from the office so they could get some quick rest. For others, it meant equipping their home offices so those locations could stay online.
“Our corporate strategy is to give employees as much flexibility as possible,” said Darren Augi, VP of IT for Dimension Data. “Our workforce can really work from anywhere. Our home office employees are just as ready to go in a disaster as the in-house ones are.”
The reason the remote employees are just as ready is because Dimension Data made this a priority, by ensuring they had redundant means of communications, and even going so far as to issue portable generators suitable for powering sensitive electronic devices.
One thing Sandy drove home, though, is that communications infrastructure in the U.S. is falling behind other developed countries. “In a disaster scenario, 3G is just not enough for employees,” Augi said.
The other thing Augi stressed is that organizations must remember that employees are making major sacrifices to work during this time. Many of them have been affected, and you have to respect that.
Key points: Plan to help employees get through the hard work as they deal with the disaster by, for instance, providing nearby hotel rooms; give remote workers the tools they need to stay online; remember that it’s not just your organization and customers struggling through the disaster, but your employees as well.
iCIMS: Protecting Internal and External Resources
SaaS provider iCIMS managed to keep its talent acquisition solutions online despite the storm, with none of its 1,300 clients losing service. With fully redundant data centers in North America, the UK and China, its decentralized IT model served it well.
However, the big lesson learned was that it needed to provide its internal business applications with the same level of protection as its external applications.
With corporate offices in Matawan, NJ, just a couple of miles from the Atlantic shore, damage to an electrical substation meant that the offices were without power for a week.
“Even though we had a week-long disruption, we managed to keep our sales operations in place enough that we continued to add customers,” said Naveed Chaudhry, a network admin at iCIMS. “However, we suffered on our internal applications.”
iCIMS discovered that conducting regular full backups of corporate data and transporting tapes offsite every week may not be a failsafe strategy in the face of Mother Nature. “We thought this was foolproof but it didn’t work, as we didn’t have power,” Chaudhry said.
He noted that an on-site generator wouldn’t work for a company the size of iCIMS. A generator would provide power in a pinch, but it would require upkeep and wouldn’t really be economically feasible. In any case, conditions were so bad that even those attempting to commute to the office faced impassable roads.
Therefore, he coped by helping personnel operate remotely. The company already had a number of remote staffers, and they largely carried the load. But even this presented hurdles. The company had half a terabyte of data in the cloud with Zetta, an enterprise-grade cloud storage provider. After talking with Zetta, Chaudhry arranged for the backup service to overnight a backup disk, as that would be faster than trying to download the data over the Internet.
Initially, the plan was to set up a new server and restore the entirety of the corporate data at another location where power was available. But tight time constraints required the company to prioritize what got restored.
“When we got the disk, we decided to only restore those portions our remote users required to operate,” he said. Moving forward, the plan is to emulate its external application strategy.
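The triage Chaudhry describes, restoring only what remote users need within a time budget, can be sketched as a simple greedy plan. The share names, sizes, and priorities below are invented for illustration:

```python
# Greedy restore triage: walk shares in priority order, restoring each
# one that still fits in the remaining time budget. All names, sizes,
# and priorities are hypothetical.

PRIORITY = {"email-archive": 0, "sales-crm": 1, "hr-records": 2, "media-assets": 3}

def plan_restore(shares, hours_available, restore_gb_per_hour):
    """Return (restore now, defer) lists given a restore-rate budget."""
    budget_gb = hours_available * restore_gb_per_hour
    plan, deferred = [], []
    for name, size_gb in sorted(shares, key=lambda s: PRIORITY[s[0]]):
        if size_gb <= budget_gb:
            plan.append(name)
            budget_gb -= size_gb
        else:
            deferred.append(name)
    return plan, deferred

shares = [("media-assets", 300), ("email-archive", 40),
          ("hr-records", 25), ("sales-crm", 60)]

plan, deferred = plan_restore(shares, hours_available=8, restore_gb_per_hour=20)
# With a 160 GB budget, the three smaller high-priority shares are
# restored first and the 300 GB media archive waits for later.
```

The same idea applies whether the bottleneck is a shipped disk, a WAN link, or tape: decide the priority order before the disaster, not during it.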
“Our best option is to have our core corporate services outside of the office – move all critical applications at headquarters to a data center and make them fully redundant like our SaaS apps,” Chaudhry said. “The lesson we learned is to decentralize our internal operations just like our external ones.”
Key points: Internal operations must be covered by DR plans as well as external data and apps – decentralization is valuable; in a pinch, your cloud backup provider can send you a disk for expedited restoration; a backup generator may or may not provide necessary ROI.
Magtype/CR: Multiple Backup Apps
Magtype/CR provides computer support for companies in southwestern Connecticut. When news of the storm came, Erik Shanabrough, an IT Support Technician for the company, decided to get ahead of the game. He and his support colleagues contacted their customers several days in advance to work out a disaster recovery plan covering what to do if the power went out. Full backups were created and moved offsite.
“Where possible, we went to locations prior to the storm and grabbed redundant copies of the data to take to a second location – we stored the backups in a waterproof safe,” Shanabrough said. “At one location we physically moved the server away from a windowed area to the center of the office.”
In some cases he uses CrashPlan Pro for offsite backups, and in others Acronis, SyncBackSE, Retrospect or the built-in Windows Server backup software. For Macs, he uses Carbon Copy Cloner for imaging and ChronoSync for files.
Some customers fared better than others. One business has offices situated about 100 yards from the coast. The first floor flooded. Fortunately, the IT equipment was housed on the upper floors. However, power was lost and the building closed for two days. Power was restored on the second day and, in the meantime, staff could access email.
“Power was out a little over a day with multiple servers offline but users were able to access FuseMail to get incoming mail and access the last 14 days of traffic,” Shanabrough said.
Key points: Remember the basics: plan ahead of time, communicate with colleagues, use offsite locations, and back up early and often.
Disaster Recovery Overview: IT and More
While this story’s focus is on IT disaster recovery, as we did our reporting work we also heard some very compelling stories that weren't strictly IT. For instance, the staff of Satellite Dialysis, in Hamilton, New Jersey, made sure to call patients daily to check on their status. Missing a dialysis treatment can be fatal, so despite four days without electricity, Internet, phone or heat, the Satellite staff found ways to get a hold of their patients. Meanwhile, they had to keep medicine refrigerated offsite, navigate dialysis centers with flashlights, and even ship supplies to a patient stranded in Florida. There were many more stories of businesses coping like this, of going the extra mile.
On the flip side, many of the organizations hit the hardest simply didn’t want to talk about their IT failures. And many were suffering from “Sandy fatigue,” as the PR representative of Internap wrote to us. (In lieu of an interview, he pointed us to Internap’s blog, which includes Sandy-related entries.)
Some of the additional takeaways we learned as we spoke to many, many businesses about Hurricane Sandy include:
• People have a keen sense of perspective during events like these. No boss we heard about asked someone coping with a completely flooded home to come to work before dealing with family issues.
• It may sound like a Lifetime movie cliché, but people really do come together during disasters, and those who don’t pitch in lose the trust of their colleagues going forward.
• Your customers will be understanding. We heard this time and again. Customers didn’t expect miracles; they just wanted to know how to move forward.
• If you devote the time and resources necessary to come up with a truly workable DR plan, your life will be much, much easier when the next disaster hits.
• However, if you don’t train your employees on that disaster recovery plan, your advance work may be wasted. Catherine M. Lepone, Director of Development at the Making Headway Foundation in Chappaqua, NY, was able to work through the storm by logging on to servers remotely to check email and voice mail. Yet she noted, “I found out I was the only one who knew how to do that.” Do all your employees understand this basic technology?
• If you dodged a bullet this time, don’t push your luck. Extreme weather events are becoming more and more common. After a year in which a third of the country experienced forest fires, nearly as much was affected by drought, and a huge swath ended up under water, denying the effects of climate change puts you in the same neighborhood as Flat Earthers. Remove that tinfoil hat and start preparing now, before it’s too late.
10 Disaster Recovery Lessons from Hurricane Sandy
Now that time has passed since Hurricane Sandy struck the East Coast, businesses in the region have had time to assess the damage and begin the recovery process. Some escaped relatively unscathed, not even losing power during the storm; others will take months to recover – if they are able to do so at all.
Economists warn that Sandy could be one of the costliest hurricanes in U.S. history – even though it was far from the most powerful to hit the country. At the time of writing, FEMA had already approved $844 million in assistance. New Jersey Governor Chris Christie's office has estimated that the cost to New Jersey's economy alone will reach $29.4 billion, and New York Governor Andrew Cuomo predicts $33 billion in costs for his state.
So what disaster recovery lessons can IT managers learn from this costly disaster? What can those in other parts of the country do now to make sure they're ready for a similar event?
We talked with some IT professionals who experienced the storm and its aftermath firsthand to see what disaster preparedness and recovery advice they would offer. They suggested ten lessons IT managers should take away from the event.
1. Your Business Isn't Safe
Headquartered in Melville, NY, on Long Island, FalconStor, a data protection and storage virtualization vendor, experienced Sandy firsthand. Ralph Wynn, senior product marketing manager for the company, said one of the key things IT managers should learn from the storm is that "Your business is not safe, no matter where it is located."
Hurricanes don't hit New York and New Jersey very often, but Sandy demonstrated that "not very often" isn't the same as "never." Even if you feel safe in your particular location, that doesn't mean you actually are safe. According to Wynn, disaster recovery planning typically doesn't happen until someone at a business recognizes the truth that a disaster is going to happen – it's just a matter of when.
2. Plan, Plan, Plan (and Then Plan Some More)
People on the ground in New York said that, with a few notable exceptions, large enterprises did have adequate disaster recovery plans in place. In fact, publicly traded companies are required to have such plans in order to meet their compliance requirements.
But for small businesses, the story was a lot different.
Joe Hillis is the operations director for the Information Technology Disaster Resource Center (ITDRC), a non-profit made up of IT professionals who go into disaster areas and help people recover. When asked how many small business IT departments typically have disaster recovery plans, he said, "I've not met a one yet in my career that has one."
Hillis urged the IT managers at those small firms, "Put a plan in place. I don't care if you write it on the back of a napkin. The main thing is, identify what's important to your business and find a place to keep it safe offsite somewhere... Just have a plan to know what you're going to do."
"Plan, plan, plan," agreed Wynn. "Ask the critical questions – what happens if the administrative staff can't get to the office? What are our fail-safes? Do we have a way to move all the operations to another location? Is that alternate location also going to be affected by the event that's taken place? Do we have written guidelines on who is in charge of what? That type of planning needs to happen, and it needs to come from the top down."
3. Test Your Solutions
Disaster recovery planning alone isn't enough. The experts agreed that companies need to test the technology they plan to use in a disaster situation in order to make sure that it will perform as intended.
Sean Hull, an independent scalability consultant located in Manhattan, works with several data centers and large enterprises in the area. He noted that companies need "to do fire drills with different scenarios to find out if we lost this and we lost this, what would happen? How long would it take us to rebuild? Would we have all our data? Could we move it to another data center?"
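The "fire drill" Hull describes can be sketched as a scenario check: enumerate site-loss scenarios and flag every critical service left with no surviving replica. The sites and services below are hypothetical:

```python
# Tabletop drill sketch: which services go dark if a set of sites is
# lost at once? Service and site names are invented for the example.

SERVICES = {
    "web":      {"nyc", "dc"},
    "database": {"nyc", "dc"},
    "email":    {"nyc"},          # single-homed: a drill should flag this
}

def drill(lost_sites):
    """Return the services left with no surviving replica if the
    given sites go dark simultaneously."""
    return sorted(svc for svc, sites in SERVICES.items()
                  if not (sites - set(lost_sites)))

# Losing NYC alone strands email; losing both sites strands everything.
```

A spreadsheet version of this check takes an afternoon to build, and it reveals single-homed services before a storm does.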
Wynn noted that, since Sandy, FalconStor has seen an uptick in customers who want to use the firm's testing services. "They want to have a mechanism in place that allows them to run a stress test of their environments, but they don't have the equipment or the manpower to do that on their own on a regular basis," he explained.
"So they are looking to vendors such as FalconStor to provide some way for them to use their existing infrastructure or give them a mechanism where they can actually test failing over services such as Exchange or CRM through Oracle over to another building, another site, or even moving them from physical environments to virtual environments."
4. Consider the Cloud
Public cloud providers are becoming an integral part of disaster preparedness for firms of all sizes. Hillis noted that in the case of small businesses impacted by Sandy, "If someone had a cloud-based solution, they could at least go to a hotel in New Jersey or Washington or somewhere where they had power and still resume their computer operations. I'm a big proponent of doing that. It doesn't make sense for everybody, but there are a lot of businesses that it does make sense for."
Wynn and Hull both said that mid-size and large enterprises should also look at using cloud providers – perhaps even multiple cloud providers, as part of their disaster planning.
5. Don't Put All Your Eggs In One Basket
Experts recommend that companies have a failover site and/or the capability to move critical operations to the cloud so that work can continue after a disaster. The best plans often use a combination of both and include multiple sites and vendors.
Hull recommended, "Look at using multiple providers. Don't put all your eggs in one basket, so to speak. That can really make a big difference. If a hosting service like Amazon has an outage maybe in their northern Virginia data center, it can take out a number of Internet businesses. If those firms have assets and code that's also hosted at, say, Joyent or Rackspace in parallel, then an outage that affects one provider most likely wouldn't affect more than one of them."
Similarly, Wynn recommends "a two-prong approach – having your own facility or maybe leasing a facility and then a vendor such as Amazon, Rackspace, maybe partially move it to the cloud to a data center that's located somewhere else geographically."
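One simple way to picture the multi-provider approach is a client-side selector that walks an ordered list of deployments and uses the first one whose health check passes. The provider names below are hypothetical, and real deployments typically implement this with DNS failover or a global load balancer rather than application code:

```python
# Minimal multi-provider failover sketch. Providers are tried in
# preference order; a failing or erroring health probe means "down".

def pick_provider(providers):
    """providers: ordered list of (name, health_check) pairs, where
    health_check() returns True if that deployment is serving."""
    for name, health_check in providers:
        try:
            if health_check():
                return name
        except Exception:
            continue              # treat a failing probe as down
    raise RuntimeError("no provider is healthy")

# Simulate an outage at the primary: traffic falls through to the
# parallel deployment at the second provider.
providers = [
    ("amazon-us-east", lambda: False),   # regional outage
    ("rackspace",      lambda: True),
]
# pick_provider(providers) returns "rackspace"
```

The hard part in practice isn't the selection logic but keeping assets, code and data synchronized across providers so the fallback is actually usable.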
6. Locate Your DR Site Strategically – But Be Aware of Latency
As noted earlier in this article, a failover site does you no good if it gets hit by the same disaster that strikes your primary site. "If you are on the East Coast and you are by the water, you should not use a facility that's thirty miles away," explained Wynn. "You should be looking at a couple of hundred miles or maybe another state to your west that's centrally located. If you have the capability and the dataset to do so, maybe look at something on the other side of the United States."
The difficulty with using a failover site that's located far away is that latency can become an issue. Wynn advised that IT departments carefully consider their SLAs and think about how quickly IT operations need to be back up and running in case of a traumatic event. When considering a failover site or a cloud-based solution, they'll need to find a balance between optimum safety and minimal latency.
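A rough rule of thumb helps quantify that balance: light in fiber covers about 200 km per millisecond, so each 100 km of separation adds roughly a millisecond of round-trip time before routing and equipment overhead:

```python
# Best-case round-trip latency over fiber as a function of distance.
# Real paths add routing hops and rarely run point-to-point, so actual
# figures will be higher; this is a lower bound for planning.

def min_rtt_ms(distance_km: float) -> float:
    """Theoretical best-case round-trip time over fiber."""
    speed_km_per_ms = 200.0       # roughly 2/3 of c in glass
    return 2 * distance_km / speed_km_per_ms

# A site 50 km away adds ~0.5 ms RTT; a cross-country site 4,000 km
# away adds ~40 ms, enough to matter for synchronous replication.
```

That is why many plans replicate synchronously to a nearby site and asynchronously to a distant one, trading a little potential data loss for geographic safety.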
7. Expect Some Surprises
No matter how well you prepare, it's nearly impossible to anticipate everything that will occur after a disaster.
One of Hull's clients, Prometheus Global Media, has a data center in New York. "They had power generators on the roof, as it turns out, so they were prepared," he recalled.
However, despite their foresight, Prometheus did run into a few unforeseen snags. In order to run, their generators needed fuel, of course. But when the power went down, the elevators weren't working, making it a lot tougher to get the fuel to the roof where it was needed.
Hull, who advises data centers on disaster preparedness, among other topics, said that he had never considered the fact that so many electrical lines run through New York's subway tunnels and what that would mean in the case of a flood. He was surprised when Con Edison turned off the power to his home, which doubles as his workspace, in advance of the storm. The torrent of saltwater also caused an unexpected explosion at a Con Edison substation near his apartment on 14th St.
IT managers at companies of all sizes will need to be flexible and react to unexpected events like these as they occur.
8. Who’s In Charge? Coordinate and Communicate
The IT pros who dealt with Sandy also noted that during the initial response period, it was critical that employees knew who was in charge of a company's response and that people with recovery know-how were able to stay in touch with each other.
"Recovering from major outages like that is about coordination and communication – keeping the lines of communication open so that all the people with all the knowledge are able to take action, they don't have their hands tied," Hull said.
9. Don't Expect to Recover Right Away
Having been on the scene at numerous disasters, Hillis has observed that people often don't understand that things will be vastly different after a major event. As a result, businesses won't be able to recover as quickly as they might think.
For example, he said that business owners imagine that if their building is destroyed, they'll be able to rebuild within a few months – after all, it only takes three months to put up a structure. But after a disaster, local governments often stop issuing building permits for 60 to 90 days, and even when permit issuance does resume, there is often so much work that contractors are overwhelmed. It can be impossible to even find a construction crew. He said companies often need longer-term plans for dealing with the disaster.
In the aftermath of Sandy, FalconStor had employees off work or working remotely for a couple of weeks, despite the fact that its building sustained no notable damage. Although headquarters was operational, many employees had a difficult time getting in because of power outages and gas shortages near their homes. When employees did make it into the office, FalconStor ended the work day early so they could get home before dark.
10. Getting By With Help from Your Friends – And Strangers
"The small businesses think that they built their businesses by themselves, so they can recover by themselves," Hillis said. "But that's just not the reality."
All up and down the East Coast, individuals and small businesses found themselves relying on the generosity of friends, neighbors and strangers in order to make it through the experience. Fortunately, most people were more than willing to assist those in need.
As a Texan, Hillis came to the East Coast with some preconceived notions about "tough" New Yorkers, but he was amazed to see how people within the tech industry reached out to help each other.
"I've seen guys take public transportation three hours each way to go help somebody," recounted Hillis. "These are the IT guys who are helping us up here. They would take a train, they would take a bus, and then they would walk a mile and a half, whatever, just to come down and work with us for six or eight hours and then turn around and leave at 7:00 and redo the whole thing backwards. I can't say enough good things about the people up here. They're fantastic."
The ITDRC had about 50 volunteers working in New York and New Jersey, and Hillis said they were working with "hundreds" of other IT professionals from NY Tech Meetup to get individuals and small businesses the technical assistance they needed. They set up PCs, WiFi access points and other communications equipment at disaster recovery centers so that residents and small businesses could have access to the technology and connectivity they needed. They also provided equipment and installation assistance directly to small business owners.
Hillis concluded his comments on lessons from the hurricane with a plea: "We would love to have people involved. We would love to have their support financially and in kind as well. It's going to take quite some time to recover." Those within the IT industry who would like more information about assisting with disaster recovery can visit the ITDRC's website for more information.