Facebook: We Disconnected Our Data Centers From the Internet

By Rich Miller - October 5, 2021

The Facebook Fabric Aggregator, a system that manages data traffic between its data centers. (Photo: Rich Miller)

Facebook says a configuration error broke its connection to a key network backbone, disconnecting all of its data centers from the Internet and leaving its DNS servers unreachable.

The unusual combination of errors took down the web operations of Facebook, Instagram and WhatsApp in a massive global outage that lasted more than five hours. In effect, Facebook said, a single errant command took down web services used by more than 7 billion accounts worldwide.

Early external analyses of the outage focused on Facebook’s Domain Name System (DNS) servers and route changes in the Border Gateway Protocol (BGP), issues that were clearly visible in public Internet records. Those turned out to be secondary effects triggered by Facebook’s backbone outage.
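For readers who want a sense of what those outside observers were measuring, here is a minimal Python sketch of an external check: a plain DNS lookup that fails when a domain’s authoritative name servers cannot be reached. It is an illustration only, not part of Facebook’s tooling or of the analyses cited above.

import socket

def check_resolution(hostname: str) -> None:
    # A healthy domain resolves to one or more IP addresses. During the
    # outage, lookups for facebook.com failed because its authoritative
    # name servers were no longer reachable from the wider Internet.
    try:
        print(f"{hostname} -> {socket.gethostbyname(hostname)}")
    except socket.gaierror as err:
        print(f"{hostname} did not resolve: {err}")

check_resolution("facebook.com")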

During planned network maintenance, “a command was issued with the intention to assess the availability of global backbone capacity, which unintentionally took down all the connections in our backbone network, effectively disconnecting Facebook data centers globally,” according to a blog post by Facebook VP of Infrastructure Santosh Janardhan.

The errant command would normally be caught by an auditing tool, but “a bug in that audit tool didn’t properly stop the command,” Facebook said.
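Facebook has not published how its audit tool works, so the sketch below is only a hypothetical illustration of the kind of pre-execution check it describes; names such as BackbonePlan and audit_command are invented for this example and are not Facebook’s actual tooling.

from dataclasses import dataclass

@dataclass
class BackbonePlan:
    total_links: int       # backbone links currently in service
    links_to_disable: int  # links the proposed command would take out of service

def audit_command(plan: BackbonePlan) -> bool:
    # Reject any change that would drain every backbone link at once; this is
    # the kind of mistake Facebook says its audit tool should have stopped.
    if plan.links_to_disable >= plan.total_links:
        print("REJECTED: command would disconnect the entire backbone")
        return False
    return True

# A capacity-assessment command that unintentionally targets every link:
print(audit_command(BackbonePlan(total_links=100, links_to_disable=100)))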

Technical Overview of the Facebook Outage

Here’s the section of the blog post that explains the issue and resulting outage, which is worth reading in full:

The data traffic between all these computing facilities is managed by routers, which figure out where to send all the incoming and outgoing data. And in the extensive day-to-day work of maintaining this infrastructure, our engineers often need to take part of the backbone offline for maintenance — perhaps repairing a fiber line, adding more capacity, or updating the software on the router itself.

This was the source of yesterday’s outage. During one of these routine maintenance jobs, a command was issued with the intention to assess the availability of global backbone capacity, which unintentionally took down all the connections in our backbone network, effectively disconnecting Facebook data centers globally. Our systems are designed to audit commands like these to prevent mistakes like this, but a bug in that audit tool didn’t properly stop the command.

This change caused a complete disconnection of our server connections between our data centers and the internet. And that total loss of connection caused a second issue that made things worse.

One of the jobs performed by our smaller facilities is to respond to DNS queries. DNS is the address book of the internet, enabling the simple web names we type into browsers to be translated into specific server IP addresses. Those translation queries are answered by our authoritative name servers that occupy well known IP addresses themselves, which in turn are advertised to the rest of the internet via another protocol called the border gateway protocol (BGP).

To ensure reliable operation, our DNS servers disable those BGP advertisements if they themselves can not speak to our data centers, since this is an indication of an unhealthy network connection. In the recent outage the entire backbone was removed from operation, making these locations declare themselves unhealthy and withdraw those BGP advertisements. The end result was that our DNS servers became unreachable even though they were still operational. This made it impossible for the rest of the internet to find our servers.
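The self-protective behavior Facebook describes, in which DNS servers withdraw their BGP routes once they lose sight of the backbone, can be modeled in a few lines of Python. The sketch below is illustrative only; the class and attribute names are assumptions, not Facebook’s implementation.

class DnsSite:
    """Toy model of one authoritative DNS location; all names are illustrative."""

    def __init__(self, name: str, backbone_reachable: bool):
        self.name = name
        self.backbone_reachable = backbone_reachable
        self.advertising_bgp = True  # a healthy site announces its prefix via BGP

    def health_check(self) -> None:
        if self.backbone_reachable:
            self.advertising_bgp = True
        else:
            # Intended to quietly remove a single unhealthy site from service.
            # With the entire backbone down, every site took this branch at
            # once, so the DNS servers stayed up but were unreachable.
            self.advertising_bgp = False

# With the backbone disconnected everywhere, no site keeps advertising:
sites = [DnsSite(name, backbone_reachable=False) for name in ("site-a", "site-b", "site-c")]
for site in sites:
    site.health_check()
print(any(site.advertising_bgp for site in sites))  # False: facebook.com cannot be found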

Manual Restarts Extend the Delay

Recovery became difficult because all of Facebook’s data centers were inaccessible, and the DNS outage hobbled many of the network tools that would normally be key in troubleshooting and repairing the problem.

With remote management tools unavailable, the affected systems had to be manually debugged and restarted by technicians in the data centers. “It took extra time to activate the secure access protocols needed to get people onsite and able to work on the servers. Only then could we confirm the issue and bring our backbone back online,” said Janardhan.

A final problem was how to restart Facebook’s huge global data center network and handle an immediate surge of traffic. This is a challenge that goes beyond network logjams to the data center hardware and power systems.

“Individual data centers were reporting dips in power usage in the range of tens of megawatts, and suddenly reversing such a dip in power consumption could put everything from electrical systems to caches at risk,” said Janardhan.

The data center industry exists to eliminate downtime in IT equipment by ensuring power and network are always available. A key principle is to eliminate single points of failure, and Monday’s outage illustrates how hyperscale networks that serve global audiences can also enable outages at unprecedented scale.

Now that the details of the outage are known, Facebook’s engineering team will assess what went wrong and seek to prevent a similar failure from recurring.

“Every failure like this is an opportunity to learn and get better, and there’s plenty for us to learn from this one,” Janardhan said. “After every issue, small and large, we do an extensive review process to understand how we can make our systems more resilient. That process is already underway. … From here on out, our job is to strengthen our testing, drills, and overall resilience to make sure events like this happen as rarely as possible.”

About Rich Miller

I write about the places where the Internet lives, telling the story of data centers and the people who build them. I founded Data Center Knowledge, the data center industry's leading news site. Now I'm exploring the future of cloud computing at Data Center Frontier.

Comments

  1. Mike R says

    October 7, 2021 at 12:36 am

    I worked for a company that does a lot of work on data center floors. Kaiser….Cisco to name a few. I’m just happy it wasn’t me who accidentally took down Facebook lol

    • Me says

      October 8, 2021 at 10:59 am

      I wished Facebook would stay down forever

  2. Aurelio says

    October 7, 2021 at 7:56 am

    But, why did they take hours to put the last good configuration back??

    • Tony says

      October 8, 2021 at 12:05 pm

      Read the article

  3. ZM says

    October 7, 2021 at 9:13 am

    It’s been 9 years since FB had an issue at this scale. Not a bad f’ing record.

  4. 4ward says

    October 7, 2021 at 9:19 am

    … and the world was a better place for five hours.

    • Brian Young says

      October 8, 2021 at 1:09 pm

      I didn’t even notice, and I’m a network engineer.

  5. Naveed says

    October 7, 2021 at 12:44 pm

    Why don’t they have High Availability feature on DNS & BGP?

    • John White says

      October 7, 2021 at 8:24 pm

      BGP is high availability. Programming bugs caused the issue, not the protocol itself.

    • kokos says

      October 10, 2021 at 8:30 am

      BGP is HA.
      It wasn’t a bug in any protocol, not BGP, not DNS.

      They simply issued valid commands to them, and they did what they were supposed to do.

      • Nathan says

        October 10, 2021 at 6:02 pm

        It is not UNIX’s job to stop you from shooting your foot. If you so choose to do so, then it is UNIX’s job to deliver Mr. Bullet to Mr Foot in the most efficient way it knows.

  6. Michael Levine says

    October 7, 2021 at 12:56 pm

    I know how that would happen for sure, being in IT. But it is funny that it happened just after the report on TV about their fishy practices. Almost like they didn’t want any hackers digging around the data until they found it themselves and got rid of it.

    • Gordon F says

      October 9, 2021 at 2:39 am

      Or simple misdirection. Pay attention to the outage, not the congressional testimony…

    • Vick says

      October 9, 2021 at 2:26 pm

      That would be my guess.

    • Paul Gooding says

      October 10, 2021 at 4:18 pm

      Silly
      WSJ and SEC already had the docs a month ago.

    • ok lol says

      October 11, 2021 at 12:06 pm

      Lol just admit you know nothing about the situation and go home the data was all released months ago xD and “misdirection” y’all need to chill with your conspiracy theories because the way every news outlet reported this was “Facebook goes down during court”, nobody outside of the tech community even knew there were hearings for Facebook until this. Them going down got the word out.

  7. Mongo says

    October 7, 2021 at 1:06 pm

    5 hours, plenty of time to get rid of any evidence

  8. Scott says

    October 7, 2021 at 4:34 pm

    I’m sure that was the issue, but what was the true cause? Other reports suggested that employees were locked out of the buildings because their security badges wouldn’t work. They couldn’t get their email and their phone systems wouldn’t work. To me that sounds like a coordinated attack from the inside on all their systems, not just a network configuration issue. Further, an issue like that should have been easily remediated. There’s more to this story than they’re willing to tell.

    • NorthLake says

      October 11, 2021 at 7:45 am

      People were locked out of the buildings, because upon scanning their security badges the gate would communicate with an internal server to get info on the badge, and whether they’re allowed to enter. No network traffic in any way means no network traffic from the security gate to that internal server. Meaning no one was able to enter.

      Their email and phone services wouldn’t work either, because those too go through their network. And again: no network traffic means no communication between phones/email and servers.

      The article even says that the command that caused all this should have been caught and denied by an audit tool, but due to a bug it didn’t.

      So, yes, all of this was indeed caused by ‘just a network configuration issue’…

      I’m not denying that it could be an internal attack, but to me their explanation is plausible.

    • ok lol says

      October 11, 2021 at 12:08 pm

      Do you not know what “our servers completely disconnected from the internet” means? I love all the people who know nothing about network systems commenting ridiculously stupid theories here. You could better spend this time by actually learning about the systems you’re talking about.

  9. Mr. Chip says

    October 8, 2021 at 8:00 am

    I found it refreshing to see someone took them down a notch, whether it was intentional or not. I spent the day in relative peace, not worried about getting put in Facebook jail, or slapped for spreading actual truth and not some bs fact check gumming up the works. It was nice. Should happen more often, so I now limit it to specific things and times… Unplug from them and they will cease to be relevant. Someone else should take up the slack with less invasive and quality social media.

  10. David Chappell says

    October 8, 2021 at 8:00 am

    I would like to try to answer some of the questions in the comments. While I have only worked on small networks, I have seen failures like these. Naveed asks why they don’t have high-availability DNS and BGP. They do. The problem is that an HA system only protects you against a failure of the components which provide the service. It does not protect you against a failure of the system which decides which service components should be in service. In this case the HA system decided that the good DNS servers were bad and took them all offline.
    Aurelio asks why it took them five hours to switch to a known good configuration. The reason is that they are operating all this equipment remotely over the network connection which they just shut down. It is like locking your keys in your car. You can’t just ‘undo’ it. Now you need someone to go to the data center, find a particular piece of equipment, and enter a specific command to start things back up. Plus now you are blind, so you may not even know what happened or what command needs to be entered where. Meanwhile the DNS HA system is taking your DNS servers off line. With DNS shutting down e-mail and the phones will stop working in a few minutes. And soon you won’t be able to get to the office servers which have the emergency contact lists. So now not only are your keys locked in your car, your phone is too.
    When someone finally gets to the data center, he can’t get in because the card reader can’t contact the server to figure out whether he is authorized. Hopefully there is a security guard there with keys. If not, there will be more delays waiting for someone to bring keys or a crowbar.
    Once you have someone in, your problems are still not solved. Somebody probably needs to talk him through going to a console somewhere and typing a bunch of cryptic commands to figure out what got shut down and where he needs to go to turn it back on.
    So it is not really surprising that it took them five hours to get things working again. It could have very easily taken more.

    • Suzy says

      October 8, 2021 at 9:55 pm

      Come on. You really think anyone here has a clue what you said? lol Besides me, that is.
      The one thing that is sure: everyone who posts their enjoyment that FB is having issues, and what they would do, is still on FB every day, knowing what happens there never stays there, knowing they get tracked, etc., but few really know or care what it takes to run a site like this. My hat’s off to FB. Now stop sending me ads and sales lol.

    • Stephen Driver says

      October 10, 2021 at 2:16 am

      Remote access systems, if correctly configured, can be accessed from the outside world even in the event the DNS is erased. On our systems we can access the operating system via SSH remotely, using a network port that gets its configuration from an on-board chip. Again, it is unlikely someone as large as Facebook doesn’t have this. It’s also worth noting, no, these updates are not always done remotely. Even when the bulk of the work is being done remotely, a recovery team is at every data center during updates for this reason (at Facebook, likely 24/7), so your assumption that they would be doing everything completely remotely seems to be just that: an assumption.

    • Mr fixit says

      October 10, 2021 at 7:18 am

      Dude 4G serial console would have fixed this

  11. lk77 says

    October 8, 2021 at 9:52 am

    Five hours is not a lot; it’s even a miracle they have been able to fix it in five hours.
    Some mistakes take seconds to make and hours to fix.

  12. Howard Abraham says

    October 8, 2021 at 11:20 am

    The Audit System should have STOPPED the faulty “assessment” command before it was executed!

    “To err is human. To really F— things up takes a computer!”

    (Fowl…what do you think I meant?)
    H

  13. Jim Hassinger says

    October 9, 2021 at 9:08 pm

    So what were they hiding? What software changes did they make at that rare political juncture? Were they putting the network back to the pre-election 2-level deep AI? It would treat your likes and hates by keeping you in your local group. Then after the election, everything back to normal. This sounds like an explanation that doesn’t explain anything.

    • ok lol says

      October 11, 2021 at 12:12 pm

      They literally explained everything that happened exactly. You just don’t know anything about network systems so you didn’t understand it dude.

  14. Marie says

    October 10, 2021 at 4:39 pm

    Five hours is enough time for this lying sack of a company to destroy any evidence of wrong doing after its beloved robotic leader was just unemotionally denying that Facebook is involved in any wrong doing.
    Facebook knows that it’s harming society but it doesn’t care as long as it keeps getting revenue.
    We have pockets of mass hysteria and psychosis all over the country, thanks in large part to Facebook, Instagram, Twitter, and TikTok, with FB, TWR, and Insta being the unholy trinity.
    There are studies out now about how these sites affect people’s mental health.

    I gave it all up last year as the pandemic and tensions across the country heated up.
    Life is happier and quieter without the endless noise and rubbish of social media platforms like Facebook.
    There are also plenty of communication platforms available for people to keep in contact with friends and family without an umbilical cord to Facebook.
    I keep my online presence limited to a few message boards and YouTube but other than that, no other social media.
    The best thing that could happen for humanity now is if Facebook and all its apps, Twitter, and Tik Tok all went down for good.
