A configuration error broke Facebook’s connection to a key network backbone, disconnecting all of its data centers from the Internet and leaving its DNS servers unreachable, the company said.
The unusual combination of errors took down the web operations of Facebook, Instagram and WhatsApp in a massive global outage that lasted more than five hours. In effect, Facebook said, a single errant command took down web services used by more than 7 billion accounts worldwide.
Early external analyses of the outage focused on Facebook’s Domain Name System (DNS) servers and on changes to network routes in the Border Gateway Protocol (BGP), issues that were clearly visible in public Internet records. Those turned out to be secondary problems triggered by Facebook’s backbone outage.
During planned network maintenance, “a command was issued with the intention to assess the availability of global backbone capacity, which unintentionally took down all the connections in our backbone network, effectively disconnecting Facebook data centers globally,” according to a blog post by Facebook VP of Infrastructure Santosh Janardhan.
The errant command would normally be caught by an auditing tool, but “a bug in that audit tool didn’t properly stop the command,” Facebook said.
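Facebook has not described how the audit tool works internally, but conceptually the guardrail that failed resembles a pre-execution check along the lines of the sketch below; every name and threshold in it is an illustrative assumption, not Facebook’s actual tooling.

from dataclasses import dataclass

@dataclass
class BackboneCommand:
    description: str
    links_affected: set[str]  # backbone links the command would touch

def audit(command: BackboneCommand, all_links: set[str],
          max_drain_fraction: float = 0.25) -> bool:
    """Block any command that would drain too much backbone capacity at once."""
    if not all_links:
        raise ValueError("no backbone inventory available; refusing to audit blindly")
    fraction = len(command.links_affected & all_links) / len(all_links)
    if fraction > max_drain_fraction:
        print(f"BLOCKED: command would take down {fraction:.0%} of the backbone")
        return False
    return True

# A capacity-assessment command that unintentionally touches every link should
# be stopped at this point; per Facebook, a bug let a command like this through.
links = {"dc1-dc2", "dc1-dc3", "dc2-dc3", "dc2-dc4"}
cmd = BackboneCommand("assess global backbone capacity", links_affected=links)
assert audit(cmd, links) is False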
Technical Overview of the Facebook Outage
Here’s the section of the blog post that explains the issue and resulting outage, which is worth reading in full:
The data traffic between all these computing facilities is managed by routers, which figure out where to send all the incoming and outgoing data. And in the extensive day-to-day work of maintaining this infrastructure, our engineers often need to take part of the backbone offline for maintenance — perhaps repairing a fiber line, adding more capacity, or updating the software on the router itself.
This was the source of yesterday’s outage. During one of these routine maintenance jobs, a command was issued with the intention to assess the availability of global backbone capacity, which unintentionally took down all the connections in our backbone network, effectively disconnecting Facebook data centers globally. Our systems are designed to audit commands like these to prevent mistakes like this, but a bug in that audit tool didn’t properly stop the command.
This change caused a complete disconnection of our server connections between our data centers and the internet. And that total loss of connection caused a second issue that made things worse.
One of the jobs performed by our smaller facilities is to respond to DNS queries. DNS is the address book of the internet, enabling the simple web names we type into browsers to be translated into specific server IP addresses. Those translation queries are answered by our authoritative name servers that occupy well known IP addresses themselves, which in turn are advertised to the rest of the internet via another protocol called the border gateway protocol (BGP).
To ensure reliable operation, our DNS servers disable those BGP advertisements if they themselves can not speak to our data centers, since this is an indication of an unhealthy network connection. In the recent outage the entire backbone was removed from operation, making these locations declare themselves unhealthy and withdraw those BGP advertisements. The end result was that our DNS servers became unreachable even though they were still operational. This made it impossible for the rest of the internet to find our servers.
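The withdrawal mechanism described above is essentially a health check wired to route advertisement: a DNS site keeps announcing its anycast prefix only while it can reach the data centers behind it. The sketch below shows that shape in rough form; the probe targets, the example prefix, and the announce/withdraw hooks are assumptions rather than Facebook’s implementation.

import socket

DNS_PREFIX = "192.0.2.0/24"  # example anycast prefix (documentation range)
DATACENTER_PROBES = [("10.0.0.1", 53), ("10.0.1.1", 53)]  # illustrative backends

def can_reach(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to the backend succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

def health_check(announce_prefix, withdraw_prefix) -> None:
    """Advertise the anycast prefix only while the data centers are reachable.

    With the backbone down, every probe fails, every DNS site withdraws its
    routes, and the servers become unreachable even though the DNS software
    itself is still running, which is the second failure described above.
    """
    if any(can_reach(host, port) for host, port in DATACENTER_PROBES):
        announce_prefix(DNS_PREFIX)  # keep the BGP advertisement up
    else:
        withdraw_prefix(DNS_PREFIX)  # declare this site unhealthy and pull the route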
Manual Restarts Extend the Delay
Recovery was difficult because all of Facebook’s data centers were inaccessible, and the DNS outage hobbled many of the network tools that would normally be key to troubleshooting and repairing the problems.
With remote management tools unavailable, the affected systems had to be manually debugged and restarted by technicians in the data centers. “It took extra time to activate the secure access protocols needed to get people onsite and able to work on the servers. Only then could we confirm the issue and bring our backbone back online,” said Janardhan.
A final problem was how to restart Facebook’s huge global data center network and handle an immediate surge of traffic. This is a challenge that goes beyond network logjams to the data center hardware and power systems.
“Individual data centers were reporting dips in power usage in the range of tens of megawatts, and suddenly reversing such a dip in power consumption could put everything from electrical systems to caches at risk,” said Janardhan.
The data center industry exists to eliminate downtime in IT equipment by ensuring power and network are always available. A key principle is to eliminate single points of failure, and Monday’s outage illustrates how hyperscale networks that serve global audiences can also enable outages at unprecedented scale.
Now that the details of the outage are known, Facebook’s engineering team will assess what went wrong and seek to prevent a similar issue from recurring.
“Every failure like this is an opportunity to learn and get better, and there’s plenty for us to learn from this one,” Janardhan said. “After every issue, small and large, we do an extensive review process to understand how we can make our systems more resilient. That process is already underway. … From here on out, our job is to strengthen our testing, drills, and overall resilience to make sure events like this happen as rarely as possible.”
I worked for a company that does a lot of work on data center floors. Kaiser….Cisco to name a few. I’m just happy it wasn’t me who accidentally took down Facebook lol
I wished Facebook would stay down forever
But why did it take them hours to restore the last known-good configuration??
Read the article
It’s been 9 years since FB had an issue at this scale. Not a bad f’ing record.
… and the world was a better place for five hours.
I didn’t even notice, and I’m a network engineer.
Why don’t they have High Availability feature on DNS & BGP?
BGP is high availability. Programming bugs caused the issue, not the protocol itself.
BGP is HA.
It wasn’t a bug in any protocol, not BGP, not DNS.
They simply issued valid commands to them, and they did what they were supposed to do.
It is not UNIX’s job to stop you from shooting your foot. If you so choose to do so, then it is UNIX’s job to deliver Mr. Bullet to Mr Foot in the most efficient way it knows.
I know for sure how that would happen, being in IT. But it is funny that it happened just after the report on TV about their fishy practices. Almost like they didn’t want any hackers digging around the data until they found it themselves and got rid of it.
Or simple misdirection. Pay attention to the outage, not the congressional testimony…
That would be my guess.
Silly
WSJ and SEC already had the docs a month ago.
Lol, just admit you know nothing about the situation and go home; the data was all released months ago xD. And “misdirection”? Y’all need to chill with your conspiracy theories, because the way every news outlet reported this was “Facebook goes down during court”; nobody outside of the tech community even knew there were hearings for Facebook until this. Them going down got the word out.
5 hours, plenty of time to get rid of any evidence
I’m sure that was the issue, but what was the true cause? Other reports suggested that employees were locked out of the buildings because their security badges wouldn’t work, they couldn’t get their email, and their phone systems wouldn’t work. To me that sounds like a coordinated attack from the inside on all their systems, not just a network configuration issue. Further, an issue like that should have been easily remediated. There’s more to this story than they’re willing to tell.
People were locked out of the buildings because, upon scanning their security badges, the gate would communicate with an internal server to check the badge and whether its holder is allowed to enter (there’s a toy sketch of this below). No network traffic at all means no traffic from the security gate to that internal server, meaning no one was able to enter.
Their email and phone services wouldn’t work either, because those too go through their network. And again: no network traffic means no communication between phones/email and servers.
The article even says that the command that caused all this should have been caught and denied by an audit tool, but due to a bug it wasn’t.
So, yes, all of this was indeed caused by ‘just a network configuration issue’…
I’m not denying that it could be an internal attack, but to me their explanation is plausible.
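To make the badge part concrete, the check behind a door reader is roughly the toy sketch below. The address and the wire protocol are invented; the point is that when the query cannot even be sent, the only safe default is to deny entry.

import socket

AUTH_SERVER = ("10.20.30.40", 8443)  # internal badge-authorization service (invented address)

def badge_allows_entry(badge_id: str) -> bool:
    """Ask the internal server whether this badge may open the door."""
    try:
        with socket.create_connection(AUTH_SERVER, timeout=3) as conn:
            conn.sendall(badge_id.encode() + b"\n")
            return conn.recv(32).strip() == b"ALLOW"
    except OSError:
        # With no reachable auth server the question cannot be asked at all,
        # so the reader fails closed, which is exactly what locked people out.
        return False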
Do you not know what “our servers completely disconnected from the internet” means? I love all the people who know nothing about network systems commenting ridiculously stupid theories here. You could better spend this time by actually learning about the systems you’re talking about.
I found it refreshing to see someone take them down a notch, whether it was intentional or not. I spent the day in relative peace, not worried about getting put in Facebook jail or slapped for spreading actual truth and not some bs fact check gumming up the works. It was nice. It should happen more often, so I now limit my use to specific things and times… Unplug from them and they will cease to be relevant. Someone else should take up the slack with less invasive, better-quality social media.
I would like to try to answer some of the questions in the comments. While I have only worked on small networks, I have seen failures like these. Naveed asks why they don’t have high-availability DNS and BGP. They do. The problem is that an HA system only protects you against a failure of the components which provide the service. It does not protect you against a failure of the system which decides which service components should be in service. In this case the HA system decided that the good DNS servers were bad and took them all offline (there’s a toy sketch of this at the end of this comment).
Aurelio asks why it took them five hours to switch to a known-good configuration. The reason is that they are operating all this equipment remotely over the network connection which they just shut down. It is like locking your keys in your car. You can’t just ‘undo’ it. Now you need someone to go to the data center, find a particular piece of equipment, and enter a specific command to start things back up. Plus now you are blind, so you may not even know what happened or what command needs to be entered where. Meanwhile the DNS HA system is taking your DNS servers offline. With DNS shutting down, e-mail and the phones will stop working in a few minutes. And soon you won’t be able to get to the office servers which have the emergency contact lists. So now not only are your keys locked in your car, your phone is too.
When someone finally gets to the data center, he can’t get in because the card reader can’t contact the server to figure out whether he is authorized. Hopefully there is a security guard there with keys. If not, there will be more delays waiting for someone to bring keys or a crowbar.
Once you have someone in, your problems are still not solved. Somebody probably needs to talk him through going to a console somewhere and typing a bunch of cryptic commands to figure out what got shut down and where he needs to go to turn it back on.
So it is not really surprising that it took them five hours to get things working again. It could have very easily taken more.
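Here is the DNS HA failure mode from the start of this comment in toy code: the controller that decides which servers stay in service depends on a backbone probe, so when that shared dependency fails, every healthy server gets pulled at once. All names here are illustrative, not Facebook’s systems.

def backbone_reachable(server: str) -> bool:
    """Stand-in for the real reachability probe; during the outage it failed everywhere."""
    return False  # simulate the backbone being down

def reconcile(dns_servers: list[str]) -> list[str]:
    """Keep only the servers whose probe succeeds, which is what the HA logic effectively did."""
    in_service = [s for s in dns_servers if backbone_reachable(s)]
    # Note the missing safeguard: nothing here says "never withdraw the last
    # N servers", so a shared-dependency failure empties the pool entirely.
    return in_service

print(reconcile(["dns-a", "dns-b", "dns-c"]))  # prints [] : every server withdrawn at once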
Come on, you really think anyone here has a clue what you said? lol. Besides me, that is.
The one thing that is sure: everyone who posts their enjoyment that FB is having issues, and what they would do about it, is still on FB every day, knowing what happens there never stays there, knowing they get tracked, etc. But few really know or care what it takes to run a site like this. My hat’s off to FB. Now stop sending me ads and sales lol.
Remote access systems, if correctly configured, can be accessed from the outside world even in the event the DNS is erased. On our systems we can access the operating system via SSH remotely using a management network port that gets its configuration from an onboard ROM chip. Again, it is unlikely that someone as large as Facebook doesn’t have this. It’s also worth noting that these updates are not always done remotely: even when the bulk of the work is being done remotely, a recovery team is at every data center during updates for exactly this reason (at Facebook, likely 24/7). The idea that they would be doing all of this purely remotely is itself just an assumption.
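To illustrate what out-of-band access buys you: as long as the management interface has its own address and path, you can reach a box by raw IP with no DNS involved at all. A rough sketch using paramiko, one common SSH library; the address, username, key path, and command are all made up.

import paramiko

MGMT_IP = "198.51.100.10"  # out-of-band management address (example range); no DNS lookup needed

client = paramiko.SSHClient()
client.set_missing_host_key_policy(paramiko.AutoAddPolicy())
client.connect(MGMT_IP, username="netops", key_filename="/path/to/mgmt_key")

# Query the device state entirely out-of-band; the exact command depends on
# the platform, so this one is only a placeholder.
stdin, stdout, stderr = client.exec_command("show interface status")
print(stdout.read().decode())
client.close()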
Dude, a 4G serial console would have fixed this.
Five hours is not a lot; it’s even a miracle they were able to fix it in five hours.
Some mistakes take seconds to make and hours to fix.
The Audit System should have STOPPED the faulty “assessment” command before it was executed!
“To err is human. To really F— things up takes a computer!”
(Fowl…what do you think I meant?)
So what were they hiding? What software changes did they make at that rare political juncture? Were they putting the network back to the pre-election 2-level deep AI? It would treat your likes and hates by keeping you in your local group. Then after the election, everything back to normal. This sounds like an explanation that doesn’t explain anything.
They literally explained everything that happened exactly. You just don’t know anything about network systems so you didn’t understand it dude.
Five hours is enough time for this lying sack of a company to destroy any evidence of wrongdoing, after its beloved robotic leader was just unemotionally denying that Facebook is involved in any wrongdoing.
Facebook knows that it’s harming society but it doesn’t care as long as it keeps getting revenue.
We have pockets of mass hysteria and psychosis all over the country, thanks in large part to Facebook, Instagram, Twitter, and Tik Tok, with FB, TWR, and Insta being the unholy trinity.
There are studies out now about how these sites affect people’s mental health.
I gave it all up last year as the pandemic and tensions across the country heated up.
Life is happier and quieter without the endless noise and rubbish of social media platforms like Facebook.
There are also plenty of communication platforms available for people to keep in contact with friends and family without an umbilical cord to Facebook.
I keep my online presence limited to a few message boards and YouTube but other than that, no other social media.
The best thing that could happen for humanity now is if Facebook and all its apps, Twitter, and Tik Tok all went down for good.