
RFO for the outage on 29th Dec 2021


On the 29th of December 2021, bgp.tools lost data about upstreams on the site. This was caused by a change deployed the day before that was designed to prevent faulty data from entering the system (data that had been causing false alerts about multiple ASes originating the same prefix).

Unfortunately, this change actually threw away almost all of the data going into the site. To understand why this was not spotted sooner, you need to know that bgp.tools uses Redis as a data store for routing information. To ensure that faulty data eventually gets flushed out, all keys have a TTL attached to them. As a result, the change applied the night before was not noticed until the following afternoon, when these TTLs began to “kick in” and remove data in a way that was visible to visitors.
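As a rough illustration of why TTLs delay the visible impact, here is a minimal sketch using an in-memory stand-in for a key/value store with per-key expiry. This is not the bgp.tools code: `TTLStore`, `run_importer`, the `drop_everything` flag, the 12 hour TTL, and the example prefixes and AS numbers are all assumptions made up for the example.

```python
HOUR = 3600


class TTLStore:
    """Tiny in-memory stand-in for a key/value store with per-key expiry."""

    def __init__(self):
        self._data = {}  # key -> (value, expires_at)

    def set(self, key, value, ttl, now):
        # Writing a key also resets its expiry time.
        self._data[key] = (value, now + ttl)

    def get(self, key, now):
        entry = self._data.get(key)
        if entry is None or now >= entry[1]:
            return None  # missing or expired
        return entry[0]


def run_importer(store, routes, now, ttl, drop_everything=False):
    """Write each (prefix, upstream) pair with a TTL; return how many were written."""
    written = 0
    for prefix, upstream in routes:
        if drop_everything:  # hypothetical over-eager filter that rejects every record
            continue
        store.set(prefix, upstream, ttl, now)
        written += 1
    return written


store = TTLStore()
routes = [("192.0.2.0/24", "AS64500"), ("198.51.100.0/24", "AS64501")]

# A healthy run at t=0 writes both keys with a 12 hour TTL (assumed value).
run_importer(store, routes, now=0, ttl=12 * HOUR)

# A broken run at t=3h writes nothing, so no TTL gets refreshed.
written = run_importer(store, routes, now=3 * HOUR, ttl=12 * HOUR, drop_everything=True)

# The old data is still served until the TTL from the last good run expires.
still_visible = store.get("192.0.2.0/24", now=6 * HOUR)
after_expiry = store.get("192.0.2.0/24", now=13 * HOUR)
```

In this sketch the broken run writes nothing at all, yet the site would keep serving the old value for hours, and only once the TTL lapses does the data vanish for visitors.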

The data importers did not have end-to-end tests, as these had not yet been written, and fixing the problem was not helped by the fact that the single person who could deploy site changes was in a taxi at the time, without access to the servers that run the site.

To prevent this happening in the future, the data importer itself now has unit tests that import a tiny set of bgp.tools data and run sanity checks on the result. This should prevent similar issues.
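The kind of sanity check described above could look something like the following sketch. The record layout, the function names (`import_routes`, `sanity_check`), the two filters, and the 50% threshold are all hypothetical and only serve to show the idea: run the importer over a small fixture and fail loudly if it keeps suspiciously little data.

```python
def import_routes(records, is_valid):
    """Keep only the records that the validity filter accepts."""
    return [r for r in records if is_valid(r)]


def sanity_check(records, imported, min_keep_ratio=0.5):
    """Return a list of problems; an empty list means the import looks plausible."""
    problems = []
    if not imported:
        problems.append("importer produced no records")
    elif len(imported) / len(records) < min_keep_ratio:
        problems.append("importer dropped most of the input")
    if not any(r["upstreams"] for r in imported):
        problems.append("import contains no upstream data")
    return problems


# Tiny fixture: three good records and one faulty record with no origin.
FIXTURE = [
    {"prefix": "192.0.2.0/24", "origin": 64500, "upstreams": [64510]},
    {"prefix": "198.51.100.0/24", "origin": 64501, "upstreams": [64510, 64511]},
    {"prefix": "203.0.113.0/24", "origin": 64502, "upstreams": [64511]},
    {"prefix": "203.0.113.0/25", "origin": None, "upstreams": []},
]


def good_filter(record):
    # Rejects only the record that is actually faulty.
    return record["origin"] is not None


def broken_filter(record):
    # Models a filter that accidentally rejects everything.
    return False


good_problems = sanity_check(FIXTURE, import_routes(FIXTURE, good_filter))
bad_problems = sanity_check(FIXTURE, import_routes(FIXTURE, broken_filter))
```

With the working filter the check reports no problems, while the filter that rejects everything is flagged both for producing no records and for containing no upstream data.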

Effort is slowly being made to set up a CI/CD system, to prevent a future case where a fix is written but can’t be deployed due to lack of access to a development environment and matching physical keys.

bgp.tools has no official SLA, but does attempt to be as accurate as possible. During this outage bgp.tools was not accurate since it was showing very little peering data, and zero upstream data. I apologise for this.

Timeline:

28 Dec 2021, 22:53 GMT - Change deployed to fix a bug in the data importer

29 Dec 2021, 01:05 GMT - Data importer runs, and removes half of all upstream data points

29 Dec 2021, 04:05 GMT - Data importer runs, and removes the other upstream data points

Incident begins

29 Dec 2021, 13:30 GMT - Ben is informed that bgp.tools is broken, but is in a taxi and unable to deploy

29 Dec 2021, 13:41 GMT - A PR is written to fix the bug

29 Dec 2021, 16:52 GMT - Patch deployed, data importer is forced to run to fix the live issue on the site

Incident ends

– Benjojo

