At the end of 2022, CERN’s web infrastructure suffered a major outage: thousands of internal and external websites became unavailable within minutes. My colleague Francisco Borges Aurindo Barros and I were the ones doing the troubleshooting and cleanup on that day (and the following days). Almost a year later, we presented the incident at the USENIX Site Reliability Engineering Conference (SREcon) in Dublin: the timeline of the event, its impact, the root cause analysis, and most importantly the technical and process improvements we have made since then.
The PDF slides are available here and the video recording can be found on YouTube.
I really enjoyed the atmosphere at the conference and talking to the other participants and speakers: compared to other “industry conferences”, this one felt much less commercial and a lot more focused on content. The venue (Convention Centre Dublin) was also not too big, which matters because it spares you from walking long distances between presentations. Finally, exploring the pubs and distilleries of Dublin added to the experience. :-)