Starting on October 5, 2020 at 5:58 a.m. PDT, some customers experienced degraded functionality in Slack due to slow API performance. Since many of Slack’s features rely on making API calls, customers experienced latency in almost all aspects of our service. This included, but wasn’t limited to, the sending and loading of messages in threads and channels, as well as searching, calling, and accessing administrative features.
We initially identified an overloaded database memory cache server as a source of instability. At 8:10 a.m. PDT, we provisioned a replacement server for this host. When it came online, it unexpectedly caused additional instability and further degraded the performance and availability of our API. The replacement server came fully online with no residual instability in our database memory cache layer by 8:52 a.m. PDT.
While we were working on the previous memory cache issue, our team also began looking into problems with our internal service discovery system. This system is responsible for routing traffic within our infrastructure to the multiple services that support various features of Slack. We made several changes to this system over the course of the next several hours to remediate the impact of its instability. Throughout this time, our service was degraded, but was still available.
At 1:25 p.m. PDT, one of these configuration changes inadvertently degraded Slack’s availability further and triggered an outage. We immediately identified and deployed a fix for this issue, which finished rolling out to our server fleet at 1:38 p.m. PDT. At this point, we determined the system was fully recovered and operating normally for all customers.
Around 5:00 p.m. PDT, we received reports about portions of the product that were still impacted by ripple effects from the earlier disruption in our service discovery system. This included search, which we remedied at 5:40 p.m. PDT. We also saw lingering errors in applications using our Real Time Messaging (RTM) API, and resolved those problems by 10:30 p.m. PDT with similar remediation steps.
Our teams have been working around the clock to ensure that Slack remains available, fast, and reliable. While we haven’t yet resolved every underlying factor in the degradation of our service discovery system, we’re confident that we have a path forward and that proper mitigations are in place while we continue this work. We don’t expect any additional customer impact from this issue. However, if similar problems do arise, we’ll provide updates on this site. If you would like to receive a full Root Cause Analysis (RCA) report, please reach out to email@example.com to request one. Understanding and remediating this issue has our full attention, and nothing is a higher priority. We’ll send the report to you once it’s ready.
7:55 PM PDT
We've confirmed Slack should be working as expected as of 1:38 p.m. PDT, and the timestamp below has been edited to reflect this.
We're continuing efforts on our side to investigate and ensure this trouble doesn't happen again. We'll follow up here with a summary once we have more information.
We're truly sorry for the interruption today. We appreciate your patience and understanding.
1:38 PM PDT
While we don't have a further update, we're continuing to work to resolve these issues. We'll be sharing an update every hour.
1:20 PM PDT
We've updated the list of affected services here to include all services, as any of them could be impacted.
Some users may be unable to access Slack or certain features. For example, they may see a "Something's gone awry" error message when using Slack calls. We'll report back soon.
12:43 PM PDT
The errors we're seeing are declining, though customers may not see any changes yet. Our team is continuing its investigation, and we'll report back soon.
12:13 PM PDT
Things are trending in the right direction, but some customers may still be experiencing issues. We’ll be back soon.
11:39 AM PDT
We continue to see improvement, but some users may still be experiencing delays. We'll keep you updated on our progress as we learn more.
11:06 AM PDT
We’re seeing signs of improvement, but we’re not out of the woods yet and users may still be seeing slow performance. We’ll be back with an update in 30 minutes.
10:41 AM PDT
The investigation is still ongoing, but the scope has stayed the same. We’ll update you again in a half hour.
10:06 AM PDT
No new updates for the moment as we're continuing to investigate. Customers may notice that other functionality, such as Search, is also affected.
9:40 AM PDT
There are no changes to report as of yet, but we're continuing to dig in on our side and will update you again in 30 minutes.
9:08 AM PDT
Some users may be unable to connect to Slack, while others are still experiencing general performance issues. Our team is working to get to the bottom of this and we will share more news soon.
8:40 AM PDT
We're still working to get to the bottom of the performance issues that users might be experiencing. Thanks for bearing with us; we'll keep you updated as we learn more.
8:05 AM PDT
No new updates to share yet as our team is still digging into the issue. Users may still be experiencing slow performance when sending messages or making API calls, or general slowness in Slack. We’ll let you know how things are looking in another 30 minutes.
7:36 AM PDT
Some users may be experiencing slowness with Slack on desktop, in the browser, and on mobile at this time. The issue is affecting message sending and causing trouble with API calls. Our team is looking into it and we'll follow up with more updates in 30 minutes.
7:05 AM PDT