Monday, October 5, 2020

Incident

Trouble processing messages with the RTM API

Issue summary:
On October 5, 2020 around 6:50 p.m. PDT, some customers began experiencing trouble processing messages with the RTM (Real Time Messaging) API. This included issues interacting with apps and bots.

This was related to the outage we experienced earlier that day. After that outage was resolved, we noticed lingering errors in applications using the RTM API, which were caused by the earlier disruption. A fix was deployed on October 5, 2020 around 10:30 p.m. PDT.

6:15 PM PST

We’ve resolved the issue with the RTM API and users should no longer be having trouble. We’re sorry for the disruption this has caused.

11:40 PM PST

We don't have any new information to share at the moment. We'll be back with an update as soon as we do. Thank you for your patience.

8:27 PM PST

Our investigation of the RTM API issue is still ongoing. We will provide another update in 30 minutes.

7:56 PM PST

No additional news to share at this stage, but we’re focused on getting things back to normal as quickly as possible. We’ll be back with another update in 30 minutes.

7:26 PM PST

Some customers may be experiencing trouble processing messages with the RTM API, including issues interacting with apps and bots. We apologise for the disruption. We are investigating and will work to resolve this as fast as we can.

6:56 PM PST

Services affected

Apps/Integrations/APIs

Status

Incident

Outage

Users are experiencing degraded performance across devices and may be unable to connect

Issue summary:
Starting on October 5, 2020 at 5:58 a.m. PDT, some customers experienced degraded functionality in Slack due to slow API performance. Since many of Slack’s features rely on making API calls, customers experienced latency in almost all aspects of our service. This included, but wasn’t limited to, the sending and loading of messages in threads and channels, as well as searching, calling, and accessing administrative features.

We initially identified an overloaded database memory cache server as a source of instability. At 8:10 a.m. PDT, we provisioned a replacement server for this host. When it came online, it unexpectedly caused additional instability and further degraded the performance and availability of our API. The replacement server came fully online with no residual instability in our database memory cache layer by 8:52 a.m. PDT.

While we were working on the previous memory cache issue, our team also began looking into problems with our internal service discovery system. This system is responsible for routing traffic within our infrastructure to the multiple services that support various features of Slack. We made several changes to this system over the course of the next several hours to remediate the impact of its instability. Throughout this time, our service was degraded, but was still available.

At 1:25 p.m. PDT, one of these configuration changes inadvertently degraded Slack’s availability further and triggered an outage. We immediately identified and deployed a fix for this issue, which finished rolling out to our server fleet at 1:38 p.m. PDT. At this point, we determined the system was fully recovered and operating normally for all customers.

Around 5:00 p.m. PDT, we received reports about portions of the product that were still impacted by ripple effects from the earlier disruption in our service discovery system. This included search, which we remedied at 5:40 p.m. PDT. We also saw lingering errors in applications using our Real Time Messaging (RTM) API, and resolved those problems by 10:30 p.m. PDT with similar remediation steps.

Our teams have been working around the clock on ensuring that Slack remains available, fast, and reliable. While we haven’t yet resolved all the underlying nuances of the degradation in our service discovery system, we’re confident that we have a path forward and that we have proper mitigations in place to continue this work. We don’t expect any additional customer impact due to this issue. However, if similar problems do arise, we’ll provide updates on this site. If you would like to receive a full Root Cause Analysis (RCA) report, please reach out to feedback@slack.com to request one. Understanding and remediating this issue has our full attention — nothing is a higher priority — and we’ll send the report to you once it’s ready.

7:55 PM PST

We've confirmed Slack should be working as expected as of 1:38 p.m. PDT, and the timestamp below has been edited to reflect this.

We're continuing efforts on our side to investigate and ensure this trouble doesn't happen again. We'll follow up here with a summary once we have more information.

We're truly sorry for the interruption today. We appreciate your patience and understanding.

1:38 PM PST

While we don't have a further update, we're continuing to work to resolve these issues. We'll be sharing an update every hour.

1:20 PM PST

We've updated the affected services listed here to include all services, as any of them could be impacted.

Some users may be unable to access Slack or certain features. For example, they may see a "Something's gone awry" error message when using Slack calls. We'll report back soon.

12:43 PM PST

The errors we're seeing are declining, though customers may not see any changes yet. Our team is still continuing our investigation, and we'll report back soon.

12:13 PM PST

Things are trending in the right direction, but some customers may still be experiencing issues. We’ll be back soon.

11:39 AM PST

We continue to see improvement, but some users may still be experiencing delays. We'll continue to keep you updated on the progress as we know more.

11:06 AM PST

We’re seeing signs of improvement, but we’re not out of the woods yet and users may still be seeing slow performance. We’ll be back with an update in 30 minutes.

10:41 AM PST

The investigation is still ongoing, but the scope has stayed the same. We’ll update you again in a half hour.

10:06 AM PST

No new updates for the moment as we're continuing to investigate. Customers may notice that other functionality, such as Search, is also affected.

9:40 AM PST

There are no changes to report as of yet, but we're continuing to dig in on our side and will update you again in 30 minutes.

9:08 AM PST

Some users may be unable to connect to Slack, while others are still experiencing general performance issues. Our team is working to get to the bottom of this and we will share more news soon.

8:40 AM PST

We're still working to get to the bottom of the performance issues that users might be experiencing. Thanks for bearing with us and we'll continue to keep you updated as we know more.

8:05 AM PST

No new updates to share yet as our team is still digging into the issue. Users may still be experiencing slow performance when sending messages or making API calls, or general slowness in Slack. We’ll let you know how things are looking in another 30 minutes.

7:36 AM PST

Some users may be experiencing slowness with Slack on desktop, browser, and mobile at this time. The issue is affecting message sending and API calls. Our team is looking into it and we'll follow up with more updates in 30 minutes.

7:05 AM PST

Services affected

Login/SSO

Connectivity

Messaging

Link Previews

Files

Notifications

Huddles

Search

Apps/Integrations/APIs

Workspace/Org Administration

Status

Outage