Wednesday August 24, 2016

Incident

Investigating intermittent failures

All Clear: In the course of resolving the telemetry issue and the downstream database issue, we uncovered a capacity issue in our ability to serve files to you. We right-quick addressed that, and Slack should be snappy and well. Sorry for the bumpy ride!

10:11 AM PST

At last we've observed the desired changes to the behavior of our critical databases and their caches following the fix we just deployed. We believe problems connecting to Slack, using the website, and really anything else are now resolved. New telemetry about client memory usage was ultimately the straw that broke the database's back. We're continuing to watch over the situation and are working to prevent similar situations in the future.

9:02 AM PST

Intermittent problems with all aspects of Slack continue, but we believe we understand the problem and are now deploying a fix that should restore service for everyone. We're monitoring the situation closely as this change is deployed.

8:42 AM PST

We're still observing the same kinds of failures as we have been for the last 90 minutes or so. We believed we'd solved the problem but are still getting to the bottom of it. At present we are tracking down why we are overwhelming the caches that keep lookups of information about your users and teams fast. More updates soon.

8:12 AM PST

We are continuing to monitor the situation caused by a still-undiagnosed deadlock in one of our critical databases this morning. At first the failures were intermittent, but for the last 10 minutes the problems have prevented some people from logging into Slack. We believe the problem is fully addressed and are continuing to search for the contributing factors.

7:50 AM PST

We've been tracking the problems, which first manifested as file upload and serving errors. Unfortunately, the problem has grown, and some folks may have trouble signing into Slack.

7:41 AM PST

Our monitoring is reporting intermittent file upload failures. We're investigating and will update when we have more information to share.

7:00 AM PST
