Steam’s Caching Disaster
An outside perspective on the technical aspects of Valve's networking incident of Dec 25, 2015 that led to private profile information of select customers being publicly available on the storefront.
Christmas day is often a very busy day for online services. Thousands of new customers usually flood these services trying to activate new products they received as gifts. Network staff on hand have to quickly react to the changing condition of their infrastructure to ensure that it stays up during that period. A downed service often gives a bad first impression to a new customer.
Steam is no exception to this phenomenon. Unfortunately, a bad reconfiguration of their caching proxies by Valve employees on Dec 25, 2015, led to private customer information being served publicly during a period of approximately 1 hour and 15 minutes.
Server-Side Caching
In large-scale systems, caching proxies are often used to accelerate the application. They prevent application servers from being hit with requests for content that seldom changes.
During a normal HTTP transaction, the client requests a resource from the application server. The application server then does its normal processing to build the page. This usually includes:
- Determining what content in what language to fetch
- Querying the database for content
- Applying business rules to the content
- Formatting/transforming the content
It then sends the rendered HTML to the client.
Adding a caching proxy between the application server and the client saves the application server from having to do all this work for every request. Instead, the client requests a resource from the caching proxy. If the caching proxy has a cached version of the requested resource that isn’t considered stale (cached resources have expiry rules), it will serve it directly to the client. If not, it will ask the application server for it, store the result, and then serve that copy.
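The lookup described above can be sketched in a few lines of Python. This is a toy in-memory model, not how any real proxy is implemented: the class name `CachingProxy`, the single fixed `ttl_seconds` expiry rule, and the `origin` callable standing in for the application server are all assumptions made for illustration.

```python
import time
from typing import Callable, Dict, Tuple


class CachingProxy:
    """Toy caching proxy: serves fresh cached copies, otherwise asks the origin."""

    def __init__(
        self,
        origin: Callable[[str], str],  # hypothetical stand-in for the application server
        ttl_seconds: float,            # simplistic expiry rule: one TTL for everything
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.origin = origin
        self.ttl_seconds = ttl_seconds
        self.clock = clock
        self._store: Dict[str, Tuple[float, str]] = {}  # url -> (stored_at, body)

    def get(self, url: str) -> str:
        now = self.clock()
        entry = self._store.get(url)
        if entry is not None:
            stored_at, body = entry
            if now - stored_at < self.ttl_seconds:
                return body  # fresh copy: the origin is never contacted
        # Missing or stale: do the expensive work once, then remember the result.
        body = self.origin(url)
        self._store[url] = (now, body)
        return body
```

With a 60 second TTL, two requests for the same URL within a minute reach the origin only once; a request made after the TTL has elapsed triggers a fresh render.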
However, not all resources are cached by the proxy. Authenticated requests are usually not cached, as the result is usually different for each user. The same goes for form submissions (POST requests). Using Edge Side Includes (ESI), it is possible to cache only part of a page; this allows you, for example, to cache article content without affecting the authenticated header. The cacheability of pages is usually determined by Cache-Control HTTP headers.
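As a rough illustration of how a shared proxy might decide whether a response may be stored, here is a deliberately conservative sketch. The function names are hypothetical, and the rules are a simplification of real HTTP caching semantics: for instance, `no-cache` technically permits storing with revalidation, and real caches may apply heuristic freshness, neither of which is modeled here.

```python
from typing import Mapping, Set


def parse_cache_control(value: str) -> Set[str]:
    """Return the lowercase directive names found in a Cache-Control header value."""
    return {
        part.split("=", 1)[0].strip().lower()
        for part in value.split(",")
        if part.strip()
    }


def is_cacheable_by_shared_proxy(
    method: str,
    request_headers: Mapping[str, str],
    response_headers: Mapping[str, str],
) -> bool:
    """Conservative sketch: store only what is clearly safe to share between users."""
    # Form submissions and other non-read methods are never cached here.
    if method.upper() not in {"GET", "HEAD"}:
        return False

    request = {name.lower(): value for name, value in request_headers.items()}
    response = {name.lower(): value for name, value in response_headers.items()}
    directives = parse_cache_control(response.get("cache-control", ""))

    # The origin explicitly asked shared caches to keep their hands off.
    if directives & {"private", "no-store", "no-cache"}:
        return False

    # Assumption: a cookie or Authorization header marks the request as authenticated,
    # so the response is only shared if the origin explicitly marked it public.
    if ("cookie" in request or "authorization" in request) and "public" not in directives:
        return False

    # Require an explicit signal that the response may be stored.
    return bool(directives & {"public", "max-age", "s-maxage"})
```

Under these rules an anonymous GET answered with `Cache-Control: public, max-age=300` is stored, while a POST, a response marked `private`, or a cookie-bearing request without an explicit `public` is passed through uncached.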
Caching proxies are cheaper to deploy than full application servers. This makes it easy to deploy multiple caching proxies around the world and respond rapidly to requests regardless of the user’s location.
The Incident
Mid-afternoon on Dec 25, Steam went down due to load. Network staff made a change to the caching proxies in the hope of bringing the service back up.
Unfortunately, the change caused ALL pages (regardless of their cacheability headers) to be cached by the caching proxies. This meant that users requesting resources were seeing responses that had been generated for other authenticated users. The bad configuration was live for an hour and 15 minutes before Valve reacted and turned the servers off.
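To make the failure mode concrete, here is a small simulation of a proxy that ignores cacheability headers. It is not a reconstruction of Valve's actual configuration, which was not public when this was written; the `origin` function, the `Proxy` class, the `respect_headers` flag, and the user names are all invented for the example.

```python
from typing import Dict, Tuple


def origin(url: str, user: str) -> Tuple[str, str]:
    """Hypothetical application server: returns (Cache-Control value, body)."""
    if url == "/account":
        # Personalized page: the origin correctly marks it as not shareable.
        return "private, no-store", f"Account page for {user}"
    return "public, max-age=300", "Store front page"


class Proxy:
    """Toy shared cache keyed by URL only, so every user shares one entry per URL."""

    def __init__(self, respect_headers: bool) -> None:
        self.respect_headers = respect_headers
        self.cache: Dict[str, str] = {}

    def get(self, url: str, user: str) -> str:
        if url in self.cache:
            return self.cache[url]  # served without asking who the requester is
        cache_control, body = origin(url, user)
        cacheable = "private" not in cache_control and "no-store" not in cache_control
        # The misconfiguration: store the response even when the origin said not to.
        if cacheable or not self.respect_headers:
            self.cache[url] = body
        return body


healthy = Proxy(respect_headers=True)
healthy.get("/account", "alice")
healthy_result = healthy.get("/account", "bob")   # bob gets his own page

broken = Proxy(respect_headers=False)
broken.get("/account", "alice")
broken_result = broken.get("/account", "bob")     # bob is served alice's cached page
```

The only difference between the two proxies is whether the origin's headers are honored, which is why a single configuration change can turn a working cache into a data leak.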
This type of configuration error is pretty common in the industry. However, in almost every case, the faulty config is caught and corrected before it is graduated to the production servers.
Valve unfortunately either failed to catch the error before graduating it to prod or simply did not test before deploying. The result: personal information from select users was publicly available for anyone to see. That information included:
- Funds available in user’s Steam wallet
- Purchase history
- Licenses and product key activations
- User’s country
- Saved payment methods (last 2 digits of saved credit card or PayPal email address)
- Email address
- Last four digits of the user’s mobile phone number
However, given how Valve protects forms and sessions on Steam, it is very unlikely that anyone could have taken actions on another user’s account, such as buying games, during the incident.
Valve’s Failures
- Failure to detect the issue in a timely fashion
Proper application health monitoring checks would have identified this issue within a minute of the faulty configuration being deployed. Usually, checks run on a regular basis to ensure that authenticated requests are not cached erroneously. Valve apparently did not have a check for this scenario in its monitoring suite.
- Changes made directly in production
Changes were made directly to production servers without first being tested in a staging environment. While I understand that there are situations where emergency fixes have to be deployed without going through the normal graduation procedure, far-reaching changes such as caching server configurations (which should seldom happen) shouldn’t be deployed straight to production without some testing beforehand.
- Failure to communicate with their customers
Valve has released a statement that does very little to reassure the users whose information was publicly available.
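On the first failure in the list, detection, the kind of check I have in mind could look like the sketch below. It assumes two synthetic probe accounts whose account pages each contain a known marker string, and a `fetch(url, user)` callable that performs an authenticated request through the public-facing proxies; both are hypothetical and would need to be wired to a real HTTP client and real test accounts.

```python
from typing import Callable, Dict, List


def find_cross_user_leaks(
    fetch: Callable[[str, str], str],  # hypothetical: fetch(url, user) -> response body
    url: str,
    markers: Dict[str, str],           # probe account -> string unique to that account
) -> List[str]:
    """Request the same personalized URL as each probe account and report mix-ups."""
    problems: List[str] = []
    for user, own_marker in markers.items():
        body = fetch(url, user)
        if own_marker not in body:
            problems.append(f"{user}: own marker missing from {url}")
        for other, other_marker in markers.items():
            if other != user and other_marker in body:
                problems.append(f"{user}: saw content belonging to {other} on {url}")
    return problems
```

Run every minute against a page such as the account page, any non-empty result would be a reason to page someone immediately, since it means one account has been served another account's content.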
Valve’s Statement
Since I wrote this blog post, Valve has published a technical explanation of what went wrong on the Steam website. It is in line with my understanding of the problem.
We’d like to follow up with more information regarding Steam’s troubled Christmas.
What happened
On December 25th, a configuration error resulted in some users seeing Steam Store pages generated for other users. Between 11:50 PST and 13:20 PST store page requests for about 34k users, which contained sensitive personal information, may have been returned and seen by other users.
The content of these requests varied by page, but some pages included a Steam user’s billing address, the last four digits of their Steam Guard phone number, their purchase history, the last two digits of their credit card number, and/or their email address. These cached requests did not include full credit card numbers, user passwords, or enough data to allow logging in as or completing a transaction as another user.
If you did not browse a Steam Store page with your personal information (such as your account page or a checkout page) in this time frame, that information could not have been shown to another user.
Valve is currently working with our web caching partner to identify users whose information was served to other users, and will be contacting those affected once they have been identified. As no unauthorized actions were allowed on accounts beyond the viewing of cached page information, no additional action is required by users.
How it happened
Early Christmas morning (Pacific Standard Time), the Steam Store was the target of a DoS attack which prevented the serving of store pages to users. Attacks against the Steam Store, and Steam in general, are a regular occurrence that Valve handles both directly and with the help of partner companies, and typically do not impact Steam users. During the Christmas attack, traffic to the Steam store increased 2000% over the average traffic during the Steam Sale.
In response to this specific attack, caching rules managed by a Steam web caching partner were deployed in order to both minimize the impact on Steam Store servers and continue to route legitimate user traffic. During the second wave of this attack, a second caching configuration was deployed that incorrectly cached web traffic for authenticated users. This configuration error resulted in some users seeing Steam Store responses which were generated for other users. Incorrect Store responses varied from users seeing the front page of the Store displayed in the wrong language, to seeing the account page of another user.
Once this error was identified, the Steam Store was shut down and a new caching configuration was deployed. The Steam Store remained down until we had reviewed all caching configurations, and we received confirmation that the latest configurations had been deployed to all partner servers and that all cached data on edge servers had been purged.
We will continue to work with our web caching partner to identify affected users and to improve the process used to set caching rules going forward. We apologize to everyone whose personal information was exposed by this error, and for interruption of Steam Store service.
Video Explanation
While I was writing this article, Tom Scott released a short video to explain the issue.