Andrew Moore

Solutions Architect @ _nventive; Desktop, mobile and web developer; Tech enthusiast.

Steam’s Caching Disaster

An outside perspective on the technical aspects of Valve's networking incident of Dec 25, 2015 that led to private profile information of select customers being publicly available on the storefront.

Andrew Moore

Christmas day is often a very busy day for online services. Thousands of new customers usually flood these services trying to activate new products they received as gifts. Network staff on hand have to quickly react to the changing condition of their infrastructure to ensure that it stays up during that period. A downed service often gives a bad first impression to a new customer.

Steam is no exception to this phenomenon. Unfortunately, a bad reconfiguration of their caching proxies by Valve employees on Dec 25, 2015, led to private customer information being served publicly during a period of approximately 1 hour and 15 minutes.

Server-Side Caching

In large-scale systems, caching proxies are often used to accelerate the application. They prevent application servers from being hit with requests for content that seldom changes.

During a normal HTTP transaction, the client requests a resource from the application server. The application server then does its normal processing to build the page. This usually includes:

It then sends the rendered HTML to the client.

Adding a caching proxy between the application server and the client spares the application server from doing all this work for every request. Instead, the client requests a resource from the caching proxy. If the caching proxy has a cached version of the requested resource that isn’t considered stale (cached resources have expiry rules), it will serve it directly to the client. If not, it will ask the application server for it, store the response, and then serve that copy.
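The hit-or-miss logic described above can be sketched in a few lines of Python. This is only a toy illustration under simplifying assumptions: the class name `CachingProxy`, the `origin` callable standing in for the application server, and the single fixed `ttl_seconds` expiry rule are all hypothetical, and real proxies derive freshness per response rather than from one global value.

```python
import time
from typing import Callable, Dict, Tuple


class CachingProxy:
    """Toy in-memory caching proxy keyed by URL (illustrative only)."""

    def __init__(
        self,
        origin: Callable[[str], str],
        ttl_seconds: float,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.origin = origin            # stands in for the application server
        self.ttl_seconds = ttl_seconds  # assumed single expiry rule for all pages
        self.clock = clock              # injectable so expiry can be tested
        self._store: Dict[str, Tuple[float, str]] = {}

    def get(self, url: str) -> str:
        now = self.clock()
        entry = self._store.get(url)
        if entry is not None:
            stored_at, body = entry
            if now - stored_at < self.ttl_seconds:
                return body  # fresh copy: the application server is not contacted
        # Missing or stale: ask the application server, then keep the copy.
        body = self.origin(url)
        self._store[url] = (now, body)
        return body
```

With a 60 second expiry, two requests for the same URL within a minute reach the application server only once; a request after the entry goes stale triggers a new fetch.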

However, not all resources are cached by the proxy. Authenticated requests are usually not cached, as the result is usually different for each user. The same goes for form submissions (POST requests). It is possible to use Edge Side Includes to cache only part of a page; this allows you, for example, to cache article content without affecting the authenticated header. The cacheability of pages is usually determined by cache control HTTP headers.
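As a rough sketch of how a shared proxy might decide whether a response may be stored, the function below looks at the request method, the presence of credentials, and the `Cache-Control` response header. The function names are hypothetical, and the policy is deliberately conservative: treating any request that carries a `Cookie` header as authenticated is an assumption made for this example, not a rule of the HTTP specification.

```python
from typing import Dict, Mapping


def parse_cache_control(value: str) -> Dict[str, str]:
    """Split a Cache-Control header into lowercase directive -> argument."""
    directives: Dict[str, str] = {}
    for part in value.split(","):
        part = part.strip().lower()
        if not part:
            continue
        name, _, arg = part.partition("=")
        directives[name.strip()] = arg.strip().strip('"')
    return directives


def is_cacheable_by_shared_proxy(
    method: str,
    request_headers: Mapping[str, str],
    response_headers: Mapping[str, str],
) -> bool:
    """Conservative storage decision for a shared (multi-user) cache."""
    # Form submissions and other unsafe methods are never stored.
    if method.upper() not in ("GET", "HEAD"):
        return False

    req = {k.lower(): v for k, v in request_headers.items()}
    resp = {k.lower(): v for k, v in response_headers.items()}
    cc = parse_cache_control(resp.get("cache-control", ""))

    # The application explicitly forbids shared storage.
    if "no-store" in cc or "private" in cc:
        return False

    # Assumption: Authorization or Cookie means a per-user response,
    # stored only if the application explicitly opts in.
    authenticated = "authorization" in req or "cookie" in req
    if authenticated and "public" not in cc and "s-maxage" not in cc:
        return False

    # Require explicit caching information before storing anything.
    return "max-age" in cc or "s-maxage" in cc or "public" in cc
```

Under this policy, an anonymous `GET` answered with `Cache-Control: public, max-age=300` is stored, while an account page answered with `Cache-Control: private` is always passed through to the application server.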

Caching proxies are cheaper to deploy than full application servers. This makes it easy to deploy multiple caching proxies around the world and respond rapidly to requests regardless of the user’s location.

The Incident

Mid-afternoon on Dec 25, Steam went down due to load. Network staff made a change to the caching proxies in the hope of bringing the service back up.

Unfortunately, the change caused ALL pages (regardless of their cacheability headers) to be cached by the caching proxies. This meant that users requesting resources were seeing responses generated for other authenticated users. The bad configuration was live for an hour and 15 minutes before Valve reacted and turned the servers off.
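To make the failure mode concrete, here is a small simulation of a shared cache that ignores cacheability headers. It is not Valve's actual configuration, which was never published in detail; `SharedCache`, `account_page`, and the `respect_headers` flag are hypothetical names used only to show why caching per-user pages under a URL-only key leaks one user's page to the next visitor.

```python
from typing import Callable, Dict, Tuple

Response = Tuple[str, str]  # (body, Cache-Control header value)


def account_page(url: str, user: str) -> Response:
    """Stand-in application server: renders a per-user page, marked private."""
    return (f"Account page for {user}", "private, no-store")


class SharedCache:
    """Toy shared cache keyed by URL only, as a proxy in front of the store."""

    def __init__(
        self,
        origin: Callable[[str, str], Response],
        respect_headers: bool,
    ) -> None:
        self.origin = origin
        self.respect_headers = respect_headers  # False models the bad config
        self._store: Dict[str, str] = {}

    def get(self, url: str, user: str) -> str:
        if url in self._store:
            # The cache key has no notion of who is asking.
            return self._store[url]
        body, cache_control = self.origin(url, user)
        directives = {d.strip().lower() for d in cache_control.split(",")}
        may_store = not ({"private", "no-store"} & directives)
        if may_store or not self.respect_headers:
            self._store[url] = body
        return body


correct = SharedCache(account_page, respect_headers=True)
correct.get("/account", "alice")
print(correct.get("/account", "bob"))  # Account page for bob

broken = SharedCache(account_page, respect_headers=False)
broken.get("/account", "alice")
print(broken.get("/account", "bob"))   # Account page for alice
```

In the misconfigured case, the second visitor receives whatever page was rendered for the first visitor to hit that URL, which matches the symptoms users reported.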

This type of configuration error is pretty common in the industry. However, in almost every case, the faulty config is caught and corrected before it is promoted to the production servers.

Valve unfortunately either failed to catch the error before promoting it to prod or simply did not test before deploying. The result: personal information from select users was publicly available for anyone to see. That information included:

However, because of the way Valve protects forms and sessions on Steam, it is very unlikely that anyone could perform actions, such as buying games, on an account other than their own during the incident.

Valve’s Failures

Valve’s Statement

Since I wrote this blog post, Valve has published a technical explanation of what went wrong on Steam’s website. It is in line with my understanding of the problem.

We’d like to follow up with more information regarding Steam’s troubled Christmas.

What happened

On December 25th, a configuration error resulted in some users seeing Steam Store pages generated for other users. Between 11:50 PST and 13:20 PST store page requests for about 34k users, which contained sensitive personal information, may have been returned and seen by other users.

The content of these requests varied by page, but some pages included a Steam user’s billing address, the last four digits of their Steam Guard phone number, their purchase history, the last two digits of their credit card number, and/or their email address. These cached requests did not include full credit card numbers, user passwords, or enough data to allow logging in as or completing a transaction as another user.

If you did not browse a Steam Store page with your personal information (such as your account page or a checkout page) in this time frame, that information could not have been shown to another user.

Valve is currently working with our web caching partner to identify users whose information was served to other users, and will be contacting those affected once they have been identified. As no unauthorized actions were allowed on accounts beyond the viewing of cached page information, no additional action is required by users.

How it happened

Early Christmas morning (Pacific Standard Time), the Steam Store was the target of a DoS attack which prevented the serving of store pages to users. Attacks against the Steam Store, and Steam in general, are a regular occurrence that Valve handles both directly and with the help of partner companies, and typically do not impact Steam users. During the Christmas attack, traffic to the Steam store increased 2000% over the average traffic during the Steam Sale.

In response to this specific attack, caching rules managed by a Steam web caching partner were deployed in order to both minimize the impact on Steam Store servers and continue to route legitimate user traffic. During the second wave of this attack, a second caching configuration was deployed that incorrectly cached web traffic for authenticated users. This configuration error resulted in some users seeing Steam Store responses which were generated for other users. Incorrect Store responses varied from users seeing the front page of the Store displayed in the wrong language, to seeing the account page of another user.

Once this error was identified, the Steam Store was shut down and a new caching configuration was deployed. The Steam Store remained down until we had reviewed all caching configurations, and we received confirmation that the latest configurations had been deployed to all partner servers and that all cached data on edge servers had been purged.

We will continue to work with our web caching partner to identify affected users and to improve the process used to set caching rules going forward. We apologize to everyone whose personal information was exposed by this error, and for interruption of Steam Store service.

Video Explanation

While I was writing this article, Tom Scott released a short video to explain the issue.
