
Bypassing the "testcookie" anti-webscraping protection

A few days ago, I noticed that ApkTrack (an Android app I maintain) could no longer query one of the websites it usually obtains data from.
The app works mostly through web scraping, and once in a while the target websites set up new countermeasures to prevent bots from accessing their contents (even innocuous bots such as this app). In this post, we'll see how I bypassed the protection I encountered this weekend.

It all began when I noticed that a website (whose identity will not be disclosed) returned the following script in lieu of the expected data:


It's plain to see that this script uses a slow AES implementation to generate a cookie required to browse the target website. I notice that the three variables of the above script change with every try, and while they kind of look like MD5 hashes, none of them can be reversed easily. Time to dig in.
Ideally, I'd like to read the code which generates these values. I'm in luck: a quick search points me to an nginx module called testcookie.

Reading through the 2000-something lines of code is made difficult by the numerous macros coming from nginx, but I understand the following:

  • Two of the variables are the key and initialization vector (respectively) used for the AES-CBC computation; the third is the data to decipher.
  • The latter is generated by combining two values, both defined in the nginx configuration. More precisely:
    • According to the documentation, the first can either be the visitor's IP address, or their IP concatenated with the browser's user-agent. This part is predictable and can be generated easily.
    • The second, however, is a secret value unknown to the visitor. It can be fixed, or random (in which case it changes every time the web server is rebooted).
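
As a rough illustration of the derivation described above, here is a minimal Python sketch. It assumes the cookie is the hexadecimal MD5 digest of the visitor-specific string immediately followed by the secret, which is the relationship the brute-force later in this post relies on. The function name, the sample IP address and the sample secret are made up for illustration and do not come from the module or the target website.

```python
import hashlib
from typing import Optional


def expected_cookie(ip: str, secret: str, user_agent: Optional[str] = None) -> str:
    """Return the cookie value the server would presumably expect.

    Assumption: cookie = MD5(visitor_part + secret), hex-encoded, where
    visitor_part is the IP alone or the IP followed by the user-agent.
    """
    visitor_part = ip if user_agent is None else ip + user_agent
    return hashlib.md5((visitor_part + secret).encode("utf-8")).hexdigest()


# Hypothetical values, not the ones from the target website.
print(expected_cookie("203.0.113.7", "not-the-real-secret"))
print(expected_cookie("203.0.113.7", "not-the-real-secret", "Mozilla/5.0"))
```

If the user-agent is part of the input, two browsers on the same IP address get different cookies, which is a cheap way to tell the two configurations apart.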

There are basically two ways to bypass this protection. The first way would be to run the JavaScript code just like a browser would. The second way is to somehow guess what the cookie's value is expected to be. The former implies a lot of overhead in my tiny Android app, so I start looking into the latter.
I need to find out how the predictable part is generated on the target website, since it is configuration-dependent. That part is easy: I take another browser, navigate to the website and compare the cookies: they're identical. A different user-agent would have produced a different cookie if it were part of the input, so this means that only the IP address is used.
Next, I have to guess the secret's value. We face the following problem:

  • I know a valid cookie just by visiting the website.
  • I know what my IP address was at the time.
  • We have established that the cookie is the MD5 hash of my IP address followed by the secret.

This is a textbook brute-force situation. I fire up Hashcat:


The attack mode used here is a hybrid attack, which means that every word from the dictionary is prefixed with an arbitrary string (here, my IP address). After a while, Hashcat proudly announces the result.
I actually guessed that value before the brute-force had ended for a simple reason: it is the example value given in the documentation and I had tested it manually. When in doubt, always assume the sysadmin was lazy.
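
For readers without Hashcat at hand, the same prefix-plus-dictionary search can be sketched in a few lines of Python. This is only an illustration of the idea, not the command used above: the function name, the demo IP address, the demo secret and the tiny word list are all hypothetical, and it assumes the cookie is the hex-encoded MD5 of the IP address followed by the secret.

```python
import hashlib
from typing import Iterable, Optional


def find_secret(known_cookie: str, ip: str, candidates: Iterable[str]) -> Optional[str]:
    """Return the first candidate for which MD5(ip + candidate) matches the cookie."""
    target = known_cookie.lower()
    for word in candidates:
        digest = hashlib.md5((ip + word).encode("utf-8")).hexdigest()
        if digest == target:
            return word
    return None


# Hypothetical demo: build a cookie from a known secret, then recover it.
demo_ip = "203.0.113.7"
demo_cookie = hashlib.md5((demo_ip + "hunter2").encode("utf-8")).hexdigest()
wordlist = ["password", "letmein", "hunter2", "changeme"]
print(find_secret(demo_cookie, demo_ip, wordlist))
```

Putting the example values from the documentation at the top of the candidate list is the scripted equivalent of testing them by hand first.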

We now have everything needed to forge our cookies, and computing an MD5 hash before each request is all it takes to bypass the protection.
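
A minimal sketch of what "computing an MD5 hash before each request" could look like, written in Python rather than in the app's own code. The URL, the cookie name, the IP address and the secret are placeholders, since the real ones are deliberately not disclosed here. The request is only built, not sent.

```python
import hashlib
import urllib.request


def build_request(url: str, ip: str, secret: str, cookie_name: str) -> urllib.request.Request:
    """Build a request carrying a forged cookie.

    Assumption: the server expects cookie_name=MD5(ip + secret), hex-encoded.
    """
    cookie_value = hashlib.md5((ip + secret).encode("utf-8")).hexdigest()
    request = urllib.request.Request(url)
    request.add_header("Cookie", f"{cookie_name}={cookie_value}")
    return request


# Placeholder values only; the cookie name is hypothetical.
req = build_request(
    "https://example.com/data",
    "203.0.113.7",
    "not-the-real-secret",
    "hypothetical_cookie",
)
print(req.get_header("Cookie"))
```

The IP address fed to the hash has to be the one the server sees, so a client behind NAT or a proxy would need to know its public address first.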

EDIT: Following this post, the secret's minimum size has been increased to 32 characters in the latest version of the script.