Bypassing the "testcookie" anti-webscraping protection

A few days ago, I noticed that ApkTrack (an Android app I maintain) could no longer query one of the websites it usually obtains data from.
The app works mostly through web scraping and, once in a while, the target websites set up new countermeasures to prevent bots from accessing their content (even innocuous bots such as this app). In this post, we'll see how I bypassed the protection I encountered this weekend.

It all began when I noticed that a website (whose identity will not be disclosed) returned the following script in lieu of the expected data:

    <script type="text/javascript" src="/aes.min.js"></script>
    <script>
        function toNumbers(d) {
            var e = [];
            d.replace(/(..)/g, function(d) {
                e.push(parseInt(d, 16))
            });
            return e
        }
        function toHex() {
            for (var d = [], d = 1 == arguments.length && arguments[0].constructor == Array ? arguments[0] : arguments, e = "", f = 0; f < d.length; f++) e += (16 > d[f] ? "0" : "") + d[f].toString(16);
            return e.toLowerCase()
        }
        var a = toNumbers("5d026cff5942d1ab28e3757e4b2e2f87"),
            b = toNumbers("845dd1e672b840c246aa8cfe9b5d3632"),
            c = toNumbers("e48176221e1325e09b9a959370446f05");
        var now = new Date(),
            time = now.getTime();
        time += 3600 * 1000 * 24;
        document.cookie = "BKS=" + toHex(slowAES.decrypt(c, 2, a, b)) + "; expires=" + now.toUTCString() + "; path=/";
        location.href = "";
    </script>

It's plain to see that this script uses a slow AES implementation to generate a cookie required to browse the target website. I notice that the a, b and c variables of the above script change with every try, and while they kind of look like MD5 hashes, none of them can be reversed easily. Time to dig in.
Ideally, I'd like to read the code which generates these values. I'm in luck: a quick search points me to an nginx module called testcookie.

Reading through the 2000-something lines of code is made difficult by the numerous macros coming from nginx, but I understand the following:

  • a and b are the key and initialization vector (respectively) used for the AES-CBC computation; c is the data to decipher.
  • The latter is generated the following way: c = AES(MD5($testcookie_session + $testcookie_secret)), those two variables being defined in the nginx configuration. More precisely:
    • According to the documentation, testcookie_session can either be the visitor's IP address, or their IP concatenated with the browser's user-agent (e.g. (X11; Ubuntu; Linux x86_64; [...]). This part is predictable and can be generated easily.
    • testcookie_secret however is an unknown value. It can be fixed, or random (in which case it changes every time the web server is rebooted).
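For reference, here is a minimal sketch of what the browser-side script computes, written for Node.js with its built-in crypto module instead of slowAES. It assumes AES-CBC with a 128-bit key (a is 16 bytes long) and a single unpadded block of ciphertext; the no-padding part is my assumption, not something read from the page. The names fromHex and solveChallenge are made up for this illustration.

```javascript
const crypto = require("crypto");

// Same role as toNumbers() in the challenge page: hex string -> bytes.
function fromHex(hex) {
  return Buffer.from(hex, "hex");
}

// Decrypts c with key a and IV b using AES-128-CBC and returns the
// plaintext as lowercase hex, i.e. the value the page stores in the cookie.
// Assumption: the ciphertext is one 16-byte block with no padding.
function solveChallenge(aHex, bHex, cHex) {
  const decipher = crypto.createDecipheriv("aes-128-cbc", fromHex(aHex), fromHex(bHex));
  decipher.setAutoPadding(false);
  const plain = Buffer.concat([decipher.update(fromHex(cHex)), decipher.final()]);
  return plain.toString("hex");
}
```

Calling solveChallenge with the three hex strings from the returned page should yield the cookie value, provided the assumptions above hold for the target server.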

There are basically two ways to bypass this protection. The first way would be to run the JavaScript code just like a browser would. The second way is to somehow guess what the cookie's value is expected to be. The former implies a lot of overhead in my tiny Android app, so I start looking into the latter.
I need to find out how testcookie_session is generated on the target website, since it is configuration-dependent. That part is easy: I take another browser, navigate to the website and compare the cookies: they're identical. This means that only the IP address is used, because a different user-agent would otherwise have produced a different cookie. Next, I have to guess testcookie_secret's value. Here is what we know:

  • I know a valid cookie just by visiting the website: 64534e58cbc178830089d06de12c00ed.
  • My IP address at the time was
  • We have established that 64534e58cbc178830089d06de12c00ed = MD5("" + testcookie_secret).

This is a textbook bruteforce situation. I fire up Hashcat:

    PS C:\Users\Ivan\oclHashcat-1.33> .\oclHashcat64.exe -m0 .\targets\site.txt -a7 .\dicts\wordlist.txt
    oclHashcat v1.33 starting...

The -a7 option corresponds to a hybrid attack, which means that every word from the dictionary is prefixed with an arbitrary string (here, my IP address). After a while, Hashcat proudly announces the result: testcookie_secret = keepmesecret.
I actually guessed that value before the bruteforce had ended for a simple reason: keepmesecret is the example value given in the documentation and I had tested it manually. When in doubt, always assume the sysadmin was lazy.
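For a short list of guesses, the same check can be done by hand. The sketch below is a Node.js illustration, not what Hashcat does internally: it hashes the session string followed by each candidate secret and compares the result with the observed cookie. The names md5Hex and findSecret are made up here, and 203.0.113.7 is a placeholder address from the documentation range, not my real IP.

```javascript
const crypto = require("crypto");

// Hex-encoded MD5 of a string.
function md5Hex(text) {
  return crypto.createHash("md5").update(text, "utf8").digest("hex");
}

// Returns the first candidate for which MD5(session + candidate)
// equals the observed cookie, or null if none matches.
function findSecret(observedCookie, session, candidates) {
  const target = observedCookie.toLowerCase();
  for (const candidate of candidates) {
    if (md5Hex(session + candidate) === target) {
      return candidate;
    }
  }
  return null;
}

// Usage with placeholder values: build a cookie, then recover the secret.
const demoCookie = md5Hex("203.0.113.7" + "keepmesecret");
console.log(findSecret(demoCookie, "203.0.113.7", ["secret", "changeme", "keepmesecret"]));
```

Trying the defaults found in the documentation first costs a few microseconds and, as seen here, may save the whole bruteforce run.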

We now have everything needed to forge our cookies, and computing an MD5 hash before each request is all it takes to bypass the protection.
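As a rough sketch of that last step (in Node.js here, the app itself being an Android app), the cookie can be rebuilt from the client's public IP address and the recovered secret. The function name forgeCookieHeader is made up for this example; the BKS cookie name comes from the script shown earlier, and the scheme assumes the site keys the session on the IP address alone.

```javascript
const crypto = require("crypto");

// Builds the value of the Cookie request header expected by the site:
// <cookieName>=MD5(ip + secret), hex-encoded.
// Assumption: testcookie_session is only the client's public IP address.
function forgeCookieHeader(ip, secret, cookieName = "BKS") {
  const value = crypto.createHash("md5").update(ip + secret, "utf8").digest("hex");
  return `${cookieName}=${value}`;
}
```

The returned string would be sent as the Cookie header of each request. Note that the IP must be the address the server sees, so the value has to be recomputed whenever the device changes network.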

EDIT: Following this post, testcookie_secret's minimum size has been increased to 32 characters in the latest version of the module.
