Introducing detectem

I've created a project named detectem to solve the issues described in the previous post. Let's review its main strengths and features, as well as its short-term and long-term roadmap.

Technology

detectem is an open-source project written in Python and powered by Splash, an open-source project developed by Scrapinghub that renders web pages and offers many useful features, including JavaScript support and a convenient API.
detectem uses Splash to render the URL and gets the list of requests/responses (as a list of HAR entries) that the browser sent and received to render the page completely.
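To make that data concrete, here is a minimal sketch of how a list of HAR entries could be reduced to the pieces a detector cares about. The key layout (log, entries, request.url, response.headers, response.content.text) follows the HAR format; the function name summarize_har and the sample data are hypothetical and are not part of detectem.

```python
def summarize_har(har):
    """Reduce a HAR document to (url, headers, body) tuples.

    The function name and the return shape are illustrative assumptions.
    """
    summary = []
    for entry in har.get("log", {}).get("entries", []):
        request = entry.get("request", {})
        response = entry.get("response", {})
        # HAR stores headers as a list of {"name": ..., "value": ...} dicts.
        headers = {
            header["name"].lower(): header["value"]
            for header in response.get("headers", [])
        }
        # The body is optional in HAR. This sketch assumes plain text,
        # not base64-encoded content.
        body = response.get("content", {}).get("text", "")
        summary.append((request.get("url", ""), headers, body))
    return summary


# Hand-written sample, not real output from Splash.
sample_har = {
    "log": {
        "entries": [
            {
                "request": {"method": "GET", "url": "http://example.com/"},
                "response": {
                    "status": 200,
                    "headers": [{"name": "Server", "value": "nginx/1.1.19"}],
                    "content": {"mimeType": "text/html", "text": "<html></html>"},
                },
            }
        ]
    }
}

for url, headers, body in summarize_har(sample_har):
    print(url, headers.get("server"), len(body))
```

Each tuple carries the three things the detection methods described later rely on: the requested URL, the response headers, and the response body.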

Having the list of requests and responses gives detectem a lot of power. Most current detectors run regular expressions against the response body, so they are prone to false positives. Consider a page with the following content:

<html>
It's an article about Jquery.

To install, please add to your page:

<xmp>
<script src="https://code.jquery.com/jquery-3.1.1.js"></script>
</xmp>

Then, play!
</html>

Both Wappalyzer and WhatWeb incorrectly detected jQuery, even though the page neither uses nor loads it. This kind of error happens when regular expressions are matched without regard to context (escaped code, comments, etc.). detectem behaved correctly and did not detect jQuery, because it works a bit differently.
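To see why a context-free regular expression misfires on this page, here is a minimal sketch. The pattern is a hypothetical stand-in, not the actual rule used by Wappalyzer or WhatWeb.

```python
import re

# The sample page from above: jQuery only appears inside <xmp>, it is never loaded.
PAGE = """<html>
It's an article about Jquery.

To install, please add to your page:

<xmp>
<script src="https://code.jquery.com/jquery-3.1.1.js"></script>
</xmp>

Then, play!
</html>"""

# Hypothetical context-free pattern applied to the raw page body.
NAIVE_PATTERN = re.compile(r"jquery-(?P<version>[0-9.]+)(\.min)?\.js")


def naive_detect(html):
    """Return the version found by searching the raw page body, or None."""
    match = NAIVE_PATTERN.search(html)
    return match.group("version") if match else None


# Reports a version even though the browser never requested the library.
print(naive_detect(PAGE))  # 3.1.1
```

The match is technically correct as a string search, but it says nothing about whether the script was ever executed.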

How it works

detectem is a command-line program that detects web software and its versions. It's based on a system of plugins, like WhatWeb, and on sets of tests.

$ det http://www.fayerwayer.com
[('nginx', '1.1.19'), ('jquery', '1.8.3'), ('moment.js', '2.8.2')]

As we saw in the previous post, both Wappalyzer and WhatWeb detected only Nginx. detectem detects more software because it has JavaScript support provided by Splash. It currently has only 4 plugins, and 3 of them were identified as present in FayerWayer.

Currently, detectem supports detection through:

Patterns in the URL
Patterns in the response body
Patterns in headers

It works in a particular way to (try to) avoid false positives and be as robust as possible. I'm going to explain why both detectors failed in the previous example and detectem succeeded.

detectem works on the list of requests/responses made by the browser to render the web page. URL matching is therefore applied to the list of requested URLs. In the common case, libraries requested by the browser are loaded in the DOM, so with good probability we can say that a library is used by the website. That's why detectem didn't fail in the previous example: jQuery wasn't in the list of requests/responses.
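A minimal sketch of URL matching over the requested URLs follows. The two patterns are the url matchers of the jQuery plugin shown later in this post; the function name detect_from_requests and the example.com URLs are hypothetical.

```python
import re

# URL matchers taken from the jQuery plugin shown later in this post.
URL_MATCHERS = [
    r"/jquery/(?P<version>[0-9\.]+)/jquery(\.min)?\.js",
    r"/jquery-(?P<version>[0-9\.]+)(\.min)?\.js",
]


def detect_from_requests(requested_urls):
    """Return the first version matched in the requested URLs, or None."""
    for url in requested_urls:
        for pattern in URL_MATCHERS:
            match = re.search(pattern, url)
            if match:
                return match.group("version")
    return None


# The article page only requests itself: escaped code triggers no request.
print(detect_from_requests(["http://example.com/article.html"]))  # None

# A page that really loads jQuery has the library among its requests.
print(detect_from_requests([
    "http://example.com/",
    "http://ajax.googleapis.com/ajax/libs/jquery/1.8.3/jquery.min.js",
]))  # 1.8.3
```

Because the input is what the browser actually fetched, text that merely mentions a library never reaches the matchers.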

Body patterns are matched against the list of responses. For instance, a site using jQuery 3.1.1 serves a file with a clear signature:

/*! jQuery v3.1.1 | (c) jQuery Foundation | jquery.org/license */

We can look for that signature to confirm the jQuery version. Best of all, it doesn't involve any additional requests, since that content is already in the requests/responses list if the site uses the library.
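As a sketch of that idea, the snippet below searches response bodies that were already captured, so no extra request is made. The pattern is the first body matcher of the jQuery plugin shown later in this post; the function name version_from_bodies and the sample bodies are hypothetical.

```python
import re

# Body matcher taken from the jQuery plugin shown later in this post.
BODY_MATCHER = re.compile(r"/\*\! jQuery v(?P<version>[0-9\.]+) \| \(c\)")


def version_from_bodies(bodies):
    """Return the version from the first body carrying the signature, or None."""
    for body in bodies:
        match = BODY_MATCHER.search(body)
        if match:
            return match.group("version")
    return None


# Hand-written response bodies standing in for the captured responses.
bodies = [
    "<html><script src='/js/jquery.min.js'></script></html>",
    "/*! jQuery v3.1.1 | (c) jQuery Foundation | jquery.org/license */\n"
    "!function(a,b){}",
]

print(version_from_bodies(bodies))  # 3.1.1
```

Note that the URL /js/jquery.min.js carries no version at all here, which is exactly the case where a body signature helps.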

In the absence of signatures, we could fall back to hash comparisons, as WhatWeb does. That feature is in the queue to be implemented.
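Since hash comparison is not implemented yet, the following is only a sketch of how it might look. The choice of SHA-256, the table of known files and the names examplelib and identify_by_hash are all assumptions made for illustration.

```python
import hashlib


def sha256_hex(content):
    """Return the hexadecimal SHA-256 digest of a bytes object."""
    return hashlib.sha256(content).hexdigest()


# Pretend this is a library build that carries no version signature.
LIBRARY_BYTES = b"(function(){/* unsigned build of a library */})();"

# Hypothetical lookup table: digest of a known file -> (name, version).
KNOWN_HASHES = {
    sha256_hex(LIBRARY_BYTES): ("examplelib", "2.0.0"),
}


def identify_by_hash(body):
    """Return (name, version) if the body matches a known file exactly."""
    return KNOWN_HASHES.get(sha256_hex(body))


print(identify_by_hash(LIBRARY_BYTES))      # ('examplelib', '2.0.0')
print(identify_by_hash(b"something else"))  # None
```

The trade-off is that a hash only matches byte-identical files, so any re-minified or bundled copy would go undetected.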

Tests

This kind of software needs a strong test suite, since a minor change in a matcher could leave sites undetected. For instance, this is the current jQuery plugin:


    matchers = [
        {'body': r'/\*\! jQuery v(?P<version>[0-9\.]+) \| \(c\)'},
        {'body': r'\* jQuery JavaScript Library v(?P<version>[0-9\.]+)'},
        {'url': r'/jquery/(?P<version>[0-9\.]+)/jquery(\.min)?\.js'},
        {'url': r'/jquery-(?P<version>[0-9\.]+)(\.min)?\.js'},
    ]

It uses body and url matchers. What happens when the first body matcher doesn't capture the full version but the second one does? If you move the second matcher to the first position, you could break detection on some sites. How can we make this kind of fix reliably? By running our changes against the tests. For the jQuery plugin, these are the tests (tests/plugins/fixtures/jquery.yml):

- plugin: jquery
  matches:
    - url: http://ajax.googleapis.com/ajax/libs/jquery/1.8.3/jquery.min.js
      version: 1.8.3
    - url: https://code.jquery.com/jquery-1.11.3.min.js
      version: 1.11.3
    - url: https://code.jquery.com/jquery-1.11.4.js
      version: 1.11.4
    - body: /*! jQuery v1.12.4 | (c) jQuery Foundation | jquery.org/license */
      version: 1.12.4
    - body: \* jQuery JavaScript Library v1.4.4
      version: 1.4.4

So when you make a change, you run the test suite, and if every test passes, the change is OK.


$ py.test tests/plugins/test_generic.py --plugin jquery
===== test session starts =====
platform linux -- Python 3.5.2, pytest-3.0.3, py-1.4.31, pluggy-0.4.0
rootdir: /tmp/detectem, inifile: 
collected 5 items 

tests/plugins/test_generic.py .....

===== 5 passed in 0.19 seconds =====
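The fixture-driven check above can be sketched in a few lines. This is not the code of tests/plugins/test_generic.py, only an illustration of the idea: the matchers and fixtures are copied from this post as Python literals, and the helper names first_version and run_fixtures are hypothetical.

```python
import re

# Matchers copied from the jQuery plugin, in the same order.
MATCHERS = [
    {"body": r"/\*\! jQuery v(?P<version>[0-9\.]+) \| \(c\)"},
    {"body": r"\* jQuery JavaScript Library v(?P<version>[0-9\.]+)"},
    {"url": r"/jquery/(?P<version>[0-9\.]+)/jquery(\.min)?\.js"},
    {"url": r"/jquery-(?P<version>[0-9\.]+)(\.min)?\.js"},
]

# The entries of jquery.yml written as Python literals.
FIXTURES = [
    {"url": "http://ajax.googleapis.com/ajax/libs/jquery/1.8.3/jquery.min.js",
     "version": "1.8.3"},
    {"url": "https://code.jquery.com/jquery-1.11.3.min.js", "version": "1.11.3"},
    {"url": "https://code.jquery.com/jquery-1.11.4.js", "version": "1.11.4"},
    {"body": "/*! jQuery v1.12.4 | (c) jQuery Foundation | jquery.org/license */",
     "version": "1.12.4"},
    {"body": r"\* jQuery JavaScript Library v1.4.4", "version": "1.4.4"},
]


def first_version(kind, text):
    """Apply the matchers of one kind in order; return the first version found."""
    for matcher in MATCHERS:
        pattern = matcher.get(kind)
        if pattern is None:
            continue
        match = re.search(pattern, text)
        if match:
            return match.group("version")
    return None


def run_fixtures():
    """Return a list of (input, found, expected) for every failing fixture."""
    failures = []
    for fixture in FIXTURES:
        kind = "url" if "url" in fixture else "body"
        found = first_version(kind, fixture[kind])
        if found != fixture["version"]:
            failures.append((fixture[kind], found, fixture["version"]))
    return failures


print(run_fixtures())  # [] when every fixture yields its expected version
```

Reordering or editing an entry in MATCHERS and rerunning run_fixtures shows immediately which inputs stop resolving to the expected version.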

Contributing

Adding your own plugin is easy. There are a few requirements:

Be compliant with detectem.plugin.IPlugin.
Be a subclass of detectem.plugin.Plugin.
As matchers, you can use functions or regular expressions with a named group called version.

The first is an interface that enforces certain attributes to be mandatory in the plugin. The base class Plugin has some useful methods to handle plugin data. There are many examples in detectem/plugins directory.

Along with that, a plugin must come with a test file before it can be merged into the master branch. Let's create an example plugin right now to make this clearer. We will save the plugin at detectem/plugins/example.py.

from detectem.plugin import Plugin


class ExamplePlugin(Plugin):
    name = 'example'
    matchers = [
        # A body matcher, so that it agrees with the body fixture in the test file.
        {'body': r'version is v(?P<version>[0-9\.]+)'},
    ]

Let's add the test file. Drop a YAML file in tests/plugins/fixtures/ and it will be automatically included in the test suite. In this case, it will be tests/plugins/fixtures/example.yml.

- plugin: example
  matches:
    - body: version is v1.1.1
      version: 1.1.1

Then, it's time to run the tests against the new plugin:

$ py.test tests/plugins/test_generic.py --plugin example
===== test session starts =====
platform linux -- Python 3.5.2, pytest-3.0.3, py-1.4.31, pluggy-0.4.0
rootdir: /tmp/detectem, inifile: 
plugins: bdd-2.18.1
collected 1 items 

tests/plugins/test_generic.py .

===== 1 passed in 0.19 seconds =====

The plugin is ready. Of course, if the plugin needs more thorough tests, they can be created in the tests directory.

Roadmap

The main purpose of detectem is to identify the software and its version. In the short term, there are some ideas to implement:

Increase the number of supported plugins; there are good sources to draw from.
Hash comparisons as in WhatWeb.
Support for special headers in the request.
Aggressive mode, as in WhatWeb, to get the version if it's not available in the current requests/responses (it's not for software discovery).
Support for lists of URLs.
Create the documentation.

In the long term:

Wait for Splash to fix cache control, then move from the current creation/destruction of a Docker container to a persistent container (it will speed up detectem a lot).
Research in the field of minifiers.
Improved detection of split bundles and of single libraries.

It's an open-source project under the MIT License. You're welcome to contribute, report errors, and propose new features or plugins in the detectem repository.