Extending Selenium with Image recognition

Posted on Sun 12 June 2016 in Selenium

What do we have?

Selenium is the de facto tool for functional web tests, yet it has its limitations. The standard API interacts only with the browser, so image-based applications are hard to test with Selenium.

In most cases, Selenium's capabilities meet the requirements of functional web application testing. A standard test can perform these operations inside the browser:

  • locate elements by the selector,
  • retrieve their state,
  • perform actions on UI.

But how do you test an application that uses canvas, Flash, or a complex DOM tree? Examples include Pixlr, the Ace editor, any maps application, and many other rich web applications. In our case, the application used a browser-based VNC client.


Testing a web application that uses a VNC client was not our only requirement. We had some extra requirements:

  • use Selenium to write the parts of a test that don't involve the VNC client;
  • be able to execute tests in many threads on different machines to speed up execution.

It is clear that Selenium's default features give us no way to test visual content on the page. Yet the features Selenium does provide are extremely important to our requirements. Our team uses Selenium Grid to distribute test runs, and we wanted to re-use the same environment for our future tests.

It's important to understand how tests are executed on Selenium Grid.

When a test wants to run on a Selenium Grid, it passes its configuration requirements for a node. The Selenium Hub knows all available node configurations. Using this information, it selects an appropriate node from the list and opens a session. After the session is opened, the test issues Selenium commands and sends them to the Hub. The Hub passes the commands to the node assigned to that test. The node runs the browser and executes the commands within that browser against the application under test.
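The node selection step can be pictured as matching the requested capabilities against what each node advertises. The sketch below is a simplified, hypothetical model: the class NodeMatcherSketch and its methods are invented for illustration and are not Selenium's actual matcher. It assumes a node qualifies only when it offers every requested key with an equal value.

```java
import java.util.Map;
import java.util.Objects;
import java.util.Optional;

// Hypothetical, simplified model of hub-side node selection (not Selenium code).
final class NodeMatcherSketch {

    /** Returns true when the node offers every capability the test asked for. */
    static boolean matches(Map<String, Object> nodeCapabilities, Map<String, Object> requested) {
        for (Map.Entry<String, Object> wanted : requested.entrySet()) {
            Object offered = nodeCapabilities.get(wanted.getKey());
            if (!Objects.equals(wanted.getValue(), offered)) {
                return false;
            }
        }
        return true;
    }

    /** Picks the first registered node that satisfies the request, if any. */
    static Optional<String> selectNode(Map<String, Map<String, Object>> nodes,
                                       Map<String, Object> requested) {
        for (Map.Entry<String, Map<String, Object>> node : nodes.entrySet()) {
            if (matches(node.getValue(), requested)) {
                return Optional.of(node.getKey());
            }
        }
        return Optional.empty();
    }
}
```

Under this model, a request carrying a custom capability is routed only to nodes that declare the same capability, and nodes without it are skipped.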

It is possible to extend Selenium Hub and Node with custom plugins. This can be done by creating a custom servlet and registering it inside the configuration file.

Sikuli as a missing part

There is a reason why Sikuli is a good tool for solving our problem: it can automate anything you see on the screen. Using image recognition, we would be able to locate GUI elements inside the VNC session. Sikuli has a Java API that also provides mouse and keyboard access. This is enough to implement the required GUI element interaction.

The idea was to extend each node with a custom plugin that would use the Sikuli Java API.

Tests would send remote commands to the hub, which would redirect them to the node where the client resides. This implementation required us to develop extensions for both the hub and the node, along with a client that tests use to call them.

Sikuli limitations

Sikuli requires a real screen or a virtual frame buffer; without one, it cannot perform any image recognition. It also means that overlapping windows on the screen would make image search fail.

This conflicted with the default Selenium Node configuration we had been using, in which the browser is brought to the front on every action. With many browsers open, windows overlap constantly.

To solve this problem, we configured each Selenium Node with a max session count of 1.

Max session is a parameter that sets how many browser instances can run in parallel on a Selenium Node.

Additionally, we assigned a dedicated display to each Selenium Node process in the VM. As a result, we had virtual machines running many Selenium Node processes, each limited to a single browser session on its own dedicated screen.

Sikuli needs image resources to be present on the file system where script execution happens. This required us to develop a file upload extension as well. It accepts a compressed file archive, extracts it to a random directory, and returns the path. The path is later used as a prefix to locate images for Sikuli commands.
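The extraction step of such an upload extension could be sketched as follows. This is a minimal illustration using only the Java standard library, not the code of the actual extension: the class name ImageBundleExtractor, the "sikuli-images-" directory prefix, and the choice of zip as the archive format are all assumptions.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

// Hypothetical helper: unpacks an uploaded image bundle on the node.
final class ImageBundleExtractor {

    /** Unpacks a zip stream into a fresh temporary directory and returns that directory. */
    static Path extract(InputStream archive) throws IOException {
        // A randomly named directory keeps bundles from different tests apart.
        Path targetDir = Files.createTempDirectory("sikuli-images-");
        try (ZipInputStream zip = new ZipInputStream(archive)) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                Path resolved = targetDir.resolve(entry.getName()).normalize();
                // Reject entries such as "../x" that would land outside the target directory.
                if (!resolved.startsWith(targetDir)) {
                    throw new IOException("Entry escapes target directory: " + entry.getName());
                }
                if (entry.isDirectory()) {
                    Files.createDirectories(resolved);
                } else {
                    Files.createDirectories(resolved.getParent());
                    // Copies the bytes of the current entry only; the zip stream stays open.
                    Files.copy(zip, resolved);
                }
            }
        }
        return targetDir;
    }
}
```

The returned directory plays the role of the path prefix: an image such as js_body.png would then be looked up as a file under that directory.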

Another problem was related to Selenium sessions. Each time a Selenium command passes through the hub, it updates the Selenium session. In our case, requests were bypassing this default behaviour. It was easy to fix, though, by touching the session object each time our custom hub extension redirects a request to the node.
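To see why touching the session matters, assume the hub treats a session as idle once no activity has been recorded for some timeout. The toy model below illustrates that assumption only; SessionActivityTracker is a hypothetical class and does not mirror the hub's real session object or API.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical model of idle-session tracking (not Selenium code).
final class SessionActivityTracker {
    private final Map<String, Long> lastActivityMillis = new ConcurrentHashMap<>();
    private final long timeoutMillis;

    SessionActivityTracker(long timeoutMillis) {
        this.timeoutMillis = timeoutMillis;
    }

    /** Records activity for a session; a custom extension would call this on every forwarded request. */
    void touch(String sessionId, long nowMillis) {
        lastActivityMillis.put(sessionId, nowMillis);
    }

    /** A session is expired when it is unknown or has been idle for longer than the timeout. */
    boolean isExpired(String sessionId, long nowMillis) {
        Long last = lastActivityMillis.get(sessionId);
        return last == null || nowMillis - last > timeoutMillis;
    }
}
```

In this model, a test that spends a long time issuing only image recognition commands would look idle unless each forwarded request also calls touch.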

Example of Selenium test with image recognition

Here is an example of a test using Selenium with the Sikuli extension. The code contains a few custom wrapper classes that provide an abstraction of a UI element and implicit retries during lookup, for cases when the UI is slower than our test.

As a first step, define the capabilities for the browser. Note the custom capability sikuliExtension set to true. With it, the test asks the Hub for a node that has the Sikuli extension installed.

private DesiredCapabilities desiredCapabilitiesForSeleniumNode() {
    DesiredCapabilities desiredCapabilities = new DesiredCapabilities();
    // Custom capability: request a node that has the Sikuli extension installed.
    desiredCapabilities.setCapability("sikuliExtension", true);
    return desiredCapabilities;
}

The next part is to construct the Sikuli extension client and upload the image bundle.

SikuliExtensionClient sikuliExtensionClient = new SikuliExtensionClient(
        GridSettings.HOST, GridSettings.PORT, remoteWebDriverSessionId);


SikuliHelper sikuliHelper = new SikuliHelper(sikuliExtensionClient);

The last part is element lookup and interaction. In this case, the image name acts as the element selector.

TextBox editor = sikuliHelper.findTextBox("js_body.png");
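The implicit retries during lookup mentioned earlier could work along these lines. This is a generic sketch and not the wrapper classes used in the test above: RetryingLookup, findWithRetry, and their parameters are hypothetical names, and the lookup itself is passed in as a function so the example stays independent of Sikuli.

```java
import java.util.Optional;
import java.util.function.Supplier;

// Hypothetical retry helper for lookups against a UI that may lag behind the test.
final class RetryingLookup {

    /**
     * Calls the lookup up to maxAttempts times, pausing between failed attempts,
     * and returns the first non-empty result or Optional.empty() if none succeeds.
     */
    static <T> Optional<T> findWithRetry(Supplier<Optional<T>> lookup,
                                         int maxAttempts,
                                         long pauseMillis) throws InterruptedException {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            Optional<T> result = lookup.get();
            if (result.isPresent()) {
                return result;
            }
            // No need to wait after the final attempt.
            if (attempt < maxAttempts) {
                Thread.sleep(pauseMillis);
            }
        }
        return Optional.empty();
    }
}
```

A wrapper such as findTextBox could delegate to a helper like this, so individual tests do not need explicit waits around every image search.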





By extending Selenium with an image recognition feature, we were able to create tests for the complex UI part of our application.

There was no need to rewrite any existing test step written with Selenium. It also saved us from setting up new infrastructure for a different tool, since we could re-use the existing Selenium Grid, which in turn reduced future maintenance effort.

You can find these extensions at sterodium.io