Performance testing Firefox OS on reference devices

A while back I wrote about the LEGO harness I created for Eideticker to hold both the device and camera in place. Since then there have been a couple of iterations of the harness. When we started testing against our low-cost prototype device, the harness needed modifying due to the difference in size and the position of the USB socket. At this point I tried to create a harness that would fit all of our current devices, in the hope of avoiding another redesign.

Eideticker harness v2.0

If you’re interested in creating one of these yourself, here’s the LEGO Digital Designer file and building guide.

Unfortunately, when I first got my hands on our reference device (codenamed ‘Flame’) it didn’t fit into the harness. I had to go back to the drawing board, and needed to be a little more creative because the device’s width doesn’t match up well with the dimensions of LEGO bricks. In the end I used some slope bricks (often used for roof tiles) to hold the device securely. A timelapse video of constructing the latest harness follows.

[Timelapse video: constructing the new Eideticker harness]

We are now 100% focused on testing against our reference device, so in London we have two devices dedicated to running our Eideticker tests, as shown in the photo below.

Eideticker harness for Flame

Again, if you want to build one of these for yourself, download the LEGO Digital Designer file and building guide. If you want to learn more about the Eideticker project check out the project page, or if you want to see the dashboard with the latest results, you can find it here.

Hunting for performance regressions in Firefox OS

At Mozilla we’re running performance tests against Firefox OS devices several times a day, and you can see these results on our dashboard. Unfortunately these tests take a while to run, which means we’re not able to run them against each and every push, so when a regression is detected we can have a tough time determining the cause.

We do of course have several different types of performance testing, but for the purposes of this post I’m going to focus on the cold launch of applications as measured by b2gperf. This particular test launches 15 of the packaged applications (each one is launched 30 times) and measures how long each launch takes. Note that this is how long it takes to launch the app, and not how long it takes for the app to be ready to use.
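
As a very rough illustration, a run of this type can be started from the command line with something like the following; the option names and values here are assumptions for the sake of example rather than our exact configuration:

# Launch each listed app repeatedly and report the launch times
# (option values assumed for illustration).
b2gperf --delay=10 --iterations=30 Settings Clock Calendar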

In order to assist with tracking down performance regressions I have written a tool to discover any Firefox OS builds generated after the last known good revision and before the first known bad revision, and trigger additional tests to fill in the gaps. The results are sent via e-mail for the recipient to review and either revise the regression range or (hopefully) identify the commit that caused the regression.

Before I talk about how to use the tool, there’s a rather important prerequisite. As our continuous integration solution involves Jenkins, you will need to have access to an instance with at least one job configured specifically for this purpose.

The simplest approach is to use our Jenkins instance, which requires Mozilla-VPN access and access to our tinderbox builds. If you have these you can use the instance running at http://selenium.qa.mtv2.mozilla.com:8080 and the b2g.hamachi.perf job.

Even if you have access to our Jenkins instance and the device builds, you may still want to set up a local instance. This will allow you to run the tests without tying up the devices we have dedicated to running these tests, and you won’t be contending for resources. If you’re going to set up a local instance you will of course need at least one Firefox OS device and access to tinderbox builds for the device.

You can download the latest long-term support release (recommended) of Jenkins from here. Once you have that, run java -jar jenkins.war to start it up. You’ll be able to see the dashboard at http://localhost:8080 where you can create a new job. The job must accept the following parameters, which are sent by the command line tool when it triggers jobs.

  • BUILD_REVISION – This will be populated with the revision of the build that will be tested.
  • BUILD_TIMESTAMP – A formatted timestamp of the selected build, for inclusion in the e-mail notification.
  • BUILD_LOCATION – The URL of the build to download.
  • APPS – A comma-separated list of the names of the applications to test.
  • NOTIFICATION_ADDRESS – The e-mail address to send the results to.

Your job can then use these parameters to run the desired tests. There are a few things I’d recommend, which we’re using for our instance. If you have access to our instance it may also make sense to use the b2g.hamachi.perf job as a template for yours:

  • Install the Workspace Cleanup plugin, and wipe out the workspace before your build starts. This will ensure that no artifacts left over from a previous build will affect your results.
  • Use the Build Timeout plugin with a reasonable timeout to prevent a failing device flash from stalling indefinitely.
  • It’s likely that the job will download the build referenced in $BUILD_LOCATION so you’ll need to make sure you include a valid username and password. You can inject passwords to the build as environment variables to prevent them from being exposed.
  • The build files often include a version number, which you won’t want to hard-code as it will change every six weeks. The first snippet after this list shows how wget can download the file using a wildcard.
  • Depending on the tests you’ll be running, you’ll most likely want to split the $APPS variable and run your main command against each entry. The second snippet after this list shows how we’re doing this for running b2gperf.
  • With the Email-ext plugin, you can customise the content and triggers for the e-mail notifications. For our instance I have set it to always trigger, and to attach the console log. For the content, I have included the various parameters as well as used the following token to extract the b2gperf results: ${BUILD_LOG_REGEX, regex=".* Results for (.*)", maxMatches=0, showTruncatedLines=false, substText="$1"}
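
Here are sketches of both snippets. They assume $BUILD_LOCATION points at the directory containing the build, that the credentials are injected as $USERNAME and $PASSWORD, and that b2gperf is run with a fixed delay; adjust the file name pattern and options to match your own builds and tests.

# Download the build without hard-coding the version number in the file name.
# The -A pattern is an assumption; change it to match your device builds.
wget -r -l1 -nd -np -A "b2g-*.android-arm.tar.gz" \
  --user="$USERNAME" --password="$PASSWORD" "$BUILD_LOCATION"

# Split the comma-separated $APPS parameter and run b2gperf once per app.
echo "$APPS" | tr ',' '\n' | while read -r APP; do
  b2gperf --delay=10 "$APP"
done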

Once you have a suitable Jenkins instance and job available, you can move onto triggering your tests. The quickest way to install the b2ghaystack tool is to run the following in a terminal:
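
This is a sketch of the install command, assuming the tool is published in a Git repository at github.com/davehunt/b2ghaystack; the actual location may differ:

# Repository location assumed; substitute the real one if it differs.
pip install git+https://github.com/davehunt/b2ghaystack.git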

Note that this requires you to have Python and Git installed. I would also recommend using virtual environments to avoid polluting your global site-packages.

Once installed, you can get the full usage by running b2ghaystack --help but I’ll cover most of these by providing the example of taking a real regression identified on our dashboard and narrowing it down using the tool. It’s worth calling out the --dry-run argument though, which will allow you to run the tool without actually triggering any tests.

The tool takes a regression range and determines all of the pushes that took place within the range. It will then look at the tinderbox builds available and try to match them up with the revisions in the pushes. For each of these builds it will trigger a Jenkins job, passing the variables mentioned above (revision, timestamp, location, apps, e-mail address). The tool itself does not attempt to analyse the results, and neither does the Jenkins job. By passing an e-mail address to notify, we can send an e-mail for each build with the test results. It is then up to the recipient to review and act on them. Ultimately we may submit these results to our dashboard, where they can fill in the gaps between the existing results.

The regression I’m going to use in my example was from February, when we actually had an issue preventing the tests from running for a week. When the issue was resolved, the regression presented itself. This is an unusual situation, but it serves as a good example given the very wide regression range.

Below you can see a screenshot of this regression on our B2G dashboard. The regression is also available to see on our generic dashboard.

Performance regression shown on the B2G dashboard

It is necessary to determine the last known good and first known bad gecko revisions in order to trigger tests for builds in between these two points. At present, the dashboard only shows the git revisions for our builds, but we need to know the mercurial equivalents (see bug 979826). Both revisions are present in the sources.xml available alongside the builds, and I’ve been using this to translate them.
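
For example, the gecko revision can be pulled out of a build’s sources.xml with something like the following; $SOURCES_URL is a stand-in for wherever the file lives alongside the build, and the exact XML layout may differ:

# Print the line describing the gecko project; its revision attribute is the
# mercurial changeset we need.
wget -q -O - "$SOURCES_URL" | grep gecko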

For our regression, the last known good revision was 07739c5c874f from February 10th, and the first known bad was 318c0d6e24c5 from February 17th. I first ran this against the mozilla-central branch:
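
Reconstructed from the arguments described below, the command looked roughly like this (the username and password are placeholders, and the exact syntax may differ slightly in the released tool):

b2ghaystack -b mozilla-central --eng -a Settings \
  -u username -p password \
  -j http://localhost:8080 -e dhunt@mozilla.com \
  hamachi b2g.hamachi.perf 07739c5c874f 318c0d6e24c5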

  • -b mozilla-central specifies the target branch to discover tinderbox builds for.
  • --eng means the builds selected will have the necessary tools to run my tests.
  • -a Settings limits my test to just the Settings app, as it’s one of the affected apps, and means my jobs will finish much sooner.
  • -u username and -p password are my credentials for accessing the device builds.
  • -j http://localhost:8080 is the location of my Jenkins instance.
  • -e dhunt@mozilla.com is where I want the results to be sent.
  • hamachi is the device I’m testing against.
  • b2g.hamachi.perf is the name of the job I’ve set up in Jenkins. Finally, the last two arguments are the good and bad revisions as determined previously.

This discovered 41 builds, but to prevent overloading Jenkins the tool only triggers a maximum of 10 builds (this can be overridden using the -m command line option). The ten builds were spread evenly across the 41, and covered the range f98c5c2d6bba:4f9f58d41eac.

Here’s an example of what the tool will output to the console:

None of these builds replicated the issue, so I took the last revision, 4f9f58d41eac, and ran again in case there were further suitable builds that had previously been skipped due to the maximum of 10:

This time no builds matched, so I wasn’t going to be able to reduce the regression range using the mozilla-central tinderbox builds. I moved on to the mozilla-inbound builds, and used the original range:

Again, no builds matched. This is most likely because we only retain the mozilla-inbound builds for a short time. I moved on to the b2g-inbound builds:
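
This was the same command as before with only the branch switched, again as a sketch:

b2ghaystack -b b2g-inbound --eng -a Settings \
  -u username -p password \
  -j http://localhost:8080 -e dhunt@mozilla.com \
  hamachi b2g.hamachi.perf 07739c5c874f 318c0d6e24c5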

This found a total of 187 builds within the range 932bf66bc441:9cf71aad6202, and 10 of these ran. The very last one replicated the regression, so I ran again with the new revisions:

This time there were 14 builds, and 10 ran. The penultimate build replicated the regression. Just in case I could narrow it down further, I ran with the new revisions:

No builds matched, so I had my final regression range. The last good build was with revision b2085eca41a9 and the first bad build was with revision e9055e7476f1. This results in a pushlog with just four pushes.

Of these pushes, one stood out as a possible cause for the regression:

    Bug 970895: Use I/O loop for polling memory-pressure events, r=dhylands
    The code for polling sysfs for memory-pressure events currently runs on a separate thread. This patch implements this functionality for the I/O thread. This unifies the code base a bit and also safes some resources.

It turns out this was reverted for causing bug 973824, which was a duplicate of bug 973940. So, regression found!

Here’s an example of the notification e-mail content that our Jenkins instance will send:

Hopefully this tool will be useful for determining the cause of regressions much sooner than we are currently able to. I’m sure there are various improvements we could make to this tool – this is very much a first iteration! Please file bugs and CC or needinfo me (:davehunt), or comment below if you have any thoughts or concerns.

Building a harness for Eideticker… with LEGO

Since July I’ve been getting involved with the Eideticker project, which aims to measure response times and frame rates for both Firefox for Android and Firefox OS. I’ve mostly been involved with the Firefox OS work, which involves pointing a camera at a mobile device while tests run, and then processing the captured video.

Eideticker components
All the components, including the prototype phone case, before I started building the replacement.

One of the frustrating challenges is setting up the device and camera so they’re suitably positioned for the capture. The camera has a standard tripod mount, so we’ve been using the awesome Gorillapod, but the devices we’re using don’t have many compatible stands. So, seeing as I am a bit of a LEGO fanatic, I decided to see if I could build a suitable harness in my spare time.

An initial prototype for holding the phone didn’t take me too long to put together – and worked really well – so I decided to use LEGO’s Pick-A-Brick service to order all the parts I needed to build it without using parts from my own supply.

Complete prototype of the Eideticker harness.

Other than unexpectedly finding two tiny white cupboard drawers(!) in my Pick-A-Brick order, the new case was perfect! A prototype for holding the PointGrey camera in place also didn’t take too long to put together, once I’d worked out the ideal distance from the phone and the ideal height.

Once again I used the LEGO Digital Designer to create a more polished version, and went to the Pick-A-Brick service to order the parts. These arrived just today, so I put together the final version. As you may notice from the photo of the complete prototype, I had been using Blu-Tack to fix the camera in place; for the final version I glued a 2×2 flat tile to the tripod mount that came with the camera.

Final version of the Eideticker harness.

This was the only irreversible part of the build, so I was a little nervous about doing it. I first sanded the surface of the tile so it had more surface area, applied a small amount of glue to the tripod mount, and pressed the tile into place. Of course if it had gone wrong, I would only have needed to order a new tripod mount – obviously I would not recommend gluing anything directly to the camera!

If you’re interested in seeing Eideticker in action and you happen to be attending the Mozilla Summit in Brussels then I will be taking the harness with me for demonstrations. If you’re interested in building the harness for yourself, the following resources will be helpful:

Also, here are a few more photos and screenshots of the LEGO Digital Designer creations. All photos were taken with my ZTE Open running Firefox OS:

Running the ‘Mem Buster’ endurance test

I blogged a few weeks ago about how I was able to demonstrate improvements to the memory usage of Firefox using endurance tests. The test I was using was inspired by Stuart Parmenter’s Mem Buster test, and it has now been checked into the repository and is available for anyone to run.

At the moment it’s only possible to run the Mem Buster test from the command line (hopefully Mozmill Crowd support won’t be too far off). The first thing you’ll need is the mozmill-automation repository checked out. If you already have this then you’ll need to do a pull and update to make sure you have the latest changes.

hg clone http://hg.mozilla.org/qa/mozmill-automation

Then, from the mozmill-automation directory, run the following command:

./testrun_endurance.py --reserved=membuster --delay=3 --iterations=2 --entities=100 --report=http://mozmill-crowd.brasstacks.mozilla.com/db/ /Applications/Firefox.app

The reserved argument tells the script to only run the Mem Buster test and not the general endurance tests. The test opens a site for each entity, so by specifying 100 entities and 2 iterations it will open a total of 200 sites. The delay of 3 seconds is from the original Mem Buster test. I would recommend including the report argument as this shares your results and allows you to see the visualisation of memory usage during the test. The final argument is the location of the version of Firefox you want to run the tests against.

Below is a screencast demonstrating the Mem Buster endurance test on Windows 7:

If you’re interested in following the progress of the endurance tests project, check out the project page. For further help you can find the documentation here, post a comment to this entry, or ask a question in the QMO forums.