Friday 21 February 2014

A Continuous Integration Train Smash

If you are doing it right, your Jenkins server will very soon become essential to the functioning of every part of your business.

Jenkins will handle the testing, measurement, packaging and deployment of your code.

We have one Jenkins installation, which has grown in capability as it has in importance.

Like most organisations we grew our CI infrastructure organically. Developers from Android, iOS, Java core, front end and sysadmin teams all added jobs and plugins to support them. Some jobs are built on the server and some on slaves.

No record was kept of who installed which plugins, when, or why.

We were aware that we needed to back up this crucial piece of infrastructure, though we never did a recovery dry run.

We decided to add the configuration of the server to git: initially manually, and then using the SCM Sync configuration plugin. However, we did not test restoring from this.

After a while we noticed errors about git in the Jenkins logs and on screen. The errors in the logs came from some bad git config files, which we fixed manually. The problems with the SCM Sync configuration plugin were worse, and were tracked down to "Renaming job doesn't work with Git". The workaround given does work, but the plugin breaks in a very nasty and difficult-to-fix way which requires the server to be restarted. We decided to remove the plugin, even after fixing the current symptoms.

All was good to go: we had a working server, no errors in the logs, all clean and up to date.

Snatching defeat from the jaws of victory

Prior to the restart I had updated all plugins to their latest versions. This is something I have done many times over the last five years, and it has never failed. As the first question one is asked in forums is "Have you updated to the latest version?", it is a step I had, until now, taken for granted.

The following day, after running a few builds, Jenkins froze.

The last build to be run was a new, complex, important one, involving Virtual Machines, written by my boss.

I restarted the server, taking the opportunity to update three more plugins as I went.

Again it limped on for a while, then the UI froze.

We disabled all plugins (a very competent colleague had joined me by this point) by creating a .disabled file for each in the plugins directory:

for x in *.jpi; do touch "$x".disabled; done

Then we set about re-enabling them one letter at a time, repeating the following for a through z:


rm -v a*.jpi.disabled
sudo /etc/init.d/jenkins restart

This revealed that the problem was in a plugin starting with t, one of:

tasks.jpi
thinBackup.jpi
throttle-concurrents.jpi
token-macro.jpi
translation.jpi

Whilst it looked at first as though it might be token-macro.jpi, it then appeared not to be; the restarts were taking an increasing length of time.

At this point we decided that it would be better to revert to a backup.

The sysadmin team initiated the restore from backup, then discovered that there was still a process spinning at 100% CPU, and that it came from the throttle-concurrents plugin.

A quick google led to JENKINS-21044, a known Blocker issue. On the wiki this is flagged:

Warning!
The version has a "blocker" issue caused by threads concurrency. See JENKINS-21044 for more info.

It was, however, too late to stop the restore, which failed at 8.00pm.

By 7.00pm the following evening, Friday, after a day of reconfiguration by most of the developers, we were back to where we had been on Wednesday night.

The long tail of the event continues through Monday.

Friday 14 February 2014

Increasing Cucumber Test Speed using Annotations

Cucumber is a neat framework for writing English-language specifications, an approach sometimes called Behaviour Driven Development or Specification by Example.

We are using Cucumber to test an API, so we have perhaps strayed from the true path and used language which is not necessarily at the business-logic level; however, it does serve well as documentation for its intended audience.

One of the features of Cucumber is a test template, called a Scenario Outline, which enables one to parameterize a test with values:


  Scenario Outline: Calling by code gives full title
    Given a thing with code <code> and title <title>
    When calling "/rest/thing/<code>"
    Then expect json:
    """
      {"thing":{"code":"<code>", "title": "<title>"}}
    """
  Examples:
    |code|title    |
    |a   |Apples   |
    |b   |Bananas  |
    |c   |Cucumbers|
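
Each Examples row is substituted into the step text before it is matched, so plain step definitions serve. A minimal sketch of what ours might look like; the ThingClient wrapper and the string comparison are assumptions, not our real code:

import static org.junit.Assert.assertEquals;

import cucumber.api.java.en.Given;
import cucumber.api.java.en.Then;
import cucumber.api.java.en.When;

public class ThingSteps {

  // Hypothetical wrapper around the API under test.
  interface ThingClient {
    void create(String code, String title);
    String get(String path);
  }

  private final ThingClient client;
  private String lastResponse;

  public ThingSteps(ThingClient client) {
    this.client = client;
  }

  @Given("^a thing with code (\\w+) and title (\\w+)$")
  public void aThingWithCodeAndTitle(String code, String title) {
    client.create(code, title);
  }

  @When("^calling \"([^\"]*)\"$")
  public void calling(String path) {
    lastResponse = client.get(path);
  }

  @Then("^expect json:$")
  public void expectJson(String expected) {
    // Cucumber passes the doc string as the final argument; a real
    // suite would compare the JSON structurally rather than as text.
    assertEquals(expected.trim(), lastResponse.trim());
  }
}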

This is very powerful and enables one to add a lot of tests very quickly.

If your test setup takes non-negligible time, such as creating a clean new database, then you will quickly create a test suite whose running time is heading towards hours.

We addressed this by only recreating the database when necessary. We know when that is: when a test alters the database in a way that would interfere with it, or with other tests, being run again. Initially I thought of this quality as idempotency, but I realised what I was actually talking about was repeatability. If your test creates a record with a field which has a unique constraint, then you will need to delete that record before the test can be repeated.

We can use the Scenario Outline functionality to repeat a test, if it is not already parameterized, by simply adding an unused parameter.


  Scenario Outline: Create a thing
    When putting "/rest/thing/b/Banana"
    Then expect json:
    """
      {"created":{"thing":{"code":"b", "title": "Banana"}}}
    """
    Then remove thing b
  Examples:
    |go |
    |one|
    |two|
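
The "remove thing b" step does the tidy-up which earns the @repeatable tag. A minimal sketch of its step definition, assuming a thing table keyed by the unique code column and a DataSource wired in by dependency injection:

import java.sql.Connection;
import java.sql.PreparedStatement;

import javax.sql.DataSource;

import cucumber.api.java.en.Then;

public class CleanupSteps {

  private final DataSource dataSource;

  public CleanupSteps(DataSource dataSource) {
    this.dataSource = dataSource;
  }

  @Then("^remove thing (\\w+)$")
  public void removeThing(String code) throws Exception {
    // Delete the row carrying the unique code so that the scenario
    // can run again without violating the unique constraint.
    try (Connection c = dataSource.getConnection();
        PreparedStatement ps = c.prepareStatement(
            "DELETE FROM thing WHERE code = ?")) {
      ps.setString(1, code);
      ps.executeUpdate();
    }
  }
}

With the create undone, the next run starts from the same state.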

We now know that this test can be run twice, so we are justified in tagging it @repeatable.

Initialisation when needed

If a test is @repeatable then we know that we do not need to create a clean database.

The Java snippets below drive this. DIRTY defaults to true, so the database is rebuilt before a scenario unless something has cleared the flag: a passing scenario tagged @repeatable clears it afterwards, and a scenario tagged @reuseCurrentDb can clear it before running.

  private static boolean DIRTY = true;

  // @Before hooks run lowest order first, so a scenario tagged
  // @reuseCurrentDb can clear the flag before resetDB looks at it.
  @Before(value = "@reuseCurrentDb", order = 1)
  public void reuseCurrentDb() {
    DIRTY = false;
  }

  @Before(order = 2)
  public void resetDB() {
    if (DIRTY) {
      try {
        if (CONNECTION == null) {
          CONNECTION = init(dataSource_.getConnection());
        }
        // DbUnit CLEAN_INSERT: wipe the tables, then reload the reference data set.
        importDB(CONNECTION, dataSet_, DatabaseOperation.CLEAN_INSERT);
      }
      catch (Exception e) {
        throw new RuntimeException("Import failed before test run", e);
      }
    }
  }

  // @After hooks run highest order first: afterAll pessimistically marks
  // the database dirty, then afterRepeatable clears the flag for passing
  // scenarios tagged @repeatable.
  @After(order = 2)
  public void afterAll() {
    DIRTY = true;
  }

  @After(value = "@repeatable", order = 1)
  public void afterRepeatable(Scenario scenario) {
    if (!scenario.isFailed()) {
      DIRTY = false;
    }
  }

In retrospect it looks as though we should have inverted this, with a tag @dirties, as almost every test is now tagged @repeatable. However, that does not take into account the development sequence: a test is first made to run once, and only then made @repeatable.
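
A sketch of how that inversion might look; this is hypothetical, we never made the change, and resetDB would stay exactly as it is:

  private static boolean DIRTY = true;

  // @After hooks run highest order first, so the general hook fires
  // before the tagged hook can override it.
  @After(order = 2)
  public void markClean(Scenario scenario) {
    // Presume the database is still usable unless the scenario failed.
    DIRTY = scenario.isFailed();
  }

  @After(value = "@dirties", order = 1)
  public void markDirty() {
    // Tests which write to the database opt in with @dirties.
    DIRTY = true;
  }

Under that scheme a new test would be presumed repeatable by default, which is exactly the wrong default while the test is still being written.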

Anticipated and Unanticipated Wins

This was intended to speed up our tests and did: from 48 minutes to 6 minutes.

The unintended win was that by focussing on cleanup we discovered three database tables which did not have correct deletion cascading.

This approach may surface other hidden problems, either with your tests or with the system under test; this is a good thing.

Update (2014-06-19)

By chipping away at the cascades, adding methods to capture ids from returned JSON, and adding deletion methods, one is able to reduce the database creations to one; any dirtying of the database now throws an exception.

761 Scenarios (761 passed)
10364 Steps (10364 passed)
1m40.157s
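
The id-capture and deletion steps are roughly of this shape; a sketch in which the step wording, the regex extraction and the ThingClient wrapper are all assumptions rather than our real code:

import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

import cucumber.api.java.en.Then;

public class CaptureSteps {

  // Hypothetical wrapper around the API under test.
  interface ThingClient {
    String delete(String path);
  }

  private final ThingClient client;

  // Ids captured from earlier responses, keyed by a name chosen in the feature.
  private final Map<String, String> ids = new HashMap<String, String>();

  private String lastResponse; // set by the step that calls the API

  public CaptureSteps(ThingClient client) {
    this.client = client;
  }

  @Then("^capture the id as \"([^\"]*)\"$")
  public void captureId(String name) {
    // Naive extraction of "id":"..." from the last JSON response;
    // a real implementation would use a JSON parser.
    Matcher m = Pattern.compile("\"id\"\\s*:\\s*\"([^\"]+)\"")
        .matcher(lastResponse);
    if (!m.find()) {
      throw new RuntimeException("No id in response: " + lastResponse);
    }
    ids.put(name, m.group(1));
  }

  @Then("^delete the thing captured as \"([^\"]*)\"$")
  public void deleteCaptured(String name) {
    // Undoing the create keeps the scenario repeatable without a rebuild.
    client.delete("/rest/thing/" + ids.get(name));
  }
}

The updated hooks below then treat any unexplained change in table row counts as a dirty database: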

  private static boolean DIRTY = true;

  // Baseline row count for each table, captured lazily on first use.
  private static Map<String, Integer> rowCounts_;

  private Map<String, Integer> rowCounts() {
    if (rowCounts_ == null) {
      rowCounts_ = setupTableRowCounts();
    }
    return rowCounts_;
  }

  @Before(order = 1)
  public void resetDB() {
    if (DIRTY) {
      try {
        if (I_DB_CONNECTION == null) {
          I_DB_CONNECTION = init(dataSource_.getConnection());
        }
        // DbUnit CLEAN_INSERT: wipe the tables, then reload the reference data set.
        importDB(I_DB_CONNECTION, dataSet_, DatabaseOperation.CLEAN_INSERT);
      }
      catch (Exception e) {
        throw new RuntimeException("Import failed before test run", e);
      }
    }
  }


  private Map<String, Integer> setupTableRowCounts() {
    Map<String, Integer> tableRowCounts = new HashMap<String, Integer>();
    try {
      // Restrict the listing to plain tables: no views or system tables.
      String[] normalTables = { "TABLE" };
      DatabaseMetaData m = I_DB_CONNECTION.getConnection().getMetaData();
      ResultSet tableDescs = m.getTables(null, null, null, normalTables);
      while (tableDescs.next()) {
        String tableName = tableDescs.getString("TABLE_NAME");
        tableRowCounts.put(tableName, I_DB_CONNECTION.getRowCount(tableName));
      }
      tableDescs.close();
    }
    catch (SQLException e) {
      throw new RuntimeException(e);
    }
    return tableRowCounts;
  }

  private void checkTableRowCounts(Scenario scenario) {
    try {
      DatabaseMetaData m = I_DB_CONNECTION.getConnection().getMetaData();
      String[] normalTables = { "TABLE" };
      ResultSet tableDescs = m.getTables(null, null, null, normalTables);
      StringBuilder problems = new StringBuilder();
      while (tableDescs.next()) {
        String tableName = tableDescs.getString("TABLE_NAME");
        // Assumes the schema has not changed since the baseline was captured.
        int old = rowCounts().get(tableName);
        int current = I_DB_CONNECTION.getRowCount(tableName);
        if (old != current) {
          problems.append(" Table " + tableName + " was " + old
              + " but now is " + current + "\n");
        }
      }
      tableDescs.close();
      if (problems.length() > 0) {
        throw new RuntimeException(
            "Scenario " + scenario.getName() + " :\n" + problems);
      }
    }
    catch (SQLException e) {
      throw new RuntimeException(e);
    }
  }


  @After(order = 2)
  public void afterAll(Scenario scenario) {
    if (scenario.isFailed()) {
      DIRTY = true;
    }
    else {
      DIRTY = false;
      try {
        // Any unexplained change in row counts marks the database dirty
        // and fails the scenario.
        checkTableRowCounts(scenario);
      }
      catch (RuntimeException e) {
        DIRTY = true;
        throw e;
      }
    }
  }