Anamaria Stoica

My Mozilla Blog

Posts Tagged ‘Wait Times’

Mozilla’s Build System


Mozilla’s Build System is a very cool distributed system run by Buildbot. The system automatically rebuilds and tests the tree every time something has changed.

The Build Infrastructure currently has around 1,000 machines grouped into 3 pools, each made up of several Build Masters and many Slaves:

  • Build Pool (handles builds triggered by all changes, except those going to Try):
    • 4 Build Masters
    • ~300 Slaves
  • Try Build Pool (handles Try builds):
    • 1 Build Master
    • ~200 Slaves
  • Test Pool (handles all tests, including Try)
    • 7 Test Masters
    • ~400 Slaves

How it works

The hg poller looks for new changes in the hg.mozilla.org repository every few minutes. The changes are picked up by the Build Scheduler Master, which creates Build Requests, one for each of the supported platforms. The Build Requests go into the Scheduler Database as pending. The Build Masters look for pending Build Requests and take them on only if there are free Slaves to assign them to.

Mozilla's Build System

As the builds complete, the Build Master updates their statuses in the Scheduler Database. Also, the Test Scheduler Master creates Test Build Requests for the corresponding tests.

Next, the Test Masters pick up the Test Build Requests and assign them to free Slaves. When the tests are complete, the Test Master updates their statuses in the Scheduler Database.

Each Build Master and Test Master controls its own set of Slaves.

Build Run Life Cycle

One push to mozilla-central, if successful, generates a total of 168 Build Requests (as of October 2010; subject to change), of which 10 are builds (one for each of the 10 supported platforms), 108 are unittests and 50 are talos tests. All these Build Requests make up a Build Run.

Each of the 10 platform builds comes with its own set of test requests. The tests are created only when the corresponding build completes, and only if it succeeds. This means that if any builds fail, some of the tests won’t be created, and the Build Run will have fewer than 168 Build Requests.

Build Run Life Cycle

Two very important measures in a Build Run’s life cycle are the Wait Time and the End to End Time.

The Wait Time measures how long Build Requests wait in the queue before starting. More specifically, it is the difference between the timestamp of the change that generated a Build Request and the timestamp at which that Build Request is assigned to a free slave (see the Build Run Life Cycle diagram above).

The End to End Time measures how long it takes for a Build Run to complete: the difference between the timestamp of the change that triggered the Build Run and the timestamp at which the last of the generated Build Requests ends (in other words, when all builds and tests are completed). See the Build Run Life Cycle diagram above.
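A minimal sketch of the two measures, assuming all timestamps are UNIX timestamps in seconds. The function and argument names are hypothetical and are not taken from buildapi:

```python
def wait_time(change_time, start_time):
    """Seconds between the change and the assignment to a free slave.

    Returns None for a request that is still pending (start_time is None).
    """
    if start_time is None:
        return None
    return start_time - change_time


def end_to_end_time(change_time, end_times):
    """Seconds between the change and the end of the last Build Request.

    end_times holds one end timestamp per Build Request in the Build Run;
    None marks a request that has not finished yet.
    """
    if not end_times or any(t is None for t in end_times):
        return None  # the Build Run has not completed
    return max(end_times) - change_time
```

The End to End Time is driven by the slowest request in the run, which is why a single test waiting a long time in the queue can stretch it well past the normal value.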

The normal End to End Time for mozilla-central is a little under 4 hours, but it increases considerably under heavy system load.

The Great Wall of Mac minis

The builds are done on a mix of VMs, 1U servers, Xserves and Mac minis, and all the testing is done on Mac minis.

The Great Wall of Mac minis is made up of a little over 400 of the Mac minis’ boxes, and is located by the Release Engineers’ desks in the Mountain View office. 😀

Introducing the Wait Times Report


The Wait Times Report was the first report I got to work on. The report measures how long jobs wait in the queue before starting. More specifically, it measures the difference between the timestamp of the change that generated a job and the timestamp at which that job is assigned to a free slave.

The report is generated per pool: build pool, try build pool or test pool. For details on the report contents, jump to Report Contents.

It also accepts a timeframe for the jobs (starttime and endtime as UNIX timestamps). If these parameters are not specified, the defaults are used: endtime is the server’s current timestamp and starttime is 24 hours earlier (i.e. the last 24 hours).

To see exactly how jobs are selected from the Scheduler Database, and what restrictions are applied on them, see Wait Times Query.

URL & Parameters

The Wait Times report can be accessed at the following URL:

<hostname>/reports/waittimes/<pool>?(<param>=<value>&)*

where <pool> := buildpool | trybuildpool | testpool

Parameters (all optional):

  • format – format of the output; allowed values: html, json, chart; default: html
  • starttime – start time, UNIX timestamp (in seconds); default: endtime minus 24 hours
  • endtime – end time, UNIX timestamp (in seconds); default: starttime plus 24 hours or current time (if starttime is not specified either)
  • int_size – break down results per time intervals of this length if value > 0; values are specified in seconds; default: 7200 (2 hours)
  • mpb – minutes per block, length of wait time block in minutes; default: 15
  • maxb – maximum block size; for wait times larger than maxb, group stats into the largest block, if maxb > 0; default: 0
  • num – the wait times for each block are represented either as the actual values (full) or percentages (ptg), relevant only if format=chart; allowed values: full, ptg; default: full
  • tqx – used by Google Visualization API (automatically appended by the library), relevant only if format=chart; default:
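The starttime and endtime defaults depend on each other, so here is a small sketch of how they could be resolved. The function name and the now argument are hypothetical (now only exists to make the example testable); this is not the buildapi code itself:

```python
import time

DAY = 24 * 60 * 60  # 24 hours, in seconds


def resolve_timeframe(starttime=None, endtime=None, now=None):
    """Return (starttime, endtime) with the documented defaults applied."""
    if endtime is None:
        if starttime is not None:
            # Only starttime given: the window ends 24 hours later.
            endtime = starttime + DAY
        else:
            # Neither given: the window ends at the current time.
            endtime = int(time.time()) if now is None else now
    if starttime is None:
        # endtime is known at this point: the window starts 24 hours earlier.
        starttime = endtime - DAY
    return starttime, endtime
```

Calling resolve_timeframe() with no arguments gives the last 24 hours, matching the default described earlier.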

Wait Time E-Mails

The Wait Time e-mails are sent by fetching and parsing the JSON format of these reports (found at <report_url>?<report_params>&format=json).
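As an illustration, the JSON report URL can be assembled from the pattern in URL & Parameters. This is only a sketch: the waittimes_url helper and the example hostname are hypothetical, and the shape of the returned JSON is not covered here.

```python
from urllib.parse import urlencode

POOLS = ("buildpool", "trybuildpool", "testpool")


def waittimes_url(hostname, pool, **params):
    """Build <hostname>/reports/waittimes/<pool>?<params> for one pool."""
    if pool not in POOLS:
        raise ValueError("unknown pool: %r" % (pool,))
    url = "%s/reports/waittimes/%s" % (hostname.rstrip("/"), pool)
    # Parameters left as None are dropped so the server-side defaults apply.
    # Sorting only keeps the generated URL stable.
    query = urlencode(sorted((k, v) for k, v in params.items() if v is not None))
    return url + "?" + query if query else url


# Hypothetical hostname, for illustration only.
print(waittimes_url("http://buildapi.example.org", "trybuildpool",
                    format="json", starttime=1281052800, endtime=1281139200))
```

The response body could then be fetched with any HTTP client and parsed with json.loads.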

Report Contents

The report measures how long jobs wait in the queue before starting, considering all jobs in one build pool, submitted in a specified timeframe (several other filters are applied too).

The report groups jobs’ wait times in blocks of mpb minutes, for example: 0-15, 15-30, 30-45,… are the first 3 blocks, where each block spans 15 minutes (mpb=15). For each of these blocks, the report counts how many jobs had their wait time in that interval.

Let’s say we have:
0-15  44 88%
15-30 5 10%
30-45 1 2%
In the report above, we have 50 jobs, of which 44 jobs waited between 0 and 15 minutes, representing 88% of all jobs registered, 5 jobs (10%) waited between 15 and 30 minutes and only 1 job (2%) waited between 30 and 45 minutes.
For a real, more detailed example, scroll down to Wait Times Example.
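The grouping can be sketched in a few lines of Python. This is not the WaitTimesReport code: the function names are hypothetical, wait times are assumed to be in seconds, and maxb is assumed to be expressed in minutes, with every longer wait falling into the block that starts at maxb.

```python
from collections import Counter


def block_start(wait_seconds, mpb=15, maxb=0):
    """Return the start (in minutes) of the block a wait time falls into."""
    start = int(wait_seconds // 60 // mpb) * mpb
    # Assumption: with maxb > 0, longer waits share the block starting at maxb.
    if maxb > 0:
        start = min(start, maxb)
    return start


def block_stats(wait_times, mpb=15, maxb=0):
    """Map each block start to (job count, percentage of all jobs)."""
    counts = Counter(block_start(w, mpb, maxb) for w in wait_times)
    total = sum(counts.values())
    return {b: (n, 100.0 * n / total) for b, n in sorted(counts.items())}


# 44 jobs waiting 5 minutes, 5 jobs waiting 20 minutes, 1 job waiting 40 minutes
waits = [5 * 60] * 44 + [20 * 60] * 5 + [40 * 60]
print(block_stats(waits))  # {0: (44, 88.0), 15: (5, 10.0), 30: (1, 2.0)}
```

The sample input reproduces the 88% / 10% / 2% split from the small table above.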

The same stats are also computed broken down by platform (linux, linux64, fedora, snowleopard, xp, … for the complete list see buildapi.model.util.PLATFORMS_BUILDERNAME).

Report Python Class
The Wait Times Report Python class can be found at buildapi.model.waittimes.WaitTimesReport.

Constructing the Report
The report is computed by calling buildapi.model.waittimes.GetWaitTimes. This function calls buildapi.model.waittimes.WaitTimesQuery, which handles the logic of selecting only the jobs of interest. See Wait Times Query post for further details.

Each job is added to the report one by one, and the report stats are updated at the same time.

Other Report Info:

  • unknownbuilders – excluded builders, like l10n
  • otherplatforms – platforms not found in known platforms, and not excluded
  • pending – jobs that have not started yet (still waiting)
  • has_no_changes – jobs that have no change, like nightly builds

Example

Wait Times for August 6th, 2010, for try build pool. The report online looks like this:

We can see the wait times were bad that day: only 58.84% (752) of the jobs waited between 0 and 15 minutes, 5.24% (64) waited between 15 and 30 minutes, and over 28% (362) waited more than 60 minutes (blue table on the left)! On the right, the numbers are broken down by platform (green tables).

The overall wait times (blue table on the left) are also displayed as charts broken down by time intervals (int_size = 2 hours):

Wait Times Aug 6th Trybuildpool - Percentage Stacked Chart

Chart 1 - Percentage Stacked Chart

Chart 1 displays the percentage of each of the wait time blocks per time interval. For example, in the 2:00-4:00 interval, around 50% of the jobs waited less than 15 minutes (blue), around 30% waited 15 to 30 minutes (red), 20% waited 30 to 45 minutes (orange), and no jobs waited more than 45 minutes. You can see that wait times started getting really bad at 2PM (14:00), and from 6PM to 8PM the majority of jobs waited more than 60 minutes (purple block)!
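The per-interval breakdown behind a chart like this can be sketched as follows. The function names and the (timestamp, wait_seconds) job representation are hypothetical, chosen only for illustration:

```python
from collections import Counter, defaultdict


def interval_breakdown(jobs, starttime, int_size=7200, mpb=15):
    """Count jobs per (time interval, wait block).

    jobs is an iterable of (timestamp, wait_seconds) pairs. The result maps
    each interval's start timestamp to a Counter keyed by block start minute.
    """
    table = defaultdict(Counter)
    for timestamp, wait_seconds in jobs:
        # Snap the job's timestamp down to the start of its interval.
        interval = starttime + (timestamp - starttime) // int_size * int_size
        block = wait_seconds // 60 // mpb * mpb
        table[interval][block] += 1
    return dict(table)


def percentages(counter):
    """Turn the block counts of one interval into percentages."""
    total = sum(counter.values())
    return {block: 100.0 * n / total for block, n in counter.items()}
```

The raw counts correspond to num=full and the output of percentages to num=ptg.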

Same data, but scaled by number of jobs:

Chart 2 - Stacked Chart

Chart 2 - Column Chart

See Also: Wait Times Query, Pushes Report

Written by Anamaria Stoica

October 13, 2010 at 12:38 pm

Posted in Buildapi, Mozilla


Wait Times Query


The Wait Times Query is very similar to the Build Request Query, except that it fetches jobs (it does not care about multiple builds of the same BuildRequest), selects a different subset of columns and applies several additional restrictions.

The base query, using SQLAlchemy, looks like this:

q = outerjoin(br, b, b.c.brid == br.c.id) \
    .join(bs, bs.c.id == br.c.buildsetid) \
    .join(s, s.c.id == bs.c.sourcestampid) \
    .outerjoin(sch, sch.c.sourcestampid == s.c.id) \
    .outerjoin(c, c.c.changeid == sch.c.changeid) \
    .select().with_only_columns([…])
# more restrictions
q = q.group_by(br.c.id)

For the meaning of the JOINs and the tables involved, see Build Request Query. In this post, I’ll continue by describing only the differences (placed at the restrictions comment in the query above):

1. Pool selection – fetching the jobs belonging only to a pool

This is done by keeping only the jobs claimed by masters in the selected pool (i.e. by looking at the values of the buildrequests.claimed_by_name column). There are currently 3 pools: ‘buildpool’, ‘trybuildpool’ and ‘testpool’, each having a different number of masters. For example, buildpool has 4 masters:

  • ‘production-master01.build.mozilla.org’
  • ‘production-master03.build.mozilla.org’
  • ‘buildbot-master1.build.scl1.mozilla.com:/builds/buildbot/build_master3’
  • ‘buildbot-master2.build.scl1.mozilla.com:/builds/buildbot/build_master4’

The masters in each pool are specified by BUILDPOOL_MASTERS in the buildapi.model.util module.

One exception is PENDING jobs, as they haven’t been claimed by any master yet (buildrequests.claimed_by_name is NULL). However, it is possible to tell which pool they belong to by looking at the value of buildrequests.buildername:

  • buildpool: br.claimed_by_name is NULL AND br.complete = 0 AND br.buildername NOT LIKE ‘Rev3%’ AND br.buildername NOT LIKE ‘% tryserver %’
  • trybuildpool: br.claimed_by_name is NULL AND br.complete = 0 AND br.buildername NOT LIKE ‘Rev3%’ AND br.buildername LIKE ‘% tryserver %’
  • testpool: br.claimed_by_name is NULL AND br.complete = 0 AND br.buildername LIKE ‘Rev3%’

(where br is buildrequests table)
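The three buildername rules for pending jobs can be restated as a small Python function. This is a sketch with a hypothetical function name, not code from buildapi. It assumes the caller has already checked that claimed_by_name is NULL and complete = 0, and it compares case-sensitively, which may differ from how LIKE behaves in the database:

```python
def pending_pool(buildername):
    """Guess the pool of a pending (unclaimed) build request from its builder name."""
    if buildername.startswith("Rev3"):   # LIKE 'Rev3%'
        return "testpool"
    if " tryserver " in buildername:     # NOT LIKE 'Rev3%' AND LIKE '% tryserver %'
        return "trybuildpool"
    return "buildpool"                   # neither pattern matches


# Made-up builder names, for illustration only.
print(pending_pool("Rev3 Fedora 12 mozilla-central opt test mochitests"))  # testpool
print(pending_pool("Linux tryserver build"))                               # trybuildpool
print(pending_pool("Linux mozilla-central build"))                         # buildpool
```

Checking the Rev3 prefix first mirrors the SQL conditions: both build pools require NOT LIKE ‘Rev3%’, so a Rev3 builder always lands in testpool.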

2. Timeframe filtering

Selects only the jobs with the change’s timestamp in the interval [starttime, endtime). The change’s timestamp is specified by the changes.when_timestamp column, except for the nightly builds, which have no changes. In those cases we’ll look at the buildrequests.submitted_at values (which are usually at most a few minutes later).

q = q.where(or_(c.c.when_timestamp >= starttime, br.c.submitted_at >= starttime))
q = q.where(or_(c.c.when_timestamp < endtime, br.c.submitted_at < endtime))
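To make the effect of the two conditions concrete, here is a plain-Python restatement of them. The function name is hypothetical, and None stands in for a NULL changes.when_timestamp (a job with no change), for which the SQL comparison never evaluates to true:

```python
def in_timeframe(when_timestamp, submitted_at, starttime, endtime):
    """Mirror the two where() conditions for a single job."""
    has_change = when_timestamp is not None
    after_start = (has_change and when_timestamp >= starttime) or submitted_at >= starttime
    before_end = (has_change and when_timestamp < endtime) or submitted_at < endtime
    return after_start and before_end
```

For a job without a change only submitted_at decides, as described above. As the conditions are written, a job whose change falls just before starttime but which was submitted inside the window is also selected.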

3. Rebuilds and forced builds exclusion

All rebuilds and forced builds are excluded from the stats. This is done by looking at the buildsets.reason column, and filtering out the values found in buildapi.model.util.WAITTIMES_BUILDSET_REASON_SQL_EXCLUDE.

4. Exclude buildernames that are not of interest, like fuzzers

The exclusion list is specified by buildapi.model.util.WAITTIMES_BUILDREQUESTS_BUILDERNAME_SQL_EXCLUDE.

See Also: Build Request Query, Pushes Query, Wait Times Report.

Written by Anamaria Stoica

October 13, 2010 at 12:30 am