Anamaria Stoica

My Mozilla Blog

Posts Tagged ‘Scheduler Database’

Mozilla’s Build System

with 12 comments

Mozilla’s Build System is a very cool distributed system run by Buildbot. The system automatically rebuilds and tests the tree every time something has changed.

The Build Infrastructure currently has around 1,000 machines grouped into 3 pools, each made up of several Build Masters and many Slaves:

  • Build Pool (handles builds triggered by all changes, except those going to Try):
    • 4 Build Masters
    • ~300 Slaves
  • Try Build Pool (handles Try builds):
    • 1 Build Master
    • ~200 Slaves
  • Test Pool (handles all tests, including Try):
    • 7 Test Masters
    • ~400 Slaves

How it works

The hg poller looks for new changes in the hg.mozilla.org repository every few minutes. The changes are picked up by the Build Scheduler Master, which creates Build Requests, one for each of the supported platforms. The Build Requests go into the Scheduler Database as pending. The Build Masters look for pending Build Requests and take them on only if there are free Slaves to assign them to.

Mozilla's Build System

As the builds complete, the Build Master updates their statuses in the Scheduler Database. Also, the Test Scheduler Master creates Test Build Requests for the corresponding tests.

Next, the Test Masters pick up the Test Build Requests and assign them to free Slaves. When the tests complete, the Test Master updates their statuses in the Scheduler Database.

Each Build Master and Test Master controls its own set of Slaves.
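
The rule that a master only takes on pending requests while it still has free slaves can be sketched in a few lines. This is only a toy model of the behaviour described above, not Buildbot's code; the function name assign_pending, the request labels and the slave names are all made up for illustration.

```python
from collections import deque


def assign_pending(pending, free_slaves):
    """Assign pending requests to free slaves, oldest request first.

    Returns (assignments, still_pending). A request is only taken on
    while a free slave remains; the rest stay pending.
    """
    queue = deque(pending)
    slaves = list(free_slaves)
    assignments = {}
    while queue and slaves:
        request = queue.popleft()
        assignments[request] = slaves.pop(0)
    return assignments, list(queue)


# Three pending requests, but only two free slaves (hypothetical names).
assignments, still_pending = assign_pending(
    ["linux build", "win32 build", "macosx64 build"],
    ["slave-a", "slave-b"],
)
```

With two free slaves and three requests, the third request simply stays in the queue until a slave frees up.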

Build Run Life Cycle

One push to mozilla-central, if successful, generates a total of 168 Build Requests (as of October 2010, but subject to change in the future), of which 10 are builds (one for each of the 10 supported platforms), 108 are unittests and 50 are talos tests. All these build requests make up a Build Run.

Each of the 10 platform builds comes with its own set of test requests. The tests are created only when the corresponding build completes, and only if it is successful. This means that if any builds fail, some of the tests won’t be created, and the Build Run will have fewer than 168 Build Requests.

Build Run Life Cycle

Two very important measures in a Build Run’s life cycle are the Wait Time and End to End Time.

The Wait Time measures how long Build Requests wait in the queue before starting. More specifically, it is the time difference between the timestamp of the change that generated the Build Request and the timestamp of when that Build Request is assigned to a free slave. (see Build Run Life Cycle diagram above)

The End to End Time measures how long it takes for a Build Run to complete. That is, the time difference between the timestamp of the change that triggered this Build Run and the timestamp of when the last of the generated Build Requests ends (in other words, when all builds and tests are completed). (see Build Run Life Cycle diagram above)

The normal End to End Time for mozilla-central is a little under 4 hours, but it increases considerably under heavy system load.
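
As a rough illustration of the two measures, here is a small sketch that computes them from plain epoch timestamps. The function names and all the sample numbers are hypothetical; the real reports read these values from the Scheduler Database.

```python
def wait_time(change_time, start_time):
    """Seconds a Build Request waited between the change and its start."""
    return start_time - change_time


def end_to_end_time(change_time, request_end_times):
    """Seconds from the change until the last Build Request of the run ends."""
    return max(request_end_times) - change_time


change_time = 1286000000  # hypothetical timestamp of the push
starts = [change_time + 120, change_time + 300]   # when each request started
ends = [change_time + 3600, change_time + 13800]  # when each request ended

waits = [wait_time(change_time, s) for s in starts]
e2e = end_to_end_time(change_time, ends)
```

The End to End Time is driven entirely by the slowest request in the run, which is why a single long-waiting test can stretch it well past the usual value.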

The Great Wall of Mac minis

The builds are done on a mix of VMs, 1U servers, xserves and Mac minis, and all the testing is done on Mac minis.

The Great Wall of Mac minis is made up of a little over 400 of the Mac minis’ boxes, and is located by the Release Engineers’ desks in the Mountain View office. 😀

Pushes Query

with 2 comments

Besides build requests and jobs, another very important piece of information that can be extracted from the Scheduler Database is the pushes.

The information about one push is spread among 3 tables: sourcestamps, sourcestamp_changes and changes.

The following SQLAlchemy query fetches all pushes in a specific time frame, and allows filtering and exclusion of specific branches:

from sqlalchemy import select, and_

# the three tables that together describe a push
s = meta.scheduler_db_meta.tables['sourcestamps']
sch = meta.scheduler_db_meta.tables['sourcestamp_changes']
c = meta.scheduler_db_meta.tables['changes']

# 1. join the three tables
q = select([s.c.revision, s.c.branch, c.c.author, c.c.when_timestamp],
           and_(sch.c.changeid == c.c.changeid, s.c.id == sch.c.sourcestampid))
# 2. one row per (change timestamp, branch)
q = q.group_by(c.c.when_timestamp, s.c.branch)

# 3. exclude branches - not of interest / fake
# 4. filter branches
# 5. timeframe

Query explained:

  1. JOIN between the sourcestamps, sourcestamp_changes and changes tables. The sourcestamps table contains the revision and branch, and the changes table contains the author and the change’s timestamp (when_timestamp).
  2. GROUP BY – next we group by the change’s timestamp (multiple builds of the same push will have the same when_timestamp) and branch (one push could affect multiple branches).
  3. Exclude branches that are not of interest like l10n and fake branches like addontester or the ones generated by unittests or talos tests (e.g. mozilla-central-win32-debug-unittest or mozilla-central-macosx64-talos).
  4. Fetch only pushes of requested branches (e.g. mozilla-central, try).
  5. Fetch only pushes having the change’s timestamp (c.when_timestamp) in the requested time frame, specified by starttime and endtime.
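
To make steps 3 to 5 concrete, here is a minimal sketch that uses plain SQL on an in-memory SQLite database instead of SQLAlchemy. The reduced table definitions, the sample rows, the fetch_pushes helper, the LIKE patterns used for the exclusion and the half-open time frame are all my assumptions for illustration; this is not buildapi's code.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sourcestamps (id INTEGER PRIMARY KEY, branch TEXT, revision TEXT);
    CREATE TABLE changes (changeid INTEGER PRIMARY KEY, author TEXT, when_timestamp INTEGER);
    CREATE TABLE sourcestamp_changes (sourcestampid INTEGER, changeid INTEGER);
""")
conn.executemany("INSERT INTO sourcestamps VALUES (?, ?, ?)", [
    (1, "mozilla-central", "abc123"),
    (2, "mozilla-central", "abc123"),  # second sourcestamp of the same push
    (3, "try", "def456"),
    (4, "mozilla-central-win32-debug-unittest", "abc123"),  # fake branch
])
conn.executemany("INSERT INTO changes VALUES (?, ?, ?)", [
    (10, "alice", 1000), (11, "bob", 2000), (12, "alice", 1000),
])
conn.executemany("INSERT INTO sourcestamp_changes VALUES (?, ?)", [
    (1, 10), (2, 10), (3, 11), (4, 12),
])


def fetch_pushes(conn, starttime, endtime, branches):
    """Return one row per push on the requested branches in [starttime, endtime)."""
    placeholders = ", ".join("?" for _ in branches)
    sql = f"""
        SELECT s.revision, s.branch, c.author, c.when_timestamp
        FROM sourcestamps s
        JOIN sourcestamp_changes sch ON s.id = sch.sourcestampid
        JOIN changes c ON sch.changeid = c.changeid
        WHERE s.branch NOT LIKE '%unittest'   -- 3. exclude fake branches
          AND s.branch NOT LIKE '%talos'
          AND s.branch IN ({placeholders})    -- 4. requested branches only
          AND c.when_timestamp >= ?           -- 5. time frame
          AND c.when_timestamp < ?
        GROUP BY c.when_timestamp, s.branch
        ORDER BY c.when_timestamp
    """
    return conn.execute(sql, [*branches, starttime, endtime]).fetchall()


pushes = fetch_pushes(conn, 0, 3000, ["mozilla-central", "try"])
```

The two mozilla-central sourcestamps collapse into a single push because they share the same change timestamp and branch, which is exactly what the GROUP BY in step 2 is for.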

See Also: Build Request Query, Wait Times Query

Written by Anamaria Stoica

October 15, 2010 at 4:51 am

Posted in Buildapi, Mozilla

Wait Times Query

with 3 comments

The Wait Times Query is very similar to the Build Request Query, except that it fetches jobs (it does not care about multiple builds of the same BuildRequest), selects a different subset of columns and adds several restrictions.

The base query, using SQLAlchemy, looks like this:

q = (outerjoin(br, b, b.c.brid == br.c.id)
     .join(bs, bs.c.id == br.c.buildsetid)
     .join(s, s.c.id == bs.c.sourcestampid)
     .outerjoin(sch, sch.c.sourcestampid == s.c.id)
     .outerjoin(c, c.c.changeid == sch.c.changeid)
     .select().with_only_columns([…])
     # more restrictions
     .group_by(br.c.id))

For the meaning of the JOINs and the tables involved, see Build Request Query. In this post, I’ll describe only the differences, which are the restrictions added where the query above has its restrictions comment:

1. Pool selection – fetching the jobs belonging only to a pool

This is done by filtering jobs claimed by masters in the selected pool (i.e. by looking at the values of the buildrequests.claimed_by_name column). There are currently 3 pools: ‘buildpool’, ‘trybuildpool’ and ‘testpool’, each having a different number of masters. For example, buildpool has 4 masters:

  • ‘production-master01.build.mozilla.org’
  • ‘production-master03.build.mozilla.org’
  • ‘buildbot-master1.build.scl1.mozilla.com:/builds/buildbot/build_master3’
  • ‘buildbot-master2.build.scl1.mozilla.com:/builds/buildbot/build_master4’

The masters in each pool are specified by BUILDPOOL_MASTERS in the buildapi.model.util module.

The one exception is PENDING jobs, as they haven’t been claimed by any master yet (buildrequests.claimed_by_name is NULL). However, it is possible to tell which pool they belong to by looking at the value of buildrequests.buildername:

  • buildpool: br.claimed_by_name IS NULL AND br.complete = 0 AND br.buildername NOT LIKE 'Rev3%' AND br.buildername NOT LIKE '% tryserver %'
  • trybuildpool: br.claimed_by_name IS NULL AND br.complete = 0 AND br.buildername NOT LIKE 'Rev3%' AND br.buildername LIKE '% tryserver %'
  • testpool: br.claimed_by_name IS NULL AND br.complete = 0 AND br.buildername LIKE 'Rev3%'

(where br is buildrequests table)
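
The same buildername rules can be written as a small Python function, which is handy for checking which pool a pending job would land in. This is only a sketch: the function name pending_pool and the sample buildernames are hypothetical, and the string checks are case-sensitive, while a database LIKE may not be.

```python
def pending_pool(buildername):
    """Guess the pool of an unclaimed (PENDING) request from its buildername."""
    if buildername.startswith("Rev3"):   # LIKE 'Rev3%'
        return "testpool"
    if " tryserver " in buildername:     # LIKE '% tryserver %'
        return "trybuildpool"
    return "buildpool"


# Hypothetical buildernames, one per pool.
pools = [
    pending_pool("Rev3 Fedora 12 mozilla-central opt test mochitests"),
    pending_pool("Linux tryserver build"),
    pending_pool("Linux mozilla-central build"),
]
```

The checks are ordered so that the 'Rev3%' test wins first, matching the testpool rule above, which does not look at 'tryserver' at all.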

2. Timeframe filtering

Keeps only the jobs with the change’s timestamp in the interval [starttime, endtime). The change’s timestamp is specified by the changes.when_timestamp column, except for the nightly builds, which have no changes. In those cases we’ll look at the buildrequests.submitted_at values (which are usually at most a few minutes later).

q = q.where(or_(c.c.when_timestamp >= starttime, br.c.submitted_at >= starttime))
q = q.where(or_(c.c.when_timestamp < endtime, br.c.submitted_at < endtime))
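
The intent of this filter, as described in the prose, can be expressed as a small predicate: take the change’s timestamp when there is one, fall back to the submission time otherwise, and test it against the half-open interval. The function name in_timeframe is hypothetical, and this is not a line-for-line translation of the two where() clauses above, which combine both columns with OR.

```python
def in_timeframe(when_timestamp, submitted_at, starttime, endtime):
    """True if the job's reference time falls in [starttime, endtime).

    The reference time is the change's timestamp, or the submission
    time for jobs without changes (e.g. nightlies).
    """
    reference = when_timestamp if when_timestamp is not None else submitted_at
    return reference is not None and starttime <= reference < endtime


regular_job = in_timeframe(100, 105, starttime=100, endtime=200)
too_late_job = in_timeframe(200, 205, starttime=100, endtime=200)
nightly_job = in_timeframe(None, 150, starttime=100, endtime=200)
```

Note that endtime itself is excluded, so consecutive time frames never count the same job twice.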

3. Rebuilds and forced builds exclusion

All rebuilds and forced builds are excluded from the stats. This is done by looking at buildsets.reason column, and filtering out values found in buildapi.model.util.WAITTIMES_BUILDSET_REASON_SQL_EXCLUDE.

4. Exclude buildernames that are not of interest, like fuzzers

The exclusion list is specified by buildapi.model.util.WAITTIMES_BUILDREQUESTS_BUILDERNAME_SQL_EXCLUDE.

See Also: Build Request Query, Pushes Query, Wait Times Report.

Written by Anamaria Stoica

October 13, 2010 at 12:30 am

Build Request Query

with 6 comments

Many of the reports (End to End Times Report, Build Run Report, TryChooser Report, Average Time per Builder Report, Builder Report) use BuildRequests as building blocks. In this post I will describe how BuildRequests are fetched from Buildbot’s scheduler database.

First of all, the scheduler database has the following schema:

Scheduler Database Schema

The information about one BuildRequest is spread among 6 tables: builds, buildrequests, buildsets, sourcestamps, sourcestamp_changes and changes. This means there is no way to fetch the data we need other than a big JOIN/OUTERJOIN across all 6 of these tables. This rather unfriendly query is necessary because the scheduler database was designed to work optimally with Buildbot’s internal mechanisms rather than for our current query needs.

The actual query, using SQLAlchemy, looks like this:

from sqlalchemy import outerjoin

# the six tables that together describe a BuildRequest
b = meta.scheduler_db_meta.tables['builds']
br = meta.scheduler_db_meta.tables['buildrequests']
bs = meta.scheduler_db_meta.tables['buildsets']
s = meta.scheduler_db_meta.tables['sourcestamps']
sch = meta.scheduler_db_meta.tables['sourcestamp_changes']
c = meta.scheduler_db_meta.tables['changes']

q = outerjoin(br, b, b.c.brid == br.c.id) \
    .join(bs, bs.c.id == br.c.buildsetid) \
    .join(s, s.c.id == bs.c.sourcestampid) \
    .outerjoin(sch, sch.c.sourcestampid == s.c.id) \
    .outerjoin(c, c.c.changeid == sch.c.changeid) \
    .select().with_only_columns([…]) \
    .group_by(br.c.id, b.c.id)

Query explained:

JOINS:

  1. OUTERJOIN (LEFT OUTER JOIN) between buildrequests and builds tables – OUTERJOIN is required because some of the BuildRequests might be PENDING or be CANCELLED (thus having no builds, i.e. entries in the builds table)
  2. JOIN with the buildsets table on the buildsetid column (bs.id = br.buildsetid) – we need to go through the buildsets table in order to link the BuildRequests to the sourcestamps and changes information
  3. JOIN with sourcestamps table – sourcestamps information
  4. OUTERJOIN with sourcestamp_changes on the sourcestampid column (s.id = sch.sourcestampid) – linking further along to changes. An OUTERJOIN is necessary instead of an INNER JOIN, because the nightly builds don’t have a revision number or any entries in the changes table
  5. OUTERJOIN with the changes table on the changeid column (sch.changeid = c.changeid) – OUTERJOIN again needed in order to include all BuildRequests belonging to nightly builds (see JOIN 4. above)

GROUP BY:
A final GROUP BY on the buildrequests.id and builds.id columns (GROUP BY br.id, b.id) is needed to capture multiple builds for the same BuildRequest. One BuildRequest might have multiple builds (usually very few, at most 2 or 3) if the builds have been retriggered or forced manually.
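
The effect of the first OUTERJOIN and of the two possible GROUP BYs is easiest to see on a tiny example. The sketch below uses plain SQL on an in-memory SQLite database with two cut-down tables; the table definitions and rows are made up for illustration and are much smaller than the real schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE buildrequests (id INTEGER PRIMARY KEY, buildername TEXT);
    CREATE TABLE builds (id INTEGER PRIMARY KEY, brid INTEGER, number INTEGER);
    INSERT INTO buildrequests VALUES (1, 'pending request'), (2, 'retriggered request');
    INSERT INTO builds VALUES (100, 2, 7), (101, 2, 8);
""")

# GROUP BY br.id, b.id: one row per build, plus one row for requests without builds.
per_build = conn.execute("""
    SELECT br.id, b.id
    FROM buildrequests br
    LEFT OUTER JOIN builds b ON b.brid = br.id
    GROUP BY br.id, b.id
    ORDER BY br.id, b.id
""").fetchall()

# GROUP BY br.id only: one row per request, however many builds it has.
per_request = conn.execute("""
    SELECT br.id, COUNT(b.id)
    FROM buildrequests br
    LEFT OUTER JOIN builds b ON b.brid = br.id
    GROUP BY br.id
    ORDER BY br.id
""").fetchall()
```

The pending request survives the join with a NULL build, which an INNER JOIN would have dropped, and the retriggered request shows up once per build in the first result.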

Selected table columns explained:

  • b.c.number
  • b.c.start_time
  • b.c.finish_time
  • br.c.id.label('brid')
  • br.c.buildername
  • br.c.submitted_at
  • br.c.claimed_at
  • br.c.claimed_by_name
  • br.c.complete
  • br.c.complete_at
  • br.c.results
  • br.c.buildsetid
  • bs.c.reason
  • s.c.id.label('ssid')
  • s.c.branch
  • s.c.revision
  • c.c.when_timestamp
  • c.c.author
  • c.c.comments
  • c.c.revlink
  • c.c.category
  • c.c.repository
  • c.c.project

BuildRequest statuses:

  • PENDING – the BuildRequest has not started yet / no Build Master claimed it yet:
NOT b.start_time AND NOT br.claimed_at AND NOT br.complete AND NOT br.complete_at AND NOT b.finish_time
  • RUNNING – the BuildRequest is running (a Build Master claimed the BuildRequest already), and has not finished yet:
b.start_time AND br.claimed_at AND NOT br.complete AND NOT br.complete_at AND NOT b.finish_time
  • COMPLETE – the BuildRequest was completed without any internal errors or external interruptions (i.e. not CANCELLED / INTERRUPTED):
b.start_time AND br.claimed_at AND br.complete AND br.complete_at AND b.finish_time
  • CANCELLED – the BuildRequest was cancelled (i.e. it never got to start):
NOT b.start_time AND NOT br.claimed_at AND br.complete AND br.complete_at AND NOT b.finish_time
  • INTERRUPTED – the build was interrupted (e.g. slave disconnected) and Buildbot retriggered the build:
b.start_time AND br.claimed_at AND br.complete AND br.complete_at AND NOT b.finish_time
  • MISC – should never happen
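
The status rules above amount to a lookup on five flags, which can be sketched as follows. The function name buildrequest_status is hypothetical, and treating None and 0 alike as "not set" is my assumption about how the NOT conditions are meant to be read.

```python
def buildrequest_status(start_time, claimed_at, complete, complete_at, finish_time):
    """Map the five columns to a BuildRequest status name."""
    flags = (bool(start_time), bool(claimed_at), bool(complete),
             bool(complete_at), bool(finish_time))
    table = {
        (False, False, False, False, False): "PENDING",
        (True, True, False, False, False): "RUNNING",
        (True, True, True, True, True): "COMPLETE",
        (False, False, True, True, False): "CANCELLED",
        (True, True, True, True, False): "INTERRUPTED",
    }
    # Any other combination should never happen.
    return table.get(flags, "MISC")


# A request that was claimed and started, but has not finished yet.
status = buildrequest_status(start_time=1286000300, claimed_at=1286000290,
                             complete=0, complete_at=None, finish_time=None)
```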

BuildRequest results (buildrequests.results):
This column specifies how the BuildRequest execution went, if it is completed. Naturally, the PENDING and RUNNING ones will have NO_RESULT:

  • -1 (NULL) – NO_RESULT
  • 0 – SUCCESS
  • 1 – WARNINGS
  • 2 – FAILURE
  • 3 – SKIPPED
  • 4 – EXCEPTION
  • 5 – RETRY
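
When reading rows back in Python, the codes above can be turned into names with a small lookup. The names RESULTS and result_name are mine, and mapping both NULL (None) and -1 to NO_RESULT follows the first list entry.

```python
RESULTS = {
    0: "SUCCESS",
    1: "WARNINGS",
    2: "FAILURE",
    3: "SKIPPED",
    4: "EXCEPTION",
    5: "RETRY",
}


def result_name(code):
    """Translate a buildrequests.results value into its name."""
    if code is None or code == -1:
        return "NO_RESULT"
    return RESULTS[code]


names = [result_name(code) for code in (None, 0, 2, 5)]
```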

BuildRequest reasons (buildsets.reason):
The reason for the build might be the scheduler (the normal case), the nightly scheduler, a rebuild or a forced build:

  • scheduler
  • nightly, e.g. ‘The Nightly scheduler named ‘Linux x86-64 mozilla-central nightly’ triggered this build’
  • rebuild, e.g. ‘The web-page ‘rebuild’ button was pressed by ‘<unknown>’: redo for slave disconect (nthomas)’
  • force build, e.g. ‘The web-page ‘force build’ button was pressed by ‘jhford’: hg poller is busted’

BuildRequest wait time:
The time the BuildRequest waited, from when the change was created (or, for nightlies, which have no changes, from when the request was submitted) until the build started (was assigned to a free slave):

change_time := c.when_timestamp, if c.when_timestamp != NULL
            := br.submitted_at, otherwise
WAIT_TIME   := b.start_time - change_time, if b.start_time != NULL AND change_time != NULL
            := 0, otherwise

BuildRequest duration:
The time from when the change was created (or, for nightlies, which have no changes, from when the request was submitted) until the build was complete, whether successful or not:

change_time := c.when_timestamp, if c.when_timestamp != NULL
            := br.submitted_at, otherwise
DURATION    := br.complete_at - change_time, if br.complete_at != NULL AND change_time != NULL
            := 0, otherwise

BuildRequest run time:
The actual run time of the build:

RUN_TIME := DURATION - WAIT_TIME
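
The three per-request measures can be sketched in Python as below, taking the run time as the duration minus the wait time. The function names and the sample timestamps are hypothetical; None stands in for SQL NULL.

```python
def change_time(when_timestamp, submitted_at):
    """The change's timestamp, or the submission time when there is no change."""
    return when_timestamp if when_timestamp is not None else submitted_at


def wait_time(start_time, when_timestamp, submitted_at):
    """Seconds from the change until the build started, or 0 if unknown."""
    ct = change_time(when_timestamp, submitted_at)
    if start_time is None or ct is None:
        return 0
    return start_time - ct


def duration(complete_at, when_timestamp, submitted_at):
    """Seconds from the change until the request completed, or 0 if unknown."""
    ct = change_time(when_timestamp, submitted_at)
    if complete_at is None or ct is None:
        return 0
    return complete_at - ct


def run_time(start_time, complete_at, when_timestamp, submitted_at):
    """Duration minus wait time."""
    return (duration(complete_at, when_timestamp, submitted_at)
            - wait_time(start_time, when_timestamp, submitted_at))


# A regular build: change at 1000, started at 1300, completed at 4900.
regular = run_time(start_time=1300, complete_at=4900,
                   when_timestamp=1000, submitted_at=1010)
# A nightly: no change, so the submission time (1010) is used instead.
nightly_wait = wait_time(start_time=1300, when_timestamp=None, submitted_at=1010)
```

When both the start and completion times are known, the change time cancels out and the run time is simply complete_at minus start_time.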

See Also: Wait Times Query, Pushes Query

Written by Anamaria Stoica

October 4, 2010 at 10:06 pm