Anamaria Stoica

My Mozilla Blog

Mozilla-Central End to End Times Values Distribution (October)

with 5 comments

In my previous post End to End Times Report I started talking about E2E times: defining what they are and then looking at monthly E2E time averages for the past 3 months for mozilla-central and try.

I also kept mentioning that the normal E2E time for mozilla-central is a little under 4 hours, but varies greatly upwards with the system load. Now, how much exactly do the E2E times deviate from the normal times, and in what way?

In order to have a better grasp of what the E2E times values distribution might be, I plotted the histogram of all E2E times for mozilla-central registered in October (more precisely October 1-20, 2010). And here’s how it looks after removing the outliers:

Mozilla-central E2E times histogram without outliers (October 1-20)

The histogram above represents the distribution of the E2E times among bins of 15 minutes.
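
For anyone curious how such a plot could be reproduced, here is a minimal sketch using matplotlib; the `e2e_hours` values and the 10-hour outlier cutoff are placeholders for illustration, not the actual report data.

```python
import numpy as np
import matplotlib.pyplot as plt

# Hypothetical E2E times in hours; the real values come from the build database.
e2e_hours = np.array([3.1, 3.6, 3.8, 3.9, 4.1, 4.7, 6.5, 9.2, 26.0])

trimmed = e2e_hours[e2e_hours <= 10]      # exclude the very large outliers (> 10h)
bins = np.arange(0, 10.25, 0.25)          # 15-minute (0.25h) bins from 0 to 10h

plt.hist(trimmed, bins=bins)
plt.xlabel("End to End time (hours)")
plt.ylabel("Number of Build Runs")
plt.title("mozilla-central E2E times (outliers excluded)")
plt.show()
```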

As it turns out, the histogram looks pretty nice. Most values (66.38%) fall in the 3h – 4h 25m normal time interval, with a high peak in the 3h 45m – 4h subinterval.

However, there is a long tail of values between 5 and 10 hours. Even though the number of values in each bin is small, summed together they represent around 15% of the Build Runs.

The values smaller than 3h (10.92%) are build failures and exceptions. The very large outliers (>10h) were excluded from the histogram; they represent 7.18% of all Build Runs, with 4.02% between 10-25h and 3.16% between 25-255h (see the plot below with outliers included).

Time Interval    Percentage    Comments
0 – 3h           10.92%        Failures
3h – 4h 25m      66.38%        Normal times
4h 25m – 10h     15.52%        Long tail of large values
>10h             7.18%         Outliers (10h – 25h: 4.02%; >25h: 3.16%)

Branch           mozilla-central
Timeframe        ~October 1-20, 2010
No. values       348
Max value        255h 51m
Mean value       7h 12m
Median value     3h 42m
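
For reference, the table’s percentages and summary statistics could be recomputed from the raw E2E times along the following lines; this is a minimal sketch with hypothetical values standing in for the 348 real Build Runs.

```python
import numpy as np

# Hypothetical E2E times in hours, standing in for the real dataset.
e2e_hours = np.array([2.5, 3.7, 3.8, 3.9, 4.0, 4.2, 6.5, 12.0, 30.0])

intervals = [
    ("0 - 3h (failures)",        0.0,            3.0),
    ("3h - 4h 25m (normal)",     3.0,            4.0 + 25 / 60),
    ("4h 25m - 10h (long tail)", 4.0 + 25 / 60,  10.0),
    (">10h (outliers)",          10.0,           float("inf")),
]

for label, low, high in intervals:
    share = np.mean((e2e_hours >= low) & (e2e_hours < high)) * 100
    print(f"{label}: {share:.2f}%")

print(f"No. values:   {len(e2e_hours)}")
print(f"Max value:    {e2e_hours.max():.2f}h")
print(f"Mean value:   {e2e_hours.mean():.2f}h")
print(f"Median value: {np.median(e2e_hours):.2f}h")
```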

 

Here’s the histogram re-plotted, but this time with all the outliers included:

See Also: End to End Times Report, Mozilla’s Build System.


Written by Anamaria Stoica

November 22, 2010 at 8:02 am

5 Responses


  1. I’m not sure it makes sense to call those long-running builds “outliers”, unless we know those are somehow broken.

    If you want to trim down the graph to make it more readable, I’d scoop up all of that long tail into a single “more than X” bar… That ~22% of our builds take 4+ hours isn’t something we should dismiss as a quirk!

    Justin Dolske

    November 22, 2010 at 6:17 pm

  2. Anamarias, can you please elaborate on the outliers and the long tail of large values?

    @dolske if I am not mistaken, many of those really off jobs can be testing machines that get stuck, cannot kill buildbot and, therefore, cannot stop. Once every few days we notice those machines running for days and we have to jump in manually and kill them. These are some of our growing pains this year and we have been tackling them as we go. Better tools like what Anamarias has been working on will actually help us spot these issues and investigate them.

    armenzg

    November 24, 2010 at 9:54 am

  3. I agree with dolske. Sounds like that long tail is something we ought to be investigating and fixing. If we have things legitimately taking 4+ hours, that’s wasting a huge amount of CPU cycles. Fixing just one 8 hour build wins you enough CPU time for two normal builds!

    Ted Mielczarek

    November 24, 2010 at 10:09 am

  4. Thanks for your questions, they made me look closer at what’s going on with the large e2e values.

    The charts in the post were simple plots of the actual numbers to find their distribution, without taking into consideration what they represent. Now comes the 2nd step: adding semantics to them, and seeing what can be improved.

    So, by looking at all the individual Build Runs in each of the 3 interval categories (upper outliers, lower outliers, long tail of large values), I found the following:

    1. Upper outliers (25h – 255h), 3.16%.
    There were 11 out of 348 build runs, which is 3.16%.

    All of these Build Runs had only one unittest or talos job that ran for a very long time, and was either restarted, rebuilt or interrupted. All of the other Build Requests ran for at most approx. 2h 30m, with wait times under 25m (many under 10m).

    I would say these qualify as outliers, but at the same time they point out an important problem: some build requests occasionally run for an extremely long time, and we need to investigate why that happens (it might be machine failure) or to have a better tool to detect them in due time so they can be retriggered earlier. They do look like jobs getting stuck, as @armenzg mentioned.

    You can see the top examples here [0].

    2. Lower outliers (10h – 25h), 4.02%.

    Here, part of the build runs fall into the first category (one test that goes crazy), part of them only appear to run so long because the E2E Report is incomplete, see [1]. The good news is that individual builds get unique ids now (Bug 570814), and the report will be fixed soon. Another part of them falls into the 3rd category (see below).

    3. Long tail of large e2e values (4h 25m – 10h), 15.52 %.

    The reason these build runs take so long is complex: it’s a mixture of long wait times, rebuilds, interrupted build requests, distribution of build requests, and individual builds/tests that take longer than average (but not extremely long, as in the previous categories).

    This category definitely needs more analysis, and indeed, it would be nice to have all of these fall within the target e2e time (4 hours).

    [0] – https://spreadsheets.google.com/pub?key=0AmuNzjBMFEbedDkzUzg4ODFZS2kzaG5EVmNmaE5BUVE&hl=en&single=true&gid=0&output=html
    [1] – https://anamariamoz.wordpress.com/2010/11/15/end-to-end-times-report/#Problem

    Anamaria Stoica

    November 25, 2010 at 8:57 am

  5. Anamarias, thanks a lot for taking the time to investigate this and explain it back to us.

    This brings a few thoughts to mind:
    * If we added up the max timeout of each step of a job, do we have jobs that take longer than this MAX_TIMEOUT? If so, those jobs would need investigation, as the timeouts should have prevented them from running over that maximum cumulative timeout (see the sketch after this list).
    * Do we have a way to be notified when a job is running longer than expected? Not that I know of.
    * For jobs running over the estimated time, what is preventing buildbot from finishing them?
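
    A minimal sketch of that first check, with made-up job names, runtimes and per-step timeouts (not real buildbot data):

    ```python
    # Flag jobs whose total run time exceeds the sum of their per-step max timeouts.
    # All job names and numbers below are hypothetical.
    jobs = {
        "linux-opt-unittest": {"runtime": 5 * 3600,  "step_timeouts": [1200, 3600, 1800]},
        "win32-talos":        {"runtime": 30 * 3600, "step_timeouts": [1200, 7200, 1800]},
    }

    for name, info in jobs.items():
        max_cumulative_timeout = sum(info["step_timeouts"])
        if info["runtime"] > max_cumulative_timeout:
            overrun_h = (info["runtime"] - max_cumulative_timeout) / 3600
            print(f"{name}: ran {overrun_h:.1f}h past its cumulative timeout -- needs investigation")
    ```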

    Interesting, from [0] it seems that there are successful runs “Running too long” that actually finished the job and reported green. If they were talos jobs, I wonder how they could have correct numbers reported. I wonder if we could somehow take screenshots to determine if there is a dialog. Another idea would be to spot this type of job and, once we have the timestamp on each line of a log (I think there is a bug filed for this), determine at which moment the job got stuck.
    Unfortunately some of these jobs might just finish because one of us accesses the machine and closes a dialog.

    Anamarias, do you think we could have a report just for jobs that take longer than a certain number of hours? That would give us good data to help us investigate further. Could we also exclude these types of jobs from your initial report? Maybe anything over 10 hours, with a note in the report.

    Armen Zambrano G.

    November 25, 2010 at 10:48 am

