Tuesday, 13 May 2014

Re: errors.ubuntu.com and upgrade crashes

Hi Matthew,

I had some extra time to think about this reply as I'm reading the list in
digest -- I will answer this mail in two parts, one about the concrete
issue at hand and a second one more generally about errors.ubuntu.com and what
I think is wrong with it on a more fundamental level for me as a package
maintainer (and thus user of it).

First on the concrete issue:

On Tue, May 13, 2014 at 10:24:20AM +0100, Matthew Paul Thomas wrote:
> You can. Set "Most common of these errors from" to "the date range",
> then enter the dates, for example 2013-10-16 to 2013-10-20. The result
> is not as you remember: over that period, bug 1219245 was not in the
> top 50 at all, whereas it was #42 for the equivalent period around the
> 14.04 release.

I tried that, just for fun, with the range 2014-04-13 to 2014-04-19 and got:

libreoffice-core all 25
libreoffice-core 14.04 120

and went back from 14.04 to all and got "the list of most common errors could
not be loaded". From that, a/ it is obvious that the numbers are not absolute,
but normalized in a way[1] that makes them incomparable, and b/ these errors
tend to hit me after two or three changes to the selection. TBH, that has made
me mistrust this data for all but the most basic searches, as I am never sure
whether I am seeing real data or old data from some stalled JSON request.

> > While there is no good reproduction scenario in the bug reports,
> > there is one report claiming it crahed "while installing a font"
> > and another it crashed "during an upgrade". This leaves me with the
> > suspicion, that the crash is actually people leaving LibreOffice
> > running during a release upgrade (which is brave and a nice vote of
> > confidence, but not really a supported scenario).
>
> "Supported" is a weasel word. I've never understood why Ubuntu lets
> people have apps running during an upgrade, because that has many
> weird effects. But Ubuntu *does* let people do that. And as long as it
> does, Ubuntu developers are responsible for the resulting errors.

This is hardly a LibreOffice issue -- any interactive application will have
such issues, esp. if it has any kind of state. Thus the solution would be for
the updater to look for such running applications and warn the user to close
them.

> > While those upgrade issues should be a concern too, as-is it seems
> > to me they are overblown in their importance and we dont have a
> > good way to look if they happen in regular production use after
> > upgrade.
>
> With respect, I don't see that you have any justification in deciding
> that this particular issue is "overblown". A crash in LibreOffice is
> just as bad whether it happens during an upgrade, during a full moon,
> or during the Olympic Games.

All bugs are created equal? Not quite, as on errors.ubuntu.com we are ranking
them esp. based on "frequency" -- with the implicit assumption that a high
frequency bug will keep its high frequency throughout the lifecycle of the
release. 'Upgrade-only bugs' break this assumption.

> If you think it's unfair somehow that
> apps are expected to keep running during upgrades, then fix the
> upgrade process so that apps can't run during the upgrade. Don't just
> filter out those crashes as if they aren't happening.

In a perfect world, I would cover all LibreOffice crashers with the same vigor.
In a perfect world, I also would have at least a 10-head team to do this (and
all the other things needed for LibreOffice). Unfortunately we are not living
in that world, but in one where I have to multiply frequency by severity and
affected users and take care of those scoring highest. The broken assumption
above is making errors.ubuntu.com much less useful for this.

> > ... e.g. trivially: mark crashers 48hours after the upgrade
> > as 'potentially an upgrade sideeffect' or somesuch?
> Probably not retroactively. But I imagine it would be fairly easy to
> add info to future error reports asking if do-release-upgrade (or
> whatever) was running at the time.

That would be very useful indeed. I assume the version data sent to
errors.ubuntu.com in these cases is wrong and poisoning the well anyway:
- the client will send an error report with the application version the package
manager reports to be installed
- the application that crashed will actually be an older, different one.

It would be ideal to fingerprint the crashed binary and compare it to the
version installed on disk, and skip reporting if those differ. But as a
fallback, the 'mark first 48 hours' thing would be a pragmatic solution to
prevent such wrong data from poisoning the stats.
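
To illustrate the idea, here is a minimal sketch in Python. It is not how the
error reporting client works today; the function names, the use of SHA-256 as
fingerprint and the three result labels are all assumptions made up for this
example. It skips the report if the fingerprints differ and only falls back
to the 48 hour rule when no fingerprints are available.

```python
import hashlib
from datetime import datetime, timedelta

# Assumption: crashes within this window after an upgrade get flagged.
UPGRADE_WINDOW = timedelta(hours=48)


def fingerprint(data: bytes) -> str:
    """Return a SHA-256 hex digest of the binary's contents."""
    return hashlib.sha256(data).hexdigest()


def classify_crash(crashed_fp, installed_fp, crash_time, upgrade_time):
    """Decide what to do with a crash report (hypothetical helper).

    crashed_fp / installed_fp: fingerprints of the crashed binary and of the
    binary currently on disk, or None if they could not be obtained.
    upgrade_time: time of the last release upgrade, or None if unknown.
    """
    if crashed_fp is not None and installed_fp is not None:
        # Preferred path: the crashed binary is not what is installed now,
        # so the reported package version would be wrong.
        return "skip" if crashed_fp != installed_fp else "report"
    if upgrade_time is not None:
        since_upgrade = crash_time - upgrade_time
        if timedelta(0) <= since_upgrade <= UPGRADE_WINDOW:
            # Fallback: we cannot tell, so only mark the report.
            return "potential-upgrade-side-effect"
    return "report"
```

With something like this, a report would only be dropped when we positively
know the version data is wrong, and merely marked when we can only suspect it.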


Now, apart from the concrete issue at hand, on to the general things about
errors.ubuntu.com and how it could become more useful for me.

> Unfortunately, this calculation goes to hell on release day. All of a
> sudden there are a gazillion new machines with the new version of
> Ubuntu on them. And of those, some fraction will report their first
> error. But that fraction are the only ones we know exist at all. So the
> denominator is much too low -- making the calculated error rate much
> too high.

Thus the nice charts we are plotting on the page are mostly useless: they help
me find the most common issues neither in my package nor in the distro in
toto[2].
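
To make the effect described in the quote concrete, here is a small sketch
with made-up numbers. It assumes the plotted value is simply "errors per
machine per 24 hours" and that on release day the tracker only knows about
machines that have already reported an error; none of the figures below are
real data.

```python
def error_rate(errors_per_day: int, machines: int) -> float:
    """Errors per machine per 24 hours."""
    if machines <= 0:
        raise ValueError("machines must be positive")
    return errors_per_day / machines


# Hypothetical release-day figures, for illustration only.
true_machines = 1_000_000   # machines actually running the new release
known_machines = 20_000     # machines the tracker has seen a report from
errors_today = 25_000       # error reports received on that day

true_rate = error_rate(errors_today, true_machines)
calculated_rate = error_rate(errors_today, known_machines)

# How much the too-small denominator inflates the plotted value.
inflation = calculated_rate / true_rate
```

In this toy case the plotted rate would be 50 times the real one, purely
because of the denominator and without any change in stability.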

> This is why the calculated error rate for every new release spikes on
> release day, and corrects itself over the next 90 days. It's also why
> the calculated error rate for 13.10 plummeted at the 14.04 release:
> lots of 13.10 machines were upgraded to 14.04, and so from the error
> tracker's point of view they're still 13.10 machines that suddenly
> became error-free.

So what are the charts actually telling us? To me they show more artifacts of
their normalization than useful information about the stability of a release:
- for the first 90 days, there is no good normalization -- that's already about
half of a six-month release cycle
- for the last month, people are already starting to migrate to the next
release, so the normalization goes off again (another 16% of the release cycle)
IMHO, _if_ errors.ubuntu.com plots anything, it should plot months 4 and 5 of
the life of each release on top of each other. Likely that chart would be
much more boring (and unfortunately rather too late for us to take action upon
it), but it is the only sensible chart to create from the data.

> If anyone would like to fix this, it's just a simple matter of
> programming. ;-) <http://launchpad.net/bugs/1069827>

I'm not exactly sure how normalizing this in a different way would help me
identify high frequency bugs faster, so fixing the charts is not too high on my
priority list. Things that would be much more interesting to me would be stuff
like:
- get the absolute counts for a LibreOffice version and the distro release for
a stacktrace and the estimated size of deployment
- find correlations between the counts of multiple stacktraces:
  - hinting at two bugs caused by the same root cause
  - if one trace has a good reproduction scenario and the other does not, this
  would be very helpful etc.
- much more stuff like that.
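
As an example of the kind of ad-hoc fiddling I have in mind for the
correlation idea: a plain Pearson correlation over the daily counts of two
stacktraces. This is only a sketch; the function name is made up, and whether
Pearson is the right measure for this data is exactly what I would want to
find out by playing with it.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation between two equally long series of daily counts."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two series of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        # A constant series carries no information; treat as uncorrelated.
        return 0.0
    return cov / sqrt(var_x * var_y)
```

Two traces whose daily counts rise and fall together would score close to 1,
which would be a hint (not a proof) that they share a root cause.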

Critical for that would be the ability to download the data and see what works
and what does not for identifying issues by fiddling around in some Python
script or ad-hoc data mangling in a spreadsheet. I certainly won't program a
solution "into the blind" if I haven't found it helpful in a few cases ad-hoc
at least.
Once I have proved to myself that a specific tactic/calculation provides me
with helpful information for cornering bugs, I might consider implementing a
generic solution in Errors directly -- but before that, I won't bother with the
huge tail of discussion, documentation and presentation that ensues.

So, I would be interested in making Errors more useful -- but a prerequisite
for that would be the ability to get a simple CSV file for subsets in the
form:

package, package version, distro release, stack trace id, day, crash count, est. deployment size

easily from the page so I can play with that and find out what works and what
does not in identifying bugs, without wrestling with a flaky JavaScript monster
that I cannot easily get data out of in a processable form. Once I have that
and I have the 10-head team to take care of the rest of the issues, I might
come back and look at making the plots look nicer on the webpage[3].
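
Just to show how little it would take once such a CSV exists, here is a rough
sketch that reads the columns proposed above and computes crashes per
estimated deployed machine for each stack trace. The underscored field names,
the absence of a header row and all sample values are assumptions invented
for this example; no such export exists today.

```python
import csv
import io
from collections import defaultdict

# Hypothetical column names, in the order proposed above.
FIELDS = [
    "package", "package_version", "distro_release", "stack_trace_id",
    "day", "crash_count", "est_deployment_size",
]

# Made-up sample rows, not real data.
SAMPLE = """\
example-pkg, 1.0, 14.04, trace-a, 2014-04-17, 120, 10000
example-pkg, 1.0, 14.04, trace-a, 2014-04-18, 80, 20000
example-pkg, 1.0, 14.04, trace-b, 2014-04-17, 10, 10000
"""


def crashes_per_machine(csv_text):
    """Return {stack_trace_id: total crashes / total machine-days}."""
    totals = defaultdict(lambda: [0, 0])  # trace id -> [crashes, machine-days]
    reader = csv.DictReader(
        io.StringIO(csv_text), fieldnames=FIELDS, skipinitialspace=True
    )
    for row in reader:
        entry = totals[row["stack_trace_id"]]
        entry[0] += int(row["crash_count"])
        entry[1] += int(row["est_deployment_size"])
    return {
        trace: crashes / machines
        for trace, (crashes, machines) in totals.items()
        if machines > 0
    }


rates = crashes_per_machine(SAMPLE)
```

From a dict like `rates` it is a short step to sorting traces, comparing
releases or feeding the numbers into a spreadsheet.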

Best,

Bjoern

[1] In the meantime I searched some more and found that the axis _should_ be
labeled "Errors per machine per 24 hours", but isn't.
https://wiki.ubuntu.com/ErrorTracker#errors.ubuntu.com

[2] Robert's post at
http://bobthegnome.blogspot.de/2014/05/errorsubuntucom.html confirms this,
it finds:
- people use their machines more on weekdays
- people don't run ubuntu+1 on Christmas
- people started to use the beta in March
- people migrated from 13.10 to 14.04 quickly
- people migrated from 12.04 to 14.04 slowly
- people don't migrate from 12.10 much
All of which are observations on deployment size and migration, none of it
is a measure of stability/crash frequency or helpful in identifying the most
painful bugs -- not even in relative terms.

[3] Well, actually having that data, I might come up with a better
normalization and contribute to bug 1069827 too. ;)
