Saturday 8 October 2022

don't ignore SIGBUS failures [Was, Re: +1 Maintenance Report]

Hi Bryce,

On Wed, Oct 05, 2022 at 06:37:06PM -0700, Bryce Harrington wrote:
> ### scikit-learn ###

> This package hits a 'bus error' on armhf. This issue seems not to have
> a bug report in Debian, however bus errors are mentioned on both Deb:
> #1008369 and #1003165. Both bugs have extensive discussion, however it's
> unclear if a fix is in sight; main suggestions appear to be to drop the
> architecture or skip the test case. Those may not be this same issue
> though; it seems "Bus error" is a generic error that's been happening
> for other tests on armhf. Upstream is also aware of armhf tests being
> in a really bad state.

> A previous +1'r also suggested skipping the failing test, so I've gone
> ahead and added test_dist_metrics to appropriate sections of both
> d/rules and d/t/python3, and uploaded this as 1.1.2+dfsg-5ubuntu1.
> Since the testsuite bails as soon as it hits the first bus error, it's
> possible there will be subsequent tests failing the same way, in which
> case maybe just keep adding excludes.

> I've filed update-excuse bug LP: #1991621 with the above info.

LP: #1991621 is now resolved with the upload of a fix for the unaligned
access error. (Per your comments above, you somehow uploaded version
1.1.2+dfsg-5ubuntu1, which was already older than the 1.1.2+dfsg-6 that had
been in -proposed since September 12?) But I think this is still worth
responding to.

You mention concurring with a previous +1'er about skipping the failing
test. But I see no evidence here of analysis of whether this is a buggy
test vs. buggy code under test, or the impact to users of this code failing.

The purpose of +1 Maintenance is to unstick packages from -proposed - but
that does not mean getting new versions of packages into the release pocket
at all costs! We should not be ignoring indicators that a new version of the
package is going to be buggier than the status quo for users (on some
architecture, or in general).

Bus errors on armhf don't cause build failures in Debian because Debian's
armhf builds are done on older ARMv7 CPUs that don't have a problem with
unaligned access; so these failures are only noticed ad hoc and after the
fact. In Ubuntu, our armhf builds all happen virtualized on top of ARMv8
CPUs that DO care about alignment. I would argue that bus errors ought to
be considered RC bugs on Debian, because nowadays users are more likely to
be running armhf on ARMv8 hardware. But for Ubuntu, they DEFINITELY should
be blockers, because it's not just our builders that are affected (which
alone makes it difficult to have such code in the archive, causing failures
in arbitrary reverse-dependency stacks at build time); the vast majority of
systems supported today by Ubuntu's armhf port are also ARMv8 and, I think,
affected by bus errors on unaligned access.
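
To make the failure mode concrete, here is a minimal Python sketch; it is
not taken from scikit-learn, and the names is_aligned, read_double and
packed are hypothetical. In C or Cython code, the risky pattern is casting
a pointer at a misaligned offset to double* and dereferencing it, which is
what can raise SIGBUS on hardware that enforces alignment. Copying the
bytes out first (memcpy in C, struct.unpack_from below) works at any offset.

```python
import struct

# Size of a double in the "standard" struct layout (no padding, 8 bytes).
DOUBLE_SIZE = struct.calcsize("=d")


def is_aligned(offset: int, size: int = DOUBLE_SIZE) -> bool:
    """Return True if offset is a multiple of size (natural alignment)."""
    return offset % size == 0


def read_double(buf: bytes, offset: int) -> float:
    """Read a double at any byte offset by copying its bytes out of buf.

    This is the moral equivalent of memcpy() into a local double in C:
    no typed load is ever issued directly against the misaligned address.
    """
    (value,) = struct.unpack_from("=d", buf, offset)
    return value


# A 1-byte header followed immediately by a double, with no padding,
# so the double starts at offset 1 and is not naturally aligned.
packed = struct.pack("=Bd", 7, 3.5)

if not is_aligned(1):
    # A direct typed load at this offset is the pattern to avoid in C/Cython.
    value = read_double(packed, 1)
```

The check itself is cheap, so a test that exercises packed or sliced
buffers at odd offsets is one way to catch this class of bug on builders
that do not fault.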

So I don't think we should be skipping tests in packages showing that our
armhf binaries will SIGBUS on certain CPUs, in order to get the packages to
migrate.

Now, that doesn't mean that we need to invest extraordinary effort into
*fixing* such alignment errors on armhf. +1 Maintenance folks are not
expected to all be deep architecture experts / C experts and fix all issues
like this. Sometimes the right answer is instead to remove support for the
architecture from the package (by talking to the archive admins and getting
the binaries from the old version removed). Especially for packages that
exist to support intensive numeric operations (like scikit-learn), this is
probably a reasonable answer for armhf if it can be done without a mess of
reverse-dependency handling (... which is NOT true for scikit-learn),
because armhf is not an interesting architecture for heavy math stuff in
production. But sometimes the answer is just to leave it be in -proposed
until someone can come along and fix it.

Also, Seth makes a very important point on the bug that unaligned access
impacts *performance* on a lot more CPUs than those which generate SIGBUS:

https://bugs.launchpad.net/ubuntu/+source/scikit-learn/+bug/1991621/comments/1

Thanks,
--
Steve Langasek                   Give me a lever long enough and a Free OS
Debian Developer                   to set it on, and I can move the world.
Ubuntu Developer                                   https://www.debian.org/
slangasek@ubuntu.com                                     vorlon@debian.org