Wednesday 15 May 2013

Mysterious Python pyc file corruption problems

Hello Developers,

I am trying to debug and fix a particularly vexing problem in Python that
manifests on Ubuntu in several different ways. I have a hypothesis about the
problem, but there are still some mysteries and I don't know how to reproduce
it. I think I can fix it, but I'm sending this message (and soon, another one
to python-dev with more technical details) in the hopes that you might have
other ideas about how it can happen, or have a reliable way to reproduce the
bug.

The problem can show up in any package, but Brian has started to collect a
number of bugs that all seem to be related (and I think Steve is going to open
a megabug to dupe them all to). The common way this manifests is a traceback
on an import statement. The actual error can be a "ValueError: bad marshal
data (unknown type code)" such as in LP: #1010077, or an "EOFError: EOF read
where not expected" as in LP: #1060842. We have many more instances of both
of these.

Both of these exceptions come from Python's marshal code (marshal.c). marshal
is the low-level serialization protocol used to cache Python byte code into
.pyc files, so both of these exceptions imply corrupt .pyc files, and in fact,
the workaround is always to essentially blow away the .pyc file and re-create
it. (Various different techniques can be used, but they all boil down to the
same thing.)
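
As a rough sketch of that workaround, assuming a current Python 3 (where
cached byte code lives under __pycache__; on 2.7 the .pyc sits next to the
source instead), removing and regenerating the .pyc for a single source file
could look like the following. The helper name rebuild_pyc is made up for
illustration and is not part of any Ubuntu or Python tool.

```python
import importlib.util
import os
import py_compile


def rebuild_pyc(source_path):
    """Delete the cached byte code for source_path and compile it afresh.

    Hypothetical helper: returns the path of the newly written .pyc file.
    """
    # Ask the import system where the .pyc for this source file belongs.
    pyc_path = importlib.util.cache_from_source(source_path)

    # Blow away the possibly corrupt cache file; it may not exist at all.
    try:
        os.remove(pyc_path)
    except FileNotFoundError:
        pass

    # Re-create it; doraise=True turns compile errors into exceptions.
    return py_compile.compile(source_path, cfile=pyc_path, doraise=True)
```

Calling rebuild_pyc("/path/to/module.py") for each affected module has the
same effect as the other techniques: the next import reads a freshly written
.pyc.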

Another commonality is that this bug -- so far -- has not been observed in any
Python 3.3 code, only 3.2 and earlier, including 2.7 and 2.6. If this holds
up, it's a crucial clue, because the import machinery was flipped over to the
pure-Python importlib in 3.3, and this includes an atomic renaming of the .pyc
file during write. All earlier versions of Python used a C-implemented
version of import, which opens the .pyc files exclusively (O_EXCL|O_CREAT) but
does *not* do an atomic rename.
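
To illustrate the difference, here is a minimal sketch of the
write-then-rename pattern. It is not importlib's actual code: the helper name
write_atomic is hypothetical, and os.replace requires Python 3.3 or later.

```python
import os
import tempfile


def write_atomic(path, data):
    """Write bytes to path so readers see the old or the new file, never half.

    Hypothetical helper illustrating the pattern, not importlib's code.
    """
    directory = os.path.dirname(os.path.abspath(path))
    # The temporary file must be in the same directory (same file system),
    # otherwise the final rename cannot be atomic.
    fd, tmp_path = tempfile.mkstemp(dir=directory,
                                    prefix=os.path.basename(path) + ".")
    try:
        with os.fdopen(fd, "wb") as tmp_file:
            tmp_file.write(data)
        # Atomically swap the finished file into place.
        os.replace(tmp_path, path)
    except BaseException:
        # Do not leave a stray temporary file behind on failure.
        try:
            os.unlink(tmp_path)
        except OSError:
            pass
        raise
```

With this approach a concurrent reader opens either the previous complete
file or the new complete file, since the target name never refers to a
partially written one.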

This leads me to hypothesize that the bug is due to an as yet unidentified
race condition during installation of Python source code, which is normally
when we automatically byte compile the source to .pyc files. This can happen
at package installation/upgrade time, or during a ubiquity run during a fresh
install. In each of these cases there *should* be only one process attempting
to write the .pyc, but my guess is that for some reason, multiple processes
are trying to do this, leaving .pyc files truncated or otherwise full of bogus
content. Even in Python < 3.3, it should not be possible to corrupt a .pyc when
only a single process is involved, due to the import lock and/or GIL. The
exclusive open of the .pyc file is clearly not enough of a protection in a
multiprocess situation.
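
The sketch below simulates, inside one process and in a fixed order, the kind
of interleaving hypothesized here. It is explicitly not a reproducer of the
Ubuntu bug, and the function and file names are invented for illustration. It
only shows that an exclusive open keeps out a second exclusive writer while
doing nothing for a reader that arrives before the write has finished.

```python
import marshal
import os
import shutil
import tempfile


def simulate_partial_read(payload):
    """Return (second_open_refused, bytes_seen, error_name).

    Hypothetical simulation: a reader looks at the file while a writer that
    opened it with O_EXCL|O_CREAT has written only half of the payload.
    """
    workdir = tempfile.mkdtemp()
    target = os.path.join(workdir, "example.pyc")  # made-up file name
    half = len(payload) // 2
    try:
        fd = os.open(target, os.O_WRONLY | os.O_CREAT | os.O_EXCL, 0o644)
        try:
            os.write(fd, payload[:half])  # writer is interrupted halfway

            # A second exclusive open of the same name is refused...
            try:
                os.close(os.open(target,
                                 os.O_WRONLY | os.O_CREAT | os.O_EXCL))
                second_open_refused = False
            except FileExistsError:
                second_open_refused = True

            # ...but a plain reader happily gets the partial contents.
            with open(target, "rb") as reader:
                bytes_seen = reader.read()

            os.write(fd, payload[half:])  # writer finishes too late
        finally:
            os.close(fd)
    finally:
        shutil.rmtree(workdir)

    # Unmarshalling what the reader saw fails the way a corrupt .pyc does.
    try:
        marshal.loads(bytes_seen)
        error_name = None
    except (EOFError, ValueError, TypeError) as exc:
        error_name = type(exc).__name__
    return second_open_refused, bytes_seen, error_name
```

Passing in marshal.dumps() of any compiled code object as the payload shows
the reader ending up with truncated data that marshal refuses to load.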

I think the list of errors we've seen is too extensive to chalk up to a
hardware bug, and I think the systems involved are modern enough to not be
subject to file system data loss. A missing fsync somewhere might still be
involved, though. I think it's doubtful that buggy remote file
systems (e.g. NFSv2) are involved. I could be wrong about any of that.

I have not succeeded in writing a standalone reproducer using Python 2.7.

So, the mystery is: what process on Ubuntu is exploiting holes in the
exclusive open and causing this problem?

Even without identifying the actual culprit(s), this upstream bug is probably
the root cause: http://bugs.python.org/issue13146

The bug is closed because the fix was applied to Python 3.3 (see above), but
it was not backported to earlier versions. I think it would not be that
difficult to backport it, and will talk to my fellow Python core devs to
determine whether and where it should get backported. It probably makes sense
to get it into 2.7, and maybe 3.2, but nothing else.

Either way, it almost certainly makes sense to get the fix into Ubuntu's
Python 2.7, at least SRU'ing it to Raring. I'm not sure whether it makes
sense to try to get such a fix into earlier Ubuntu releases or Python versions
on Ubuntu. The thing is: while the problem is mysterious and annoying, the
workaround is fairly simple, and I would love not to have to care about Python
2.6 or 3.2. :)

Thoughts are welcome, though remember that I'm going to engage python-dev on
the same topic (not cross-posted).

-Barry