Why we needed to reissue the CentOS-5.6/x86_64 ISOs
What was the issue:
There was an invalid rpm ( eclipse-ecj ) in the x86_64/os tree: the rpm payload ended with a premature EOF.
Early in the morning on the 7th, James Hogarth of BSkyB Entertainment reported to the QA team that there was an issue with the eclipse-ecj package shipped in the CentOS-5.6/x86_64 os tree. Stephen Walsh and Manuel Wolfshant tracked the issue back to the primary seed machine and confirmed that it was present not only in the os tree, but also on CD3 and DVD1 of the x86_64 distribution.
Reason for this issue:
The CentOS distribution is composed on the buildservices, and then transferred over to the distro-build machines where the installer is built. Automated tests run at both of these locations, including the rpm content tests ( rpm -K ), which check the md5 sums as well as the gpg signature. The distro had passed these tests at both locations. The output from the distro-build machines is the actual package tree, the installer code, and the ISOs for CD and DVD. This is transferred using rsync to the staging machine, from where we start the release process ( initially to QAMachines when in qa mode, or into the mirror.centos.org network when in release mode ). No tests are run at the staging machine. Packages are transferred one at a time, rather than as a whole tree ( mostly for legacy reasons ). It seems that the transfer for eclipse-ecj did not complete ( judging by the fact that it's OK on one side and not on the other ).
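A truncated transfer of this kind can be caught by comparing checksums of every file on both sides of the copy. The following is only a minimal sketch of that idea, not the tooling the CentOS team uses: the function names and the choice of SHA-256 are assumptions made for illustration, and in practice the manifest would be generated on the sending machine and shipped alongside the tree rather than both trees being read from one host.

```python
import hashlib
from pathlib import Path


def file_digest(path, algo="sha256", chunk_size=1 << 20):
    """Hash a file in chunks so large ISOs and rpms need not fit in memory."""
    h = hashlib.new(algo)
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def tree_manifest(root):
    """Map each file's path (relative to root) to its digest."""
    root = Path(root)
    return {
        p.relative_to(root).as_posix(): file_digest(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def compare_trees(source_root, dest_root):
    """Return (missing, mismatched) relative paths in dest compared to source."""
    src = tree_manifest(source_root)
    dst = tree_manifest(dest_root)
    missing = sorted(name for name in src if name not in dst)
    mismatched = sorted(
        name for name in src if name in dst and src[name] != dst[name]
    )
    return missing, mismatched
```

A package whose transfer stopped partway would show up in the `mismatched` list, because its digest on the destination differs from the one on the source.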
We had to rebuild the torrents, DVD isos, CD isos and update the CentOS-5.6/x86_64 distribution. There was no impact to CentOS-5.6/i386.
What did we do to fix things:
To address this issue, we had to issue a new set of ISOs and, since the package content was changing, rebuild the metadata, which in turn needed a complete rebuild of the ISOs ( but not the install tree ). Over the course of the morning, Fabian and Manuel were able to test the new tree, and our automated tests ran through for the ISOs.
There was also a lot of rollback work that needed to be done, including handling the torrent tracker, issuing new torrents, checking on the mirror network, etc. Much of this is done manually, which is the main reason things took almost 18 hrs to resolve.
Steps taken to ensure this problem does not happen again:
I've now moved a large number of tests to the staging machine as well, including the rpm tests. This adds 3 hours to the process, but it's a worthwhile safeguard.
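As a rough illustration of what an rpm test pass over the staging tree could look like, here is a small sketch that runs `rpm -K` on every package under a directory and collects the ones that fail. It is not the actual CentOS test suite: the function names are hypothetical, and it assumes that `rpm` is on the PATH and that `rpm -K` exits with a non-zero status when a package fails its digest or signature check.

```python
import subprocess
from pathlib import Path


def rpm_checksig(path):
    """Return True if `rpm -K` accepts the package (exit status 0)."""
    result = subprocess.run(
        ["rpm", "-K", str(path)], capture_output=True, text=True
    )
    return result.returncode == 0


def failed_packages(tree, check=rpm_checksig):
    """Run the check over every *.rpm under tree; return the paths that fail."""
    return [p for p in sorted(Path(tree).rglob("*.rpm")) if not check(p)]
```

The `check` parameter is only there so the walk can be exercised without real packages; on a staging machine one would call `failed_packages("/path/to/tree")` and refuse to release if the returned list is not empty.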
I hope this helps clear up things. Also, md5sum and sha1sum for ISOs as well as torrent files are published along with the torrent files and available on all mirrors. They will also be mentioned in the actual CentOS-5.6 Release Announcement. Everyone should check to make sure they get the right ones.
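For anyone who prefers to script the check, the sketch below compares a downloaded ISO against a published checksum list. It assumes the list uses the usual `sha1sum` / `md5sum` output layout ( a hex digest, whitespace, then the file name ); the function names are made up for this example, and running `sha1sum -c` against the published file does the same job.

```python
import hashlib


def parse_checksum_file(text):
    """Parse sha1sum/md5sum-style lines ("<hex digest>  <filename>") into a dict."""
    sums = {}
    for line in text.splitlines():
        parts = line.strip().split(None, 1)
        if len(parts) == 2:
            digest, name = parts
            # A leading "*" marks binary mode in sha1sum/md5sum output.
            sums[name.lstrip("*")] = digest.lower()
    return sums


def iso_matches(iso_path, expected_hex, algo="sha1"):
    """Hash the file in chunks and compare it with the published digest."""
    h = hashlib.new(algo)
    with open(iso_path, "rb") as fh:
        for chunk in iter(lambda: fh.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest() == expected_hex.lower()
```

Pass `algo="md5"` to check against the md5sum list instead of the sha1sum list.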
If you get the new .torrent files and drop them into the same place as the older ones, you should see most of your data being reused ( 30% on the DVD and 86% on the CDs ).
As a loyal CentOS user ( and also a curious system administrator ), I really appreciate being kept informed about what's going on in CentOS development :)
rsync or some other whole-tree transfer method with verification too?
> The output from the distro-build machines is the actual
> package tree, the installer code, isos for cd's and dvd.
> This is transfered using rsync to the staging machine
It is unclear how rsync could have damaged the RPMs inside the ISOs. Did you mean that the distro-build machines do NOT build the ISOs, and that the staging machine does?
Keep in mind that these machines are located all over the world! If someone wants to contribute a couple of beefy machines in the London area ( or can ship them here! ) please get in touch and we can try fixing that traffic issue a bit :)
I have access to rack space here in London, UK that has fairly good network connections. If we can get a few decent spec machines into that place, we can do most of the build, staging and seeding from a single place. Not only will that reduce these kinds of issues, or the potential for such issues, it will also help with the iterative build/test cycle that we need to run through.
At the moment, the machines being used for these roles are spread over the world, and network issues kick in. E.g. it can take up to 36 hrs for the first distro build to sync to the testing machines.
Quad or more cores ( e.g. E5504 or better ) with 16 GB or more of RAM ( ECC ) and 4 disk spindles for 500 GB or more of storage ( SATA is fine, SAS would be better! ) in the machines, and at least 2 gigabit Ethernet ports. Definitely rack mount machines ( 1U would be ideal ).
E.g. HP DL360 G6 or G7s would be nice to have.