Why metadata sometimes lags packages for CentOS Updates

by Karanbir Singh

Sometimes the CentOS-5/updates repository gets into a state where people can see the updated packages in the repo using a browser (e.g. at http://mirror.centos.org/centos/5/updates/i386/RPMS/), but when they try to get the updates on their machine, yum is unable to 'see' these packages. The reason is that while the physical rpm packages have been pushed out to the mirrors, the yum metadata has not yet been updated, and yum relies on this metadata to work out what packages are available.
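If you want a rough way to see how fresh a mirror's metadata is, the sketch below might help. It is only an illustration: the function name is made up for this post, and it assumes the repository publishes a repodata/repomd.xml file with one <timestamp> element per line, which is how createrepo normally writes it. The URL in the comment is likewise an assumption, based on the repodata directory sitting next to RPMS/.

```shell
#!/bin/bash
# Sketch only: latest_metadata_timestamp is a hypothetical helper name.

# Read a repomd.xml document on stdin and print the newest <timestamp>
# value (seconds since the epoch, fractional part dropped).
# Assumes each <timestamp> element sits on its own line.
latest_metadata_timestamp() {
    sed -n 's:.*<timestamp>\([0-9]*\)[^<]*</timestamp>.*:\1:p' \
        | sort -n \
        | tail -n 1
}

# Possible usage (assumed URL layout, needs network access):
#   curl --silent http://mirror.centos.org/centos/5/updates/i386/repodata/repomd.xml \
#       | latest_metadata_timestamp
```

If the rpm you can see in the browser is newer than the timestamp printed here, that mirror's metadata has most likely not caught up yet.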

I'll try to briefly explain why this happens.

The CentOS mirror network is set up in layers. The first two levels of this network constitute the core and are not publicly available. The third layer is what most people see at http://mirror.centos.org/, along with the large mirror networks like http://mirrors.kernel.org/centos/, http://ftp.heanet.ie/mirrors/centos/, http://www.mirrorservice.org/sites/mirror.centos.org/ etc. (there are over 100 of them!). The fourth layer is made up of the smaller, but still very important, mirrors that sync from the third-layer machines. The fifth and final layer consists of the private, internal, company-wide mirrors run by admins within their own networks.

When a new update is issued, the updated content is pushed into layer one. From there it makes its way down to layer two and then on to layer three. At this point the content is publicly visible, although it might not yet be on layer four and five machines. There are quite a few more complexities involved in the process, but two points worth noting are that (1) the whole process is automated and (2) the 'check and refresh' frequency is fairly high, e.g. content moves from layer one to layer two in almost real time.

So why the metadata lag? We want to make sure an update does not 'break' any process, so we make sure the packages are visible and available on a relatively large number of mirrors before people and machines start requesting them. Hence the metadata lag. Here is a snippet of code from the release-to-production script, which should make this easier to follow:

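# Generate the yum metadata locally
do_genMetadata
# First pass: push only the rpms to the layer one (seed) machines
rsync -Pvar --include="*.rpm" --exclude="*" * "$SeedHost:$SeedPath"
# Block until the rpms are visible on a cross section of public mirrors
do_seedCheck
# Second pass: push everything, this time including the metadata
rsync -Pvar * "$SeedHost:$SeedPath"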

What happens here is that the metadata is generated, but only the rpms are pushed up to the layer one machines. The 'do_seedCheck' function then blocks the process until it can see the rpms publicly visible on a random cross section of mirrors (looping every five minutes). Once that mark is reached, it returns and the regular rsync, which this time includes the metadata, is run. As soon as this metadata is visible, yum will start pulling the updated packages for users.
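To give a feel for what a check like 'do_seedCheck' could look like, here is a minimal sketch. It is not the code from the real script: the function names, the SEED_CHECK_INTERVAL variable, the idea of passing in a required mirror count, and the use of curl to probe each mirror are all assumptions made for illustration.

```shell
#!/bin/bash
# Sketch only: every name here is hypothetical, not taken from the real script.

# probe_mirror URL: succeed if URL answers an HTTP HEAD request.
# Kept as a separate function so it can be swapped out for testing.
probe_mirror() {
    curl --silent --fail --head --max-time 30 --output /dev/null "$1"
}

# count_visible RPM_PATH MIRROR...: print how many of the given mirror
# base URLs are already serving RPM_PATH.
count_visible() {
    local rpm_path="$1"
    local seen=0
    local mirror
    shift
    for mirror in "$@"; do
        if probe_mirror "$mirror/$rpm_path"; then
            seen=$((seen + 1))
        fi
    done
    echo "$seen"
}

# seed_check RPM_PATH NEEDED MIRROR...: block until at least NEEDED of the
# given mirrors serve RPM_PATH, re-checking every SEED_CHECK_INTERVAL
# seconds (default 300, i.e. five minutes).
seed_check() {
    local rpm_path="$1"
    local needed="$2"
    shift 2
    until [ "$(count_visible "$rpm_path" "$@")" -ge "$needed" ]; do
        sleep "${SEED_CHECK_INTERVAL:-300}"
    done
}
```

The caller would pick the random cross section itself, for example by shuffling a list of mirror URLs and passing the first few to seed_check. A call such as `seed_check some/new.rpm 8 $sample` (with a made-up path and threshold) would then return only once eight of the sampled mirrors have the rpm.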

The other thing to keep in mind is that this is not a one-off occurrence. The yum metadata *always* lags the rpms by some time; let's say X seconds. The value of X depends on how long it takes for those tests to pass plus how long it takes the metadata to work its way down the mirror chain. In the majority of cases the lag is just a few minutes, e.g. the PyXML and gd updates from earlier today went through in less than 20 minutes. On the other hand, there are times when the lag can reach many hours, e.g. an OpenOffice.org update could delay metadata for up to 8 to 12 hours, since the mirrors need to shift almost 1.3GiB per machine just for that one update.

Can we speed things up a bit? Absolutely. One way, which we are hoping to trial in the next few weeks, is to move at least some of the core mirror machines to a push-style update process rather than the existing pull mechanism. That would drastically reduce the amount of time machines sit out of sync. But more on that in another post, another day!

- KB

3 comments

Comment from: Amos Shapira [Visitor]
Thanks for the explanation.

Any thoughts on using bittorrent or friends to push out the files more efficiently?
04/Jan/2010 @ 22:53

Comment from: Karanbir Singh [Member] · http://www.karan.org/
Amos,

BitTorrent isn't more efficient; it's actually a lot more wasteful in this context, both in terms of mirror machine resources and in terms of the bandwidth being used and shared.

- KB
05/Jan/2010 @ 07:18

Comment from: Oren [Visitor]
Cool. It's very interesting to read about your work on CentOS.
06/Jan/2010 @ 05:06
