push rather than pull based rsyncs for mirrors

by Karanbir Singh

Has anyone given thought to doing mirror updates via a 'push' rather than a 'pull' mechanism?

Basically, when new packages are available, the machines that receive updates would rsync to the machines further down the order, pushing the updates out. This would let us get more updates out faster, and avoid wasting CPU and I/O on repeated rsyncs that don't need to run. It would also make it a lot more viable to always run rsync with the '-c' option set.
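As a rough illustration of the push step described above, here is a minimal Python sketch that builds one rsync command per downstream mirror and reports which pushes failed. The hostnames, the rsync module name `centos-push`, and the function names are all assumptions made for the example; nothing here describes an existing CentOS setup.

```python
import subprocess

# Hypothetical downstream mirrors; real hostnames and rsync module names would differ.
DOWNSTREAM = [
    "mirror1.example.org::centos-push",
    "mirror2.example.org::centos-push",
]


def build_push_command(source_dir, target, checksum=True):
    """Return the rsync argv that pushes source_dir to one downstream target."""
    cmd = ["rsync", "-a", "--delete-after"]
    if checksum:
        # -c compares file contents by checksum instead of size and mtime.
        cmd.append("-c")
    # Trailing slash: copy the directory's contents, not the directory itself.
    cmd += [source_dir.rstrip("/") + "/", target]
    return cmd


def push_all(source_dir, targets, run=subprocess.run):
    """Push to every target in turn and return the targets whose rsync failed."""
    failed = []
    for target in targets:
        result = run(build_push_command(source_dir, target))
        if result.returncode != 0:
            failed.append(target)
    return failed
```

Calling `push_all("/srv/centos", DOWNSTREAM)` after a release would run the pushes one after another. A real version would probably run them in parallel with a timeout, so that one unreachable mirror does not hold up the rest.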

Yes, we could/would/should still leave some mechanism in place for people to set up new mirrors, and also to do pull-based rsyncs as and when they need.

Just something to think about at the moment, and comments would be very welcome.

- KB

10 comments

Comment from: Blokje [Visitor]
My first thought would be to create a nice script which pushes the packages out to the mirrors with rsync :-)

A quick Google search came up with the following: http://ma.tt/2006/04/cross-datacenter-file-replication/

Looks like it has some nice possibilities like:
- Spread
- rsync derivatives (lots of them)

But I will keep watching this post, because I'm really interested in the suggested options too :-)

14/May/2009 @ 06:13
Comment from: Sotiris Tsimbonis [Visitor] · http://stsimb.irc.gr/
ClamAV does push-mirroring .. See http://www.clamav.net/mirrors.html and doc/mirrors/ ..
14/May/2009 @ 07:18
Comment from: Roderick van Domburg [Visitor] · http://www.railscluster.nl
We use something similar for our cluster configuration management. We do both push and pull.

Our puppetmasters (servers) notify the puppets (clients) when they need to make a sync run. We have also configured the puppets to run every 15 minutes, just to be sure.
14/May/2009 @ 08:03
Comment from: Frank [Visitor]
But if you have an erroneous push script, all mirrors are affected. Even worse, an erroneous or malicious package within the repository would be spread widely to lots of mirrors in a 'short' time...
14/May/2009 @ 08:41
Comment from: Brian [Visitor] · http://directedge.us
There is a very well thought out analysis of push vs pull at infrastructures.org, specifically the push vs pull page (http://www.infrastructures.org/bootstrap/pushpull.shtml). The short answer is that client pull is far more desirable for many reasons: it is easier to maintain; if one server is down it doesn't freeze up the process during the timeout; etc.

You could avoid unnecessary rsyncs by simply having each client check a version number or something similar in a file, and run the rsync only if it differs.
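A minimal sketch of that check in Python, assuming the master publishes a small stamp file (a version number or timestamp) that the mirror fetches before deciding whether to rsync. The function names and the local state file are hypothetical, and how the stamp is fetched is left out.

```python
import os


def needs_sync(remote_stamp, state_path):
    """Return True when the master's stamp differs from the last one we synced."""
    try:
        with open(state_path) as fh:
            local_stamp = fh.read().strip()
    except FileNotFoundError:
        # No record of a previous sync, so run one.
        return True
    return remote_stamp.strip() != local_stamp


def record_sync(remote_stamp, state_path):
    """Remember the stamp of the sync that just completed."""
    tmp_path = state_path + ".tmp"
    with open(tmp_path, "w") as fh:
        fh.write(remote_stamp.strip() + "\n")
    # Rename into place so a crash never leaves a half-written state file.
    os.replace(tmp_path, state_path)
```

A mirror's cron job would fetch the stamp, call `needs_sync`, run rsync only when it returns True, and call `record_sync` only after rsync exits successfully, so a failed run is retried on the next pass.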
14/May/2009 @ 09:40
Comment from: Bill McGonigle [Visitor] · http://blog.bfccomputing.com
How about having an object on the master for which a Last-Modified HTTP header could be trusted? For a matter of a few hundred bytes every 1/5/15/60/whatever minutes per mirror and very low complexity on top of the existing infrastructure (one cron job) you could get most of what you'd like. Something fancy could probably be done with DNS if chasing ultimate efficiency is worthwhile.
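One way the cron job could look, sketched in Python with only the standard library: send a HEAD request for the trusted object and compare its Last-Modified header with the time recorded at the previous sync. The function names are assumptions for illustration, and the URL of the object is whatever the master chooses to expose.

```python
import urllib.request
from email.utils import parsedate_to_datetime


def fetch_last_modified(url, timeout=10):
    """Issue a HEAD request and return the raw Last-Modified header, or None."""
    request = urllib.request.Request(url, method="HEAD")
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.headers.get("Last-Modified")


def is_newer(header_value, last_seen):
    """Return True if the header's time is later than last_seen.

    last_seen is a timezone-aware datetime, or None if no sync has been recorded.
    A missing header is treated as "no change".
    """
    if header_value is None:
        return False
    stamp = parsedate_to_datetime(header_value)
    return last_seen is None or stamp > last_seen
```

Only the headers cross the wire on each poll, which keeps the cost to the few hundred bytes mentioned above; the rsync itself would run only when `is_newer` returns True.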
14/May/2009 @ 15:26
Comment from: Bill McGonigle [Visitor] · http://blog.bfccomputing.com
Thinking about this some more, it might be better to use Bittorrent than to come up with a new ad-hoc protocol.

Much of what needs to be done is already built into bittorrent. I'm unclear about how clients behave if a .torrent changes at a tracker, but there are enough fields in a .torrent to indicate a change (creation date, sha hash, etc.). Bittorrent clients already have the ability to change on-disk data; I've heard of folks 'repairing' .iso corruption using bittorrent with block checksumming. Probably something like createrepo could be updated to generate .torrent files.

Official mirrors could use keys or IP addresses to maintain a 1 to many to many-more structure, or some clever use of the WebSeeding algorithm could even reduce the load on masters and mirrors and/or increase the speed at which mirrors catch up.
21/May/2009 @ 22:59
Comment from: Wilfred [Visitor]
Why don't you look at Debian's system?
http://www.debian.org/mirror/push_mirroring
26/May/2009 @ 18:19
Comment from: Karanbir Singh [Member] · http://www.karan.org/
Wilfred.

Yes, I've looked at their setup - someone pointed it out a few days back. Somehow, I was totally unaware of them already doing this.

The more I think about it, the more I am convinced we really need to move to push-based mirroring, at least for the internal .centos.org mirrors initially. It might take a bit longer to set up the mechanism that external mirrors could opt into to have their mirrors updated automagically via push. Working out ACLs and access rights is going to be fun.
26/May/2009 @ 18:25
Comment from: m [Visitor]

From the CentOS user point of view I don't think this is a real issue.

Maybe a monitoring page for the mirrors would be nice, like qmail.org, but that's all.

Enterprise users must run their own internal test cycles anyway, so getting the rpms a few hours late does not matter at all, and I think home users can also wait a few hours.

You guys do excellent work, thanks a lot!
21/Jun/2009 @ 04:18
