This weekend, I was hacking Obnam, my backup program, adding support for using the rsync algorithm for backing up only the changed parts of modified files. I was reminded once again how beautiful rsync is. There also seems to be very few implementations of it in pure Python, which is a shame, since all the C implementations get bogged down in memory management and other trivia.

So I thought I'd whip up a very simplistic version of it just in case someone finds it useful. Code is in bzr at http://code.liw.fi/obsync/bzr/trunk/, and crucial parts are excerpted below.

A note on the implementation: I've aimed at simplicity, not speed. I've opted to use zlib.adler32 for the weak checksum rather than the rolling checksum from the original paper: the rolling checksum was fairly slow to compute in pure Python, so it seemed just as good to use the library function.

The code in the bzr branch also includes a main program to allow a UI slightly similar to the rdiff program, and a shell script to verify that the code actually works, for at least one test case.

(I'm not providing an explanation of how rsync works: I'm throwing out this blog entry in a few minutes' break. Read the paper, or Wikipedia, or ask me to explain in a later blog entry if you're lazy.)

block_size = 4096

def signature(f):
    while True:
        block_data = f.read(block_size)
        if not block_data:
            break
        yield (zlib.adler32(block_data), hashlib.md5(block_data).digest())

class RsyncLookupTable(object):

    def __init__(self, checksums):
        self.dict = {}
        for block_number, c in enumerate(checksums):
            weak, strong = c
            if weak not in self.dict:
                self.dict[weak] = dict()
            self.dict[weak][strong] = block_number

    def __getitem__(self, block_data):
        weak = zlib.adler32(block_data)
        subdict = self.dict.get(weak)
        if subdict:
            strong = hashlib.md5(block_data).digest()
            return subdict.get(strong)
        return None

def delta(sigs, f):
    table = RsyncLookupTable(sigs)
    block_data = f.read(block_size)
    while block_data:
        block_number = table[block_data]
        if block_number:
            yield (block_number * block_size, len(block_data))
            block_data = f.read(block_size)
        else:
            yield block_data[0]
            block_data = block_data[1:] + f.read(1)

def patch(outputf, deltas, old_file):
    for x in deltas:
        if type(x) == str:
            outputf.write(x)
        else:
            offset, length = x
            old_file.seek(offset)
            outputf.write(old_file.read(length))

I was looking at zsync/rsync in python, and your blog post is the only good example/resource short of a source tarball I can find.

I was wondering about speed though, and you said: "I've opted to use zlib.adler32 for the weak checksum rather than the rolling checksum from the original paper: the rolling checksum was fairly slow to compute in pure Python, so it seemed just as good to use the library function."

I wrote a little test script to compare the speed of a rolling checksum (defined by the rsync paper) and your alder32 version, and I get the rolling checksum to run about 1.5x faster for a block_size of 1024 (9.3s vs 15s for a 20mb file on my machine). If i upped the block_size to 4096, the rolling checksum ran more than 3x faster (9.5s vs 30+s).

The test functions I used are below, but I didn't check the results for correctness, just that they ran, so I may be doing something stupid:

def rolling(f, block_size=1024):
    M = 2**16
    data = f.read(block_size)

    # initial block cs
    idata = struct.unpack('{0}B'.format(len(data)), data)
    k = 0
    l = block_size-1
    a = sum(idata)%M
    b = sum(map(operator.mul, xrange(l+1, 0, -1), idata))%M

    # return cs
    yield a + (b << 16)

    # update cs
    rdata = f.read(block_size)
    while rdata:
        for i in xrange(len(rdata)):
            dk = ord(data[i])
            dl = ord(rdata[i])

            a = (a - dk + dl)%M
            b = (b - (l - k + 1)*dk + a)%M

            k += 1
            l += 1

            yield a + (b << 16)
        data = rdata
        rdata = f.read(block_size)        

def adler32(f, block_size=1024):
    data = f.read(block_size)

    yield zlib.adler32(data)

    rdata = f.read(block_size)
    while rdata:
        for i in xrange(len(rdata)):
            yield zlib.adler32(data[i:]+rdata[:i])
        data = rdata
        rdata = f.read(block_size)

Also, using Cython I was able to get the rolling version down to 2s on my machine, or ~ 10MB/s

Comment by Brian Wed Feb 15 02:34:42 2012