chunker

Thin package implements Content Defined Chunking (CDC) based on a rolling Rabin Checksum.

Choosing a Random Irreducible Polynomial

The function Pol.getRandom() returns a new random polynomial of degree 53 for use with the chunker. The degree 53 is chosen because it is the largest prime below 64-8 = 56, so that the top 8 bits of an ulong can be used for optimising calculations in the chunker.

A random polynomial is chosen selecting 64 random bits, masking away bits 64..54 and setting bit 53 to one (otherwise the polynomial is not of the desired degree) and bit 0 to one (otherwise the polynomial is trivially reducible), so that 51 bits are chosen at random.

This process is repeated until Pol.irreducible returns true, then this polynomials is returned. If this doesn't happen after 1 million tries, the function returns an error. The probability for selecting an irreducible polynomial at random is about 7.5% ( (2^53-2)/53 / 2^51), so the probability that no irreducible polynomial has been found after 100 tries is lower than 0.04%.

Verifying Irreducible Polynomials

During development the results have been verified using the computational discrete algebra system GAP, which can be obtained from the website at http://www.gap-system.org/.

For filtering a given list of polynomials in hexadecimal coefficient notation, the following script can be used:

# create x over F_2 = GF(2);
x := Indeterminate(GF(2), "x");

# test if polynomial is irreducible, i.e. the number of factors is one;
IrredPoly := function (poly);
	return (Length(Factors(poly)) = 1);
end;;

# create a polynomial in x from the hexadecimal representation of the;
# coefficients;
Hex2Poly := function (s);
	return ValuePol(CoefficientsQadic(IntHexString(s), 2), x);
end;;

# list of candidates, in hex;
candidates := [ "3DA3358B4DC173" ];

# create real polynomials;
L := List(candidates, Hex2Poly);

# filter and display the list of irreducible polynomials contained in L;
Display(Filtered(L, x -> (IrredPoly(x))));

All irreducible polynomials from the list are written to the output.

Background Literature

An introduction to Rabin Fingerprints/Checksums can be found in the following articles:

Michael O. Rabin (1981): "Fingerprinting by Random Polynomials" http://www.xmailserver.org/rabin.pdf

Ross N. Williams (1993): "A Painless Guide to CRC Error Detection Algorithms" http://www.zlib.net/crc_v3.txt

Andrei Z. Broder (1993): "Some Applications of Rabin's Fingerprinting Method" http://www.xmailserver.org/rabin_apps.pdf

Shuhong Gao and Daniel Panario (1997): "Tests and Constructions of Irreducible Polynomials over Finite Fields" http://www.math.clemson.edu/~sgao/papers/GP97a.pdf

Andrew Kadatch, Bob Jenkins (2007): "Everything we know about CRC but afraid to forget" http://crcutil.googlecode.com/files/crc-doc.1.0.pdf

Modules

example
module chunker.example

Undocumented in source.

polynomials
module chunker.polynomials

Undocumented in source.

rabin
module chunker.rabin

Undocumented in source.

Members

Functions

byCDChunk
Chunker!R byCDChunk(R source, Pol pol, uint averageBits, size_t minSize, size_t maxSize, ubyte[] cbuf)
Chunker!R byCDChunk(R source, Pol pol, ubyte[] cbuf)

Constructs a new Chunker based on polynomial pol that reads from source. Params: source = An input range of chunks of bytes, such as that returned by File.byChunk or ubyte[].chunks or simply ubyte[].only. pol = An irreducible polynomial from F_2[X]. Use Pol.getRandom() to generate a random one. averageBits = Allows to control the frequency of chunk discovery: the lower averageBits, the higher amount of chunks will be identified. The default value is 20 bits, so chunks will be of 1MiB size on average. minSize = Minimum size of emitted chunks. Chunk boundaries that occur less than minSize bytes from the chunk start are ignored. maxSize = Maximum size of emitted chunks. If a chunk boundary is not encountered after maxSize bytes from the chunk start, it is forcibly split at that point. cbuf = A buffer to store chunk data. When null (default), a new buffer is allocated on construction of length maxSize. Returns: An instance of Chunker, an input range of Chunker.Chunk, which contains the chunk data, and the fingerprint value when it was cut.

Structs

Chunker
struct Chunker(R)

Splits content with Rabin Fingerprints.

Variables

chunkDefaultAverageBits
enum size_t chunkDefaultAverageBits;

Aim to create chunks of 20 bits or about 1MiB on average.

chunkDefaultMaxSize
enum size_t chunkDefaultMaxSize;

Default maximal size of a chunk.

chunkDefaultMinSize
enum size_t chunkDefaultMinSize;

Default minimal size of a chunk.

Meta

License

Copyright 2014 Alexander Neumann. All rights reserved. Use of this source code is governed by a BSD-style license that can be found in the LICENSE file.