Where to begin with Deduplication

De-duplication in itself is easy to understand: it optimises storage capacity usage by eliminating duplicated data. However, the devil is in understanding the different technologies, techniques and implementations in the market, and in relating these to customers' specific needs.

Instead of storing data multiple times, de-duplication enables the data to be stored once and uses that single instance as a reference. The techniques used to do this vary. For instance, we could look for complete files which are the same, and create a single instance only when they match each other exactly.

Alternatively, we could look at files which are broadly similar (for example, revisions of a draft document) and keep a single instance of a master file, saving only the byte-level differences between this and subsequent files. So which of these approaches is best? As always, the answer is not straightforward.
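To make the byte-level idea concrete, here is a minimal sketch in Python. It assumes the master file and its revision are both small enough to hold in memory, and it uses the standard library's `difflib.SequenceMatcher` as a stand-in for whatever differencing engine a real product would use. The names `make_delta` and `apply_delta` are purely illustrative.

```python
from difflib import SequenceMatcher


def make_delta(master: bytes, revision: bytes) -> list:
    """Describe `revision` as copy/insert operations against `master`."""
    ops = []
    matcher = SequenceMatcher(None, master, revision, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            # These bytes already exist in the master: store a reference only.
            ops.append(("copy", i1, i2))
        elif tag in ("replace", "insert"):
            # These bytes are new in the revision: they must be stored.
            ops.append(("insert", revision[j1:j2]))
        # "delete" needs no entry: the bytes are simply not copied.
    return ops


def apply_delta(master: bytes, delta: list) -> bytes:
    """Rebuild the revision from the master file plus its stored delta."""
    out = bytearray()
    for op in delta:
        if op[0] == "copy":
            out += master[op[1]:op[2]]
        else:
            out += op[1]
    return bytes(out)


master = b"The quick brown fox jumps over the lazy dog."
revision = b"The quick brown fox jumped over the lazy dog!"
delta = make_delta(master, revision)
new_bytes = sum(len(op[1]) for op in delta if op[0] == "insert")
```

Here `new_bytes` counts only the bytes that differ from the master, which is what would actually be written to disk for the revision. The references back to the master carry their own small overhead, which this sketch ignores.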

If we look at the first of these, working at a file level rather than a byte level, there are well-established techniques such as CAS (Content Addressable Storage). With this approach, the contents of the file are put through a mathematical mincer (a hash function), and the end product is an identifier, unique for all practical purposes, which is attached to the file.

If exactly the same file exists somewhere else in the system, the mathematical mincer will produce exactly the same identifier, indicating a duplicate file which can be reduced to a single instance.
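The file-level approach can be sketched in a few lines of Python. This is an illustration rather than how any particular CAS product works: it assumes the "mathematical mincer" is the SHA-256 hash from the standard `hashlib` module, and the `ContentStore` class and its in-memory dictionaries are hypothetical.

```python
import hashlib


class ContentStore:
    """Toy content-addressed store: identical contents are kept only once."""

    def __init__(self):
        self.blobs = {}  # identifier -> file contents (the single instance)
        self.names = {}  # file name -> identifier (the reference)

    def put(self, name: str, data: bytes) -> str:
        # The identifier is derived purely from the file's contents.
        digest = hashlib.sha256(data).hexdigest()
        # Store the contents only if this identifier has not been seen before.
        self.blobs.setdefault(digest, data)
        self.names[name] = digest
        return digest

    def get(self, name: str) -> bytes:
        return self.blobs[self.names[name]]


store = ContentStore()
id_a = store.put("report.doc", b"quarterly figures")
id_b = store.put("copy of report.doc", b"quarterly figures")
```

Both names end up pointing at the same identifier, so the contents are held once however many copies users create.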

With this approach, however, every time a spelling mistake is corrected or punctuation is added to a document, a new identifier is created and both versions of the document are stored in full.
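The cost of that behaviour is easy to demonstrate. The sketch below again assumes SHA-256 as the identifier function. The helpers `identifier` and `stored_bytes` are illustrative names, and `stored_bytes` simply totals the size of every distinct file a file-level scheme would have to keep.

```python
import hashlib


def identifier(data: bytes) -> str:
    """Content-derived identifier (SHA-256 is an assumption of this sketch)."""
    return hashlib.sha256(data).hexdigest()


def stored_bytes(files: list) -> int:
    """Bytes kept by file-level de-duplication: one copy per distinct identifier."""
    unique = {identifier(f): len(f) for f in files}
    return sum(unique.values())


draft = b"De-duplication is easy to understand"
fixed = b"De-duplication is easy to understand."  # one full stop added

same_id = identifier(draft) == identifier(fixed)
exact_copies = stored_bytes([draft, draft])
two_versions = stored_bytes([draft, fixed])
```

Two exact copies cost the same as one file, but adding a single full stop gives a different identifier, so `two_versions` is the combined size of both documents.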