Sep 12 2014

Data Deduplication with FUJITSU ETERNUS CS800, Pt. 2: The Importance of Block Lengths


In recent years, data deduplication has become one of the most popular, and at the same time most easily misunderstood, storage technologies. Misled by industry buzz, customers heavily invested in deduplication-ready platforms, only to find that the new systems could hardly meet the promised data reduction rates of 20:1 or more. Part two of our four-part blog series explains in more detail how the different approaches to deduplication work and why it is better to use variable rather than fixed block lengths.

Different Approaches, Different Block Lengths
The fundamental deduplication methodology is identical across the board, but its various implementations may differ with regard to important details. Perhaps the most important of these differences pertains to the block lengths that storage solution vendors use in their deduplication products.

Generally, there are two methods of chopping data sets (or streams) into smaller segments. The more common one relies on fixed-length block division and is particularly popular among backup software suppliers that include deduplication as a special feature within their offerings. This approach typically works best when general-purpose hardware carries out the deduplication, because it requires less compute power.

The key disadvantage is a lack of flexibility: the backup programs only treat data sets as identical if they are completely congruent with each other. Thus, any change in size to one part of the data set shifts all subsequent blocks the next time the data set is stored or transmitted, as you can see in Fig. 2: Here, a small addition to block A not only transforms it into block E, but also alters all the segments that follow. In other words, the backup software 'sees' and keeps two separate data sets (eight blocks in total) that are 99.9 percent identical, merely because of these marginal discrepancies. It's pretty self-evident that this works against the very purpose of deduplication – substantially reducing the amount of data a company must retain and move around.

Fig. 2: Dividing Data Sequences into Fixed-Length Blocks

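The effect shown in Fig. 2 is easy to reproduce. The following sketch is a toy illustration only (the function names, the 4-byte block size, and the sample data are all chosen for demonstration, not taken from any vendor's implementation): it chops a byte stream into fixed-length blocks and shows how a small insertion at the front shifts every block boundary.

```python
def fixed_chunks(data: bytes, size: int = 4) -> list[bytes]:
    """Split a byte stream into fixed-length blocks."""
    return [data[i:i + size] for i in range(0, len(data), size)]

original = b"AAAABBBBCCCCDDDD"
modified = b"xxAAAABBBBCCCCDDDD"  # two bytes added at the front

blocks_a = fixed_chunks(original)  # [b'AAAA', b'BBBB', b'CCCC', b'DDDD']
blocks_b = fixed_chunks(modified)  # [b'xxAA', b'AABB', b'BBCC', b'CCDD', b'DD']

# The insertion shifted every boundary: not a single block matches,
# so a deduplicating store would have to keep all nine unique blocks.
shared = set(blocks_a) & set(blocks_b)
```

Although the two streams are nearly identical, `shared` comes out empty: fixed-length division cannot recognize content that has merely moved.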


To rule out this undesired effect, Fujitsu's deduplication technology divides the data sets into segments of variable length, using a pattern recognition method that detects identical blocks and block boundaries in different locations and contexts. Hence the boundaries may "float," so that changes in one part of the dataset have little or no impact on the storage solution's ability to recognize the rest as an exact match. As a result, it only adds a new copy and metadata for the block that has actually been modified, but leaves the rest untouched (see Fig. 3).

Fig. 3: Applying Variable-Length Segmentation



Here, block A changes to E after new data is added, whereas blocks B, C, and D are all recognized as identical to the same blocks in the first line. If we stored both sequences, we would keep only five blocks instead of eight, i.e. 37.5 percent less than in the first example. In summary, deduplication based on variable block lengths yields much better results and helps companies achieve greater space, time and cost savings. But it also requires more processing power than regular general-purpose hardware has to offer. This is exactly the kind of performance users can expect from Fujitsu's ETERNUS CS800 Data Protection Appliance, which will be the topic of part 3.
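The five-versus-eight arithmetic can be reproduced with a toy sketch of content-defined chunking. To be clear, this is not the ETERNUS CS800's actual pattern-recognition method: real implementations typically derive boundaries from a rolling hash over the content, while this deliberately simplified version just cuts after an anchor byte. The principle it demonstrates is the same, though: boundaries follow the content, so they "float" with it.

```python
def content_chunks(data: bytes, anchor: int = ord("\n")) -> list[bytes]:
    """Toy content-defined chunking: cut after every anchor byte.
    Because boundaries depend on the content, not the offset,
    an insertion only disturbs the block it lands in."""
    chunks, start = [], 0
    for i, byte in enumerate(data):
        if byte == anchor:
            chunks.append(data[start:i + 1])
            start = i + 1
    if start < len(data):
        chunks.append(data[start:])
    return chunks

original = b"AAAA\nBBBB\nCCCC\nDDDD\n"
modified = b"xxAAAA\nBBBB\nCCCC\nDDDD\n"  # data added inside the first block

blocks_a = content_chunks(original)  # [b'AAAA\n', b'BBBB\n', b'CCCC\n', b'DDDD\n']
blocks_b = content_chunks(modified)  # only the first block differs

# Storing both sequences needs just five unique blocks instead of eight.
unique = set(blocks_a) | set(blocks_b)
```

All blocks after the modified one are recognized as exact matches, so the deduplicated store grows by a single new block, just as in Fig. 3.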

Walter Graf

Stefan Bürger


About the Author:

Walter Graf

Principal Consultant Storage at Fujitsu EMEIA

About the Second Author:

Stefan Bürger

Head of Tape Automation at Fujitsu

