Sep 10 2014

Data Deduplication with FUJITSU ETERNUS CS800, Pt. 1: Methodology Recap

[Image: FUJITSU ETERNUS CS800 rack]

In recent years, data deduplication has become one of the most popular and, at the same time, most easily misunderstood storage technologies. Misled by industry buzz, customers invested heavily in deduplication-ready platforms only to find that the new systems could hardly meet the promised data reduction rates of 20:1 or more. Our four-part blog series tries to clear up some of the most frequent misconceptions about deduplication and explains how Fujitsu's data protection appliances avoid its pitfalls. This first part discusses the most important 'dedupe basics.'

Once a mere projection, the so-called data explosion has long turned into a reality for enterprises of all kinds and sizes. Ask any storage administrator, and he'll tell you how tirelessly he works to make sure that the flood of data entering his IT department each day doesn't drown the entire company. Hence it's only logical that nearly every technology promising to prevent such a catastrophe gets a fair chance of being adopted. Sometimes the hopes for improvement are justified, but more often than not the surrounding hype is not. This typically leads to widespread user frustration and a poor reputation, even if the inventors were just trying to help. And that's precisely what might happen with data deduplication if customers continue to see it as a miracle cure for every possible storage issue. It is not – but it can still offer substantial advantages when used in the right application scenarios.

Data Deduplication – the Concept
With these considerations out of the way, let us now focus on how data deduplication works and what it can help companies achieve. This is in fact essential, because the technology only suits certain types of data, data sets and applications, while it is unsuitable for others. As the term indicates, the purpose of deduplication is to reduce the amount of information an organization needs to store in and transmit between various locations. Quite obviously, this works best whenever an application must handle large amounts of identical, repeated data segments, for instance if you need to run full backups or replicate huge databases at regular intervals. In all such cases, data deduplication will usually achieve the advertised deduplication rates of 20:1 or higher. But there are also many scenarios where this is practically impossible: office applications in particular may generate tons of documents that all look a lot alike, but the share of Word, Excel or PowerPoint files that are truly identical reaches 15% at best.
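A quick back-of-envelope calculation illustrates the gap between these two scenarios. The 2% weekly change rate below is an assumption chosen purely for illustration; the 20:1 and 15% figures are the ones discussed above.

```python
# Rough deduplication estimates for two contrasting scenarios.

# Scenario 1: 20 weekly full backups of a 10 TB dataset that changes by an
# assumed 2% between backups, so only the changed blocks add new data.
full_backups = 20
dataset_tb = 10
weekly_change = 0.02                                   # assumption, for illustration
logical_tb = full_backups * dataset_tb                 # what the backup application writes
physical_tb = dataset_tb + (full_backups - 1) * dataset_tb * weekly_change
print(f"Repeated full backups: {logical_tb / physical_tb:.1f}:1")

# Scenario 2: an office file share where at best 15% of the documents are
# exact duplicates of other files.
duplicate_share = 0.15
print(f"Office documents:      {1 / (1 - duplicate_share):.2f}:1")
```

Under these assumptions the backup scenario lands in the double-digit range, while the office file share barely exceeds 1.2:1 – which is why the workload matters more than the product.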

Technically, deduplication relies on a data reduction methodology that systematically substitutes reference pointers for redundant blocks (or segments) in a specific data set. This is possible because files and datasets residing on disk-based storage systems are rarely stored in sequential or contiguous blocks. Instead, the segments are kept wherever there is a free spot on a single-disk system, or, in the case of RAID storage, written to multiple blocks that are striped across several disks. Information about where they are located resides in the operating system's file system, where so-called reference pointers serve as 'road signs' indicating where the blocks physically reside. Given this basic structure, it is easy to have multiple pointers in different sets of metadata reference one and the same data block, as shown in Fig. 1 below.
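As a rough sketch of this pointer idea (not the actual ETERNUS CS800 implementation), the following Python snippet splits data into fixed-size blocks, keeps each unique block only once in a shared pool, and records per-dataset metadata as a list of reference pointers; a content hash stands in for the pointer:

```python
import hashlib

BLOCK_SIZE = 4096          # fixed block length, chosen only for this example
block_pool = {}            # fingerprint -> block payload, stored exactly once

def store(data: bytes) -> list[str]:
    """Split data into blocks and return its metadata: a list of pointers."""
    pointers = []
    for i in range(0, len(data), BLOCK_SIZE):
        block = data[i:i + BLOCK_SIZE]
        fingerprint = hashlib.sha256(block).hexdigest()   # acts as the 'pointer'
        block_pool.setdefault(fingerprint, block)         # keep a single instance
        pointers.append(fingerprint)
    return pointers

def restore(pointers: list[str]) -> bytes:
    """Reassemble the original data by following the reference pointers."""
    return b"".join(block_pool[p] for p in pointers)

# Two datasets that share most of their content:
dataset_a = b"A" * 8192 + b"B" * 4096
dataset_b = b"A" * 8192 + b"C" * 4096
md1, md2 = store(dataset_a), store(dataset_b)
assert restore(md1) == dataset_a and restore(md2) == dataset_b
print(len(block_pool), "unique blocks serve", len(md1) + len(md2), "references")
```

Real appliances use far more elaborate fingerprinting and, as Part 2 will discuss, different block lengths, but the principle of replacing repeated blocks with pointers is the same.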

Fig. 1: Data Deduplication Methodology – Overview

Essentially, when a storage pool is first created (A), the application in use also retains one set of metadata (MD1) with pointers to the stored blocks. Whenever new datasets are added (B), a separate metadata image (MD2) is created for each of them, along with the new blocks. In our example, MD1 continues to point to the original blocks, while MD2 points both to some of the original blocks and to the new ones. For each backup event, the system stores a complete metadata image of the dataset, but only new data segments are added to the block pool. In essence, this means that users only have to keep one instance of each unique data block instead of a plethora of copies – the operating systems and applications they work with will swiftly find these pieces of information by simply following the reference pointers.

This concept has long been common knowledge among storage vendors, who used it to develop numerous storage utilities – such as snapshot tools – even before deduplication became popular. But although the underlying methodology is the same, deduplication products often show major disparities in efficiency. If you want to know why that is and what it has to do with different block lengths, please check back to read Part 2 of this blog at the end of the week.
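To make the Fig. 1 walkthrough concrete, here is a minimal, self-contained sketch of that bookkeeping (a hypothetical Python illustration in the same spirit as the snippet above, not Fujitsu's implementation): each backup event stores a complete metadata image pointing into one shared block pool, and only previously unseen blocks enter the pool.

```python
import hashlib

BLOCK_SIZE = 4096
block_pool = {}            # shared pool of unique blocks
metadata_images = []       # one complete pointer list per backup event (MD1, MD2, ...)

def backup(dataset: bytes) -> int:
    """Store one backup event; return how many new blocks entered the pool."""
    pointers, new_blocks = [], 0
    for i in range(0, len(dataset), BLOCK_SIZE):
        block = dataset[i:i + BLOCK_SIZE]
        fingerprint = hashlib.sha256(block).hexdigest()
        if fingerprint not in block_pool:
            block_pool[fingerprint] = block
            new_blocks += 1
        pointers.append(fingerprint)
    metadata_images.append(pointers)       # full metadata image for every event
    return new_blocks

# Initial dataset (A): 100 distinct blocks; changed dataset (B): 10 blocks replaced.
original = b"".join(bytes([i]) * BLOCK_SIZE for i in range(100))
changed = original[:90 * BLOCK_SIZE] + b"".join(bytes([200 + i]) * BLOCK_SIZE for i in range(10))

print(backup(original))    # first event: all 100 blocks are new
print(backup(changed))     # second event: only the 10 changed blocks are added
print(len(metadata_images), "metadata images share", len(block_pool), "unique blocks")
```

Both metadata images remain complete and independently restorable, yet the second backup adds only the handful of blocks that actually changed – which is exactly where the advertised reduction rates come from.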

Walter Graf

Stefan Bürger

 

About the Author:

Walter Graf

Principal Consultant Storage at Fujitsu EMEIA

About the Second Author:

Stefan Bürger

Head of Tape Automation at Fujitsu
