Oct 18 2014

Data Deduplication with FUJITSU ETERNUS CS800, Pt. 4: Your Entry into Cloud Computing


In recent years, data deduplication has become one of the most popular and, at the same time, most easily misunderstood storage technologies. Misled by industry buzz, customers invested heavily in deduplication-ready platforms only to find that the new systems could hardly deliver the promised data reduction rates of 20:1 or more. The final part of our four-part blog series explains why the FUJITSU ETERNUS CS800 is an ideal platform for building cloud infrastructures and private clouds.

In the previous chapters, we have discussed the technical aspects of deduplication, from the general concept to the advantages of a hardware-based approach and deploying the FUJITSU ETERNUS CS800 appliance. These explanations were necessary so we can assess its practical value, i.e. the way it helps data centers optimize standard procedures to reach new levels of flexibility and agility. Along with its massive, space- and time-saving impact on storage and backup, deduplication also yields particularly profound effects with regard to replication and recovery.

Applying Deduplication to Replication
As we have pointed out before, hardware-based deduplication carried out on the ETERNUS CS800 appliance enables IT departments not only to lower their disk capacity requirements by up to 95%, but also to dramatically reduce the bandwidth needed to copy data over the network. This is particularly useful if a company maintains a 'standby system' at a remote site that can take over and run all IT processes in case the primary site suffers an outage. The same goes for large organizations whose branch offices regularly send copies of their files to a 'main data center' in the cloud. To achieve this, companies need to ensure that all data sets are always identical in both locations, i.e. they must replicate the information.

Generally, there are two approaches to replication, a synchronous and an asynchronous one. Synchronous replication, also known as mirroring, continuously maintains two primary, active data sets in the same state by transferring blocks between two storage systems at each I/O cycle. The main benefit of mirroring is that it allows for very rapid failovers if the primary site crashes. The downside is that such a setup considerably reduces overall performance, because servers at the primary site cannot continue to operate until they get a signal that both the local and remote write are complete. IT departments can mitigate the effect by setting up high-speed links between the two locations as well as caches for the replication software and applying proper management procedures, but of course this requires additional investments. Therefore, synchronous replication is typically reserved for very high value primary data in transaction-oriented applications that must remain available at all times.

Asynchronous replication also uses two storage systems in separate remote locations; in this case, the second system is allowed to 'lag behind' the primary one by some period of time. This works particularly well with non-dynamic, point-in-time images, including backup images – that is, whenever data sets don't need to be perfectly in sync. Asynchronous replication requires less bandwidth than mirroring and has far less impact on the primary applications; moreover, it is much easier to implement, provides protection against different classes of faults and reduces the use of removable media. In addition, the systems may always be configured in a way that the second one only falls behind one or two I/O cycles so IT departments get a near-perfect "mirror" at a lower cost and with much less effort.
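The performance difference between the two approaches can be illustrated with a small back-of-the-envelope model. This is only a sketch under stated assumptions: the function names and all latency figures are hypothetical, and the model assumes that the local and remote writes are issued in parallel and that the server waits for the slower of the two acknowledgements.

```python
def sync_write_ms(local_ms: float, remote_ms: float, round_trip_ms: float) -> float:
    """Latency seen by the server with synchronous replication (mirroring).

    Assumption: local and remote writes start at the same time, and the
    server may continue only after both have been acknowledged.
    """
    return max(local_ms, round_trip_ms + remote_ms)


def async_write_ms(local_ms: float) -> float:
    """Latency seen by the server with asynchronous replication.

    The remote copy is allowed to lag behind, so only the local write counts.
    """
    return local_ms


# Hypothetical figures: 0.5 ms per disk write, 20 ms WAN round trip.
print(sync_write_ms(local_ms=0.5, remote_ms=0.5, round_trip_ms=20.0))
print(async_write_ms(local_ms=0.5))
```

In this simplified model the link round trip dominates every synchronous write, which is why mirroring usually calls for high-speed links between the sites.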

As we have seen before, data deduplication substantially reduces the capacity and bandwidth requirements imposed on storage environments. The very same quality also turns it into a natural ally for synchronous as well as asynchronous replication, mainly because at a basic level, dedup-enabled replication works in the same way as dedup-enabled data stores: once an image of the original information is created, all it takes to keep the replica in sync with the source is the periodic copying and movement of new data segments added during daily operations, along with their metadata. Backup data lends itself particularly well to replication, since it provides a point-in-time copy of primary data that is separated from the primary applications.

Technically, the process is rather simple. Administrators start by copying all data segments from one division of a source appliance to an equivalent division in a target appliance. This initial transfer can occur over a network, but due to the sheer amount of data involved, it's often easier to co-locate source and target devices during this 'first run' or to transfer the data using tape. After the source and target are synchronized, the replication process only sends the new data segments for each new backup event written to the source. In other words, if only 1% of the source data is modified, then only that 1% needs to go over the wire, and the required bandwidth amounts to roughly 1/100 of what would be needed to replicate the entire data set from the source. As a result, replication can be carried out via standard WAN connections, and companies can get rid of the expensive high-speed links mentioned above.
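The 1% example can be worked through in a few lines. The following is a minimal sketch, not a sizing tool: the 10 TB data set and the 100 Mbit/s WAN link are hypothetical figures chosen for illustration, and the calculation ignores metadata and protocol overhead.

```python
def replication_bytes(total_bytes: int, changed_fraction: float) -> int:
    """Bytes that must cross the network when only changed segments are sent."""
    if not 0.0 <= changed_fraction <= 1.0:
        raise ValueError("changed_fraction must be between 0 and 1")
    return round(total_bytes * changed_fraction)


def transfer_hours(num_bytes: int, link_mbit_per_s: float) -> float:
    """Idealized transfer time in hours over a link of the given speed."""
    seconds = (num_bytes * 8) / (link_mbit_per_s * 1_000_000)
    return seconds / 3600


full_set = 10 * 10**12                         # hypothetical 10 TB source data set
delta = replication_bytes(full_set, 0.01)      # only 1% of the data was modified

print(f"full copy : {transfer_hours(full_set, 100):.1f} h")
print(f"delta only: {transfer_hours(delta, 100):.1f} h")
```

Under these assumptions the delta amounts to 100 GB and takes about one hundredth of the time a full copy would need, which is what makes a standard WAN link sufficient.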

With an ETERNUS CS800 in place, organizations may reduce bandwidth requirements even further thanks to a two-step checking process that precedes the actual replication: before any data is sent, the replication software sends the target device a list of the blocks available for replication (see Fig. 5). The target device then checks this list against the index of data segments it has already stored, and returns a list of elements that are not yet locally available and must be transmitted if the source and the replica are to stay in sync. Afterwards, the source only sends copies of these 'required' data segments over the network. The entire replication process is fully transparent, starts as soon as the first backup data is written to the source, and ends after the metadata for the new backup image has been transmitted. At that point, the backup image is available for recovery from the target.
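The two-step check described above can be sketched in a few lines of Python. This is an illustration of the general idea only, not the ETERNUS CS800 implementation: the `Target` class, the `replicate` function, the use of SHA-256 fingerprints and the pre-cut segments are all assumptions made for the example.

```python
import hashlib


def fingerprint(segment: bytes) -> str:
    """Identify a data segment by its SHA-256 digest (assumed for this sketch)."""
    return hashlib.sha256(segment).hexdigest()


class Target:
    """Hypothetical target appliance with a segment index and image metadata."""

    def __init__(self) -> None:
        self.index: dict[str, bytes] = {}        # fingerprint -> stored segment
        self.images: dict[str, list[str]] = {}   # image name -> ordered fingerprints

    def missing(self, offered: list[str]) -> list[str]:
        """Step 2: return the offered fingerprints that are not stored locally."""
        needed: list[str] = []
        for fp in offered:
            if fp not in self.index and fp not in needed:
                needed.append(fp)
        return needed

    def receive(self, segments: dict[str, bytes], image: str, recipe: list[str]) -> None:
        """Store the required segments, then the metadata of the backup image."""
        self.index.update(segments)
        self.images[image] = list(recipe)

    def restore(self, image: str) -> bytes:
        """Rebuild a backup image from stored segments and its metadata."""
        return b"".join(self.index[fp] for fp in self.images[image])


def replicate(image: str, segments: list[bytes], target: Target) -> int:
    """Replicate one backup image; return how many segments were transmitted."""
    recipe = [fingerprint(s) for s in segments]
    by_fp = dict(zip(recipe, segments))
    needed = target.missing(recipe)               # step 1: offer list, step 2: reply
    target.receive({fp: by_fp[fp] for fp in needed}, image, recipe)
    return len(needed)


target = Target()
print(replicate("monday", [b"a", b"b", b"c"], target))   # first run sends everything
print(replicate("tuesday", [b"a", b"b", b"d"], target))  # only the new segment is sent
```

In the second call only the segment the target has not seen before crosses the network, while the image metadata still lets the target restore the complete backup.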



Fig. 5: Two-Step Checking Process Before Replication



The ETERNUS CS800 replication software also allows multiple source appliances to point to the same target device, and replication normally takes place on a partition-to-partition basis. All the replicas on the target are stored in a common deduplication pool, meaning that duplicates are eliminated across all backup streams (images) that arrive from the various sources. In other words, if the same blocks were backed up at source sites A and B, they will only be stored once if both sites point to the same target appliance. Likewise, the pre-transmission checks ensure that identical data segments backed up on different days at different source sites will not be transmitted again over the network. Only the metadata needs to be sent and stored. Thus, the checks help reduce the bandwidth needed for replication in distributed environments where users work on similar file sets.
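The effect of a common deduplication pool can be shown with a small model. Again, this is a sketch under assumptions rather than a description of the appliance's internals: the `SharedPool` class, the site and image names, and the SHA-256 fingerprints are hypothetical.

```python
import hashlib


class SharedPool:
    """Hypothetical common deduplication pool on one target appliance."""

    def __init__(self) -> None:
        self.segments: dict[str, bytes] = {}                 # fingerprint -> segment
        self.images: dict[tuple[str, str], list[str]] = {}   # (site, image) -> metadata

    def ingest(self, site: str, image: str, segments: list[bytes]) -> int:
        """Record an image from a source site; return the segments transmitted."""
        transmitted = 0
        recipe: list[str] = []
        for seg in segments:
            fp = hashlib.sha256(seg).hexdigest()
            recipe.append(fp)
            if fp not in self.segments:    # unknown to the pool: must be sent and stored
                self.segments[fp] = seg
                transmitted += 1
        self.images[(site, image)] = recipe    # metadata is always sent and stored
        return transmitted


pool = SharedPool()
print(pool.ingest("A", "day1", [b"os", b"docs", b"a-only"]))   # site A sends all three
print(pool.ingest("B", "day2", [b"os", b"docs", b"b-only"]))   # site B sends one
print(len(pool.segments))                                       # unique segments stored
```

Although the two sites back up six segments between them, the pool holds only four, and site B transmits just the one segment that site A had not already delivered.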

Flexible Disaster Recovery
Lastly, deduplication's many benefits also have a very positive impact on disaster recovery (DR). Just like dedup-enabled replication, DR now no longer requires dedicated high-speed links; instead, it may be carried out over a standard WAN connection, thus making the process both more feasible and less expensive. What's more, organizations will find it easier to prepare for DR scenarios and be able to drastically minimize the use of removable media, such as tape or RDX, in creating emergency copies. Plus, they are no longer restricted to 'classic' DR architectures in which data sets are restored from a specific remote site. Instead, they may now choose a setup where restores to the source site are carried out via remote servers at a cloud data center. Alternatively, they could opt to keep standby appliances at the main site, which may be filled with an exact replica of the required data sets (up to the point when the outage occurred) and then sent back to the source site. With these additional options, organizations not only gain greater DR flexibility, but also the security that an adequate DR method is always available.

Walter Graf

Stefan Bürger


About the Author:

Walter Graf

Principal Consultant Storage at Fujitsu EMEIA

About the second Author:

Stefan Bürger

Head of Tape Automation at Fujitsu

