Understanding Deduplication. Complete Guide to Deduplication Methods & Their Impact on Storage and VM Backups
When considering your VM backup solution, key features such as deduplication are incredibly important, not only from a cost perspective but also from an operational one. While it is true that deduplicating your backup data can deliver considerable cost savings for your business, it is also true that the wrong type of deduplication can hurt performance and contribute to a negative end-user experience.
This article will explore the various deduplication types including general Inline Deduplication and Altaro’s Augmented Inline Deduplication for your VM Backup Storage. We'll also cover deduplication concerns such as software interoperability, disk wear, performance and other important areas. Let's now take a look at what’s covered:
- Deduplication Basics
- File-Based Deduplication
- Block-Level Deduplication
- Post-Process Deduplication
- Inline Deduplication
- Augmented Inline Deduplication
- Deduplication Gotchas
- Software Interoperability
- Disk Wear
- The Impact of Augmented Inline Deduplication for VM Backups
In fundamental terms, deduplication is the process of minimizing the amount of physical storage required for your data. In this article, we are using your VM backups as the data subject.
While physical storage costs are improving year on year, storage is still a considerable cost for any organization, which is why deduplication techniques are being built into common data handling products such as backup software for your Virtual Machines.
There are various forms of deduplication available and it’s imperative to understand each one as all of them have various cost-saving vs performance trade-offs.
File-based deduplication was popular in the early days of deduplication, but this method’s shortcomings quickly became apparent. With this method, files are examined and checked to ensure that identical files are not stored a second time. The problem is that much of a file's content can be identical to that of other files even though the files have different names and time-stamps. Any file-level difference, however small, causes the deduplication engine to mark the file as unique, which forces the whole file to be backed up.
The end result is a significant amount of data being backed up multiple times, reducing the efficiency of the file-based deduplication engine.
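To make the limitation concrete, here is a minimal sketch of whole-file deduplication. It assumes files are compared by a SHA-256 digest of their full content; the file names and contents are made up for illustration and do not reflect any particular product.

```python
import hashlib


def file_level_dedup(files: dict[str, bytes]) -> dict[str, bytes]:
    """Keep one copy per unique whole-file digest (file names are ignored)."""
    store: dict[str, bytes] = {}
    for content in files.values():
        digest = hashlib.sha256(content).hexdigest()
        store.setdefault(digest, content)
    return store


# Hypothetical sample files: 8 KiB of shared content.
base = b"A" * 4096 + b"B" * 4096
files = {
    "report.doc": base,
    "report_copy.doc": base,       # identical content: stored once
    "report_v2.doc": base + b"!",  # one extra byte: stored again in full
}

store = file_level_dedup(files)
stored_bytes = sum(len(c) for c in store.values())
logical_bytes = sum(len(c) for c in files.values())
print(f"unique files: {len(store)}, stored: {stored_bytes} of {logical_bytes} bytes")
```

The exact copy is deduplicated, but the file that differs by a single byte is stored again in its entirety.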
Block-level deduplication is the evolution of file-based deduplication and addresses the shortcomings of its predecessor. With this method, the deduplication engine examines the raw blocks of data on the filesystems themselves. By concentrating on the raw blocks, the engine no longer depends on which file a block belongs to and can recognise identical blocks wherever they appear.
The end result is a very efficient and intelligent deduplication engine that is capable of saving more space on the backup target.
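The idea can be sketched in a few lines. This is an illustrative model only, assuming fixed 4 KiB blocks identified by SHA-256 digests; the VM names and block contents are hypothetical, and real engines add indexing, persistence and integrity checks.

```python
import hashlib

BLOCK_SIZE = 4096  # assumed block size, for illustration only


def block_level_dedup(images: dict[str, bytes]):
    """Split each image into blocks and store every unique block once."""
    store: dict[str, bytes] = {}          # digest -> block content
    recipes: dict[str, list[str]] = {}    # image name -> ordered block digests
    for name, data in images.items():
        recipe = []
        for offset in range(0, len(data), BLOCK_SIZE):
            block = data[offset:offset + BLOCK_SIZE]
            digest = hashlib.sha256(block).hexdigest()
            store.setdefault(digest, block)
            recipe.append(digest)
        recipes[name] = recipe
    return store, recipes


def restore(name: str, store: dict[str, bytes], recipes: dict[str, list[str]]) -> bytes:
    """Rebuild an image from its recipe of block digests."""
    return b"".join(store[d] for d in recipes[name])


block_a, block_b, block_c = (bytes([c]) * BLOCK_SIZE for c in b"ABC")
images = {
    "vm1": block_a + block_b,
    "vm2": block_a + block_b + block_c,
    "vm3": block_a + block_c,
}
store, recipes = block_level_dedup(images)
logical_blocks = sum(len(r) for r in recipes.values())
print(f"stored {len(store)} unique blocks for {logical_blocks} logical blocks")
```

Each image is kept as a list of digests (a "recipe"), so shared blocks are stored once and every image can still be rebuilt exactly.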
To help better understand the block-level deduplication engine, we’ve included an example of how this process works with Altaro’s VM Backup solution. Our example consists of 3 VMs and the diagram shows how each VM’s data is broken into different blocks (A to E).
In Phase 1, block-level deduplication is performed within each VM individually, saving a significant 110GB of space across all VMs. In Phase 2, block-level deduplication is performed across all VMs together, achieving a further 118GB reduction in storage space.
So far, Altaro’s VM Backup has saved 228GB of storage space which represents an impressive 47% reduction of VM Backup storage! In Phase 3, the deduplicated data is compressed to just 151GB and transferred to the backup storage.
As noted in the diagram above, the overall VM backup storage requirements have been reduced from 481GB to just 151GB, a reduction of roughly 68.6% that lets you keep more backups using much less storage space.
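The percentages in this example can be rechecked from the four figures quoted in the text (481GB original, 110GB and 118GB saved, 151GB after compression). The variable names below are illustrative.

```python
original_gb = 481
phase1_saved_gb = 110   # deduplication within each VM
phase2_saved_gb = 118   # deduplication across all VMs
compressed_gb = 151     # size after Phase 3 compression

dedup_saved_gb = phase1_saved_gb + phase2_saved_gb          # 228
after_dedup_gb = original_gb - dedup_saved_gb               # 253
dedup_reduction_pct = 100 * dedup_saved_gb / original_gb    # about 47.4
total_reduction_pct = 100 * (original_gb - compressed_gb) / original_gb  # about 68.6

print(f"saved by deduplication: {dedup_saved_gb} GB ({dedup_reduction_pct:.1f}%)")
print(f"left before compression: {after_dedup_gb} GB")
print(f"overall reduction: {total_reduction_pct:.1f}%")
```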
Compared to other deduplication options, post-process deduplication is a simpler form. All VM backup data is sent to the target storage device for your backups. Afterwards, on a schedule, a process runs on the backup device to remove duplicated data.
While this is simple in that no agents are required on your Virtual Machines, your target backup device needs to be large enough to hold all of the backup data before it is deduplicated. Only once the data has landed and the process has run will you see a reduction, ideally in time for the next day’s backups.
Post-process deduplication is also problematic because you might need to enforce a “blackout window”: a period of time during which you should not perform any backups because the storage device is busy moving data around and running the deduplication process.
The benefit of post-process deduplication, though, is that it does not only deduplicate per VM or per backup window. Depending on the implementation and vendor, it will often deduplicate across all backed-up data. This can yield a massive space saving, but only after the deduplication process has run.
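The capacity implication can be illustrated with a toy model. This is a sketch under simplifying assumptions: the class name `PostProcessTarget`, the 1 KiB blocks and the three sample backups are all hypothetical, and the metadata needed to restore each backup is left out.

```python
import hashlib


class PostProcessTarget:
    """Toy backup target: data lands in full, a later pass removes duplicates."""

    def __init__(self) -> None:
        self.landing: list[bytes] = []      # raw blocks exactly as written
        self.store: dict[bytes, bytes] = {}  # digest -> unique block after the pass
        self.peak_bytes = 0

    def used_bytes(self) -> int:
        landed = sum(len(b) for b in self.landing)
        stored = sum(len(b) for b in self.store.values())
        return landed + stored

    def write(self, block: bytes) -> None:
        self.landing.append(block)
        self.peak_bytes = max(self.peak_bytes, self.used_bytes())

    def run_dedup_pass(self) -> None:
        """Scheduled job: fold landed blocks into the deduplicated store."""
        for block in self.landing:
            self.store.setdefault(hashlib.sha256(block).digest(), block)
        self.landing.clear()


target = PostProcessTarget()
os_block = b"O" * 1024                 # same operating system data in every VM
for vm in range(3):
    target.write(os_block)
    target.write(bytes([vm]) * 1024)   # data unique to this VM

used_before = target.used_bytes()
target.run_dedup_pass()
used_after = target.used_bytes()
print(f"before pass: {used_before}, after pass: {used_after}, peak: {target.peak_bytes}")
```

The target must be sized for the peak figure, not the figure after the pass, which is the sizing concern described above.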
Inline Deduplication is an intelligent form of deduplication because it usually runs deduplication algorithms (processing) as the data is being sent to the target storage device. In some cases, the data is processed before it is sent along the wire.
In these scenarios, you can benefit from a target storage device with a lower storage capacity than traditionally required, reducing your backup storage target costs. Depending on the type of data being backed up and the efficiency of your vendor’s deduplication technology, savings can vary significantly.
Consider a scenario where you are backing up the same operating system a hundred or more times: deduplication savings would be expected to be quite good.
Since inline deduplication does not run on the target storage device, the performance degradation on the device is typically lower than with other methods. This leaves more throughput available for backups to run, helping your VM backups complete within their scheduled backup windows.
The main benefits of inline deduplication are that your target storage device can have a lower capacity than originally required, that additional similar workloads will not add much data to the target, and that storage target performance is better than with other deduplication options. You also benefit from less disk wear, which is a concern for both HDD and SSD drive types.
One of the drawbacks, though, is that depending on the implementation, inline deduplication might not compare the data in your VM backup job against all data already on the target storage array. Deduplication could be limited to a per-VM or per-job scope, resulting in lower deduplication benefits than other methods.
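The effect of deduplication scope can be sketched as follows. The function `inline_backup`, the job layout and the block sizes are hypothetical; the point is only to show how a per-job index compares with one shared across all jobs.

```python
import hashlib


def inline_backup(jobs: dict[str, list[bytes]], global_scope: bool) -> int:
    """Return how many bytes are actually sent to the target.

    Blocks are checked against an index of digests before being sent.
    With global_scope=False the index is reset for every job.
    """
    sent = 0
    seen: set[bytes] = set()
    for blocks in jobs.values():
        if not global_scope:
            seen = set()  # per-job index: blocks from other jobs are forgotten
        for block in blocks:
            digest = hashlib.sha256(block).digest()
            if digest not in seen:
                seen.add(digest)
                sent += len(block)
    return sent


os_block = b"O" * 1024
# Each hypothetical job backs up two VMs with the same OS plus some unique data.
jobs = {f"job{i}": [os_block, os_block, bytes([i]) * 1024] for i in range(3)}

logical = sum(len(b) for blocks in jobs.values() for b in blocks)
per_job = inline_backup(jobs, global_scope=False)
across_all = inline_backup(jobs, global_scope=True)
print(f"logical: {logical}, per-job scope: {per_job}, global scope: {across_all}")
```

With a per-job index, the shared operating system block is sent once per job; with a global index, it is sent once in total.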
Augmented Inline Deduplication
Augmented inline deduplication is an implementation of inline deduplication used by Altaro’s VM backup solution.
In this implementation, variable block sizes are used to maximise deduplication efficiency. This is achieved with very low memory and CPU requirements, so more backups fit in the same space than would be possible without deduplication.
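The article does not describe how Altaro chooses its variable block sizes, so the following is not their algorithm. It is a generic sketch of one well-known way to get variable-sized blocks, content-defined chunking with a simple rolling "gear" hash, and every constant in it is an assumption chosen for illustration. It compares how many chunks survive unchanged when a single byte is inserted at the start of the data.

```python
import hashlib

# Per-byte "gear" values derived from SHA-256 so the sketch is deterministic.
GEAR = [int.from_bytes(hashlib.sha256(bytes([i])).digest()[:4], "big") for i in range(256)]
MIN_SIZE, MAX_SIZE = 32, 256   # assumed chunk-size bounds
MASK = 0xFC000000              # cut when the top 6 hash bits are zero


def variable_chunks(data: bytes) -> list[bytes]:
    """Cut chunks where the rolling hash of recent bytes matches a pattern."""
    chunks, start, h = [], 0, 0
    for i, byte in enumerate(data):
        h = ((h << 1) + GEAR[byte]) & 0xFFFFFFFF
        size = i - start + 1
        if (size >= MIN_SIZE and (h & MASK) == 0) or size >= MAX_SIZE:
            chunks.append(data[start:i + 1])
            start, h = i + 1, 0
    if start < len(data):
        chunks.append(data[start:])
    return chunks


def fixed_chunks(data: bytes, size: int = 64) -> list[bytes]:
    """Cut chunks at fixed offsets."""
    return [data[i:i + size] for i in range(0, len(data), size)]


def shared(a: list[bytes], b: list[bytes]) -> int:
    """Count distinct chunks that appear in both lists."""
    return len(set(a) & set(b))


# Deterministic pseudo-random test data, then the same data shifted by one byte.
data = b"".join(hashlib.sha256(i.to_bytes(4, "big")).digest() for i in range(500))
shifted = b"\x00" + data

fixed_shared = shared(fixed_chunks(data), fixed_chunks(shifted))
variable_shared = shared(variable_chunks(data), variable_chunks(shifted))
print(f"chunks reused after a 1-byte insert: fixed={fixed_shared}, variable={variable_shared}")
```

Because cut points depend on the content rather than on absolute offsets, the variable-size chunker realigns after the inserted byte, while every fixed-size block shifts and no longer matches.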
Another important consideration here is that less bandwidth is required to ship your VM backup data to the backup storage system. If your backup infrastructure is located in a different building or geographic location, bandwidth can get expensive. Now that data is deduplicated before it is sent across the wire, the bandwidth requirements are reduced significantly.
Altaro’s implementation is impressive because, although it is a form of inline deduplication, it promises deduplication across all backed-up data rather than only within a single VM or job.
In the graphic below we can see that data is shipped to a central backup target from various Virtual Machines. While this is happening, deduplication processes are running.
The benefits of such a solution are clear:
- Very Fast backups. There is no storage performance lost as there are no post-processes running on the storage target.
- Excellent deduplication rates. Deduplication occurs between the source data and ALL data on the backup target. If the data is already in the backup storage device, it will not be copied to the destination storage again, saving space.
- No operational overhead. There are no agents to install or manage. Installation of the feature is a simple checkbox.
- No additional SSD or HDD wear on the target. Since there are no post-processes there is no “double touch” of the backed up data. This significantly reduces the wear on HDDs and SSDs resulting in fewer disk failures.
It is tempting to assume that if your backup software comes with deduplication as standard, there is no reason not to use it. This assumption is incorrect! You must consider the type of deduplication in use and the overall impact it has on your backup systems.
A key consideration when analysing backup solutions is feature interoperability. Some backup vendors will not support deduplication with other features. An example of this is a storage device which runs post-process deduplication combined with backup software that supports instant VM recovery.
Instant VM recovery, direct from the backup target can be a very beneficial feature for your business, however, you must ensure that the vendor supports this feature on deduplicated storage targets (if this is the type of system your business has in place.)
From a performance perspective, there is no point in having a smart deduplication system if it slows your backups down to the point where you cannot complete them. Be sure to trial deduplication features to correctly assess the performance impact on your platforms. Also ensure that there is little or no impact on production Virtual Machines. Post-process deduplication runs on the backup target and so should not affect production workloads, but inline deduplication can, so it should be tested.
A quick way to check performance is to compare backup times from before deduplication was enabled with backup times afterwards. From there you can weigh the cost savings against the performance impact to decide which is better for your business.
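As a minimal sketch of that comparison, the snippet below computes the percentage change in average backup duration. The durations are placeholder values invented for illustration; substitute the figures from your own backup job history.

```python
from statistics import mean


def duration_change_pct(before_minutes: list[float], after_minutes: list[float]) -> float:
    """Percentage change in mean backup duration (positive means slower)."""
    before, after = mean(before_minutes), mean(after_minutes)
    return 100 * (after - before) / before


# Placeholder durations in minutes, not measurements from any real system.
before_dedup = [62, 58, 60]
after_dedup = [66, 69, 63]

change = duration_change_pct(before_dedup, after_dedup)
print(f"average backup time changed by {change:+.1f}%")
```

Comparing several runs on each side, rather than a single backup, helps smooth out day-to-day variation in the amount of changed data.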
Take a look at the SMART data for your disks after deduplication has been enabled for an extended period of time. If the projected wear-out time of your SSDs has dropped significantly, consider an inline deduplication feature rather than post-process.
If enabling deduplication means installing, upgrading and generally managing agents everywhere, consider another solution which does not require agents. Agents will also consume CPU and Memory which can negatively impact the end-user experience of your applications.
For post-process deduplication ensure you are not limited to time windows for your backups and restores. Also, check the performance of this feature, especially on large backup targets.
The Impact of Augmented Inline Deduplication for VM backups
Deploying a VM backup solution that uses augmented inline deduplication is a great idea if you have limited space on an existing backup target. It’s also a good fit if you are looking at a more expensive SSD option but do not want to stretch your IT budget to cover capacity that would simply hold multiple copies of the same Virtual Machine data.
An example of some of the storage savings can be seen in the below graphic:
Most organizations have multiple Virtual Machines with the same operating system. A typical Windows Server can have around 20GB of data just for the operating system. Consider hundreds of similar VMs with daily backups and long retention policies: the savings can be considerable.
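A back-of-the-envelope estimate shows the scale involved. This is an idealised upper bound, assuming the operating system data is byte-for-byte identical on every VM and in every retained full backup, which real environments only approximate (patch levels and configuration differ). The VM count and retention figure are assumptions; only the 20GB figure comes from the text.

```python
def os_footprint_gb(vm_count: int, os_gb: float, retained_fulls: int, deduplicated: bool) -> float:
    """Space taken by operating system data in full backups.

    Idealised: with deduplication, identical OS data is stored exactly once.
    Incremental backups are ignored.
    """
    if deduplicated:
        return os_gb
    return vm_count * os_gb * retained_fulls


vm_count = 100        # assumed number of similar VMs
os_gb = 20            # OS size quoted in the text
retained_fulls = 4    # assumed number of full backups kept per VM

without_dedup = os_footprint_gb(vm_count, os_gb, retained_fulls, deduplicated=False)
with_dedup = os_footprint_gb(vm_count, os_gb, retained_fulls, deduplicated=True)
print(f"OS data without deduplication: {without_dedup} GB, with: {with_dedup} GB")
```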
Unlike physical machines, VMs do not usually require additional agents for deduplication or backups to run - there are some exceptions of course.
In this article we covered the basics of deduplication and analyzed Post-Process Deduplication, Inline Deduplication and Augmented Inline Deduplication. Furthermore, we explained the strengths and weaknesses of each deduplication method and provided examples of how organizations can leverage deduplication for their VM backups to save space and money.
To wrap up, there are almost no reasons to ignore a deduplication-capable VM backup solution when choosing your backup platform. There are some caveats depending on your business and technical requirements, but there are several options available to get started with deduplication.
Fortunately, Altaro’s Augmented Inline Deduplication features are a good fit for most scenarios and are available at a competitive price point.
Remember, when selecting your VM backup solution, consider the limitations of the various kinds of deduplication and go with what works best for your business.