Our Cloud Providers offer Disaster Recovery services and backup and restore services using all sorts of technologies, these are often measured against each other for their relevant benefits or shortfalls. Specifically, Disaster Recovery, there are two key technologies that you can use to replicate data; journaling & snapshots, each has its pros and cons. In this blog we will look around each and hopefully detract from some of the myths around both.
Firstly, snapshots. What are
Snapshots are a point in time of the state and data of a Virtual Machine
(VM), they should not be considered a backup. The state includes the VM’s power
state at the time of snapshot and data includes all files and folders that
constitute the VM, including VM file items like memory and vNIC, vmdk etc, this
typically is achieved by ‘stunning’ the VM or quiescing. (Stunning a VM makes
the VM temporarily unavailable for operations like vMotion, snapshot, etc and
unreachable, meaning there is application outage.)
Always consider the ‘state’ of a replication, if the OS does not support
quiescing, then you should be carefully considering the process to recover in
the event of a Disaster to ensure continuity for your application type – do not
Why should they not be considered a backup? Well simply put, they are useful
for a quick rollback point, as they primarily record delta between the original
source disk from a certain point in time, also the parent must be on the same storage
infrastructure for this to work. Backups need to be separately and whole (not
dependant on parent), and certainly not on the same infrastructure they are
protecting from failure. Snapshots are therefore inherently – short term.
Does VMware Cloud Director
Availability use snapshots?
In a nutshell, no, we do not. We use Delta vmdk at the source and Point in Time (PIT) instances at the destination, these are not the same as snapshots. Multiple Point in Time (PIT), mPIT, are based on deltas from the source disk, these are sent via the Host Based Replication agent (HBR) IO Filter on the ESXi host. VMware Cloud Director Availability does not stun the VM for the source replication therefore it is not the same as a snapshot.
vSphere Replication, has several components that are used by VMware
Cloud Director Availability; the Data Mover and the Host Based Replicator (HBR)
services on ESXi hosts, these essentially filter the IO from the running VM to
the lightweight delta syncs, which is then replicated to the target site via a
Replicator Appliance. You can have
multiple PITS (mPITS) and vSphere Replication supports 24 instances – that’s 23
lightweight deltas from the initial sync before they are cycled and
For the initial sync, only the written areas of the ‘base’ disk of the
VM are marked as ‘dirty’ and the whole vmdk is copied, this does cause a small
increase in IO as disk is being replicated, and the time it will take will
depend a lot on the source and target datastore types. From this point on only delta
changes are marked as dirty and needing copy.
Let’s define what is an initial full
When replication is first configured using
VMware Cloud Director Availability for a powered on virtual machine. The entire
contents of a source virtual disk (VMDK) are copied to its target empty virtual
disk (which is created at the target location), these are compared using
checksums. This process identifies the differences between the source and
target virtual disks, which in a new replication is zero, requiring very few
CPU cycles. While checksum comparisons are being calculated and compared,
vSphere Replication will periodically send any differences that were
discovered. The amount of time it takes to complete a full sync primarily
depends on the size of the virtual disk(s) that make up a virtual machine, the
amount of data that must be replicated, and the network bandwidth available for
How about lightweight delta syncs?
After a full sync is completed, the vSphere
Replication agent built into vSphere tracks changes to virtual disks belonging
to virtual machines configured for replication. A lightweight delta sync
replicates these changes on a regular basis depending on the recovery point
objective (RPO) configured for replication of a virtual machine. For example,
changed data in a virtual machine with an RPO of four hours will be replicated every
four hours. Data at the 4hr window will go into the lightweight delta vmdk, so
anything added and removed within the 4hrs will not be in the delta, this
reduces networking traffic as multiple writes to the same block within the RPO
interval will only result in the last value being sent at sync time. This
schedule can change based on a number of factors such as data change rates, how
long each replication cycle takes, and how many virtual machines are configured
for replication on the same vSphere host. vSphere Replication uses an algorithm
that considers these factors and schedules replication accordingly to help
ensure the RPO for each replicated virtual machine is not violated.
Thanks @jhuntervmware for the blog relating to this.
What about performance?
The more lightweight delta sync vmdk you have, the more potential variant
data between the deltas, the longer it will take to restore as the more delta
‘drift’ from the base, the more compute and storage will be needed to consolidate
all the differences. Also, the more you maintain, the more disk space will be
used. If the destination datastore runs out of space, no new delta instances
can be created, but the existing instances are unaffected. When free space becomes
available a new incremental instance is created (albeit likely with some RPO
violation), but there are no re-syncs.
With VMware Cloud Director Availability, we use SLA policies to control
what is allowed for replication retention policies (number of PIT instances) and
an overarching Recovery Point Objective. Remember a PIT is a lightweight delta sync,
a vmdk delta record, but it is
not a whole vmdk as there has been no stunning or cloning, it is a subset of
the chain from the base disk to a designative instance and is an independent
Lightweight vmdk delta sync are cascading and
interdependent, there for each time an instance expires some disk space is
freed. You can retire instances explicitly, or you can urgently reclaim space
after unusual I/O events that create larger instances, by adjusting the Cloud
Director Availability retention policy. This will retire (a.k.a. consolidate)
the instances. Then you can restore your normal retention policy when the space
is available. Providers can monitor all aspects of IAAS consumption and view storage
allocated vs used by tenant organization.
Stored instances – are these backups
Stored Instances in VMware Cloud Director Availability are delta vmdk.
They are deltas that have been ‘taken out’ of the retention cycle and therefore
needs the parent disk to be used for a restore purpose. A VMware Cloud Director
Availability replication can have both stored instances and rotated instances (mPITS)
and these are controlled by policies. These will be relatively fast to restore
in comparison with ongoing deltas (they will be closer to the source parent)
that will grow (‘drift’) as time moves on.
Does consolidating disks help increase
Yes and no. Consolidating takes the base disk and all the deltas into
account, this disk hierarchy consolidation process runs to the restore point
required. It is a compute and storage intensive task to aggregate and compare
the deltas, and from that point on all deltas are created with the consolidated
disk as the parent. For a failover, consolidating when there is a disk
hierarchy of deltas will not be quicker if there are a lot of delta vmdk
to compare, especially if there is a high data change rate. However, from that
point on recovery will be fast as the ‘drift’ between the base vmdk and delta
vmdk will be less. VMware Cloud Director Availability allows you to force
consolidation as a part of recovery, or later regardless of the VM power state.
What about RPO & RTO?
Some vendors state a very fast RPO with journaling, this is true where
every changed data block is written constantly to the target, why then would I
use delta instances? The typical VM generates valuable data that needs to be
replicated, using split IO whereby IO goes to 2 points; the VM and the target
recovery delta or journal, there is virtually zero recovery time objective
(RTO), until data is generated. Of course, data is generated, after all you are
ensuring recovery of an important workload, so there is little chance it will
stay still. When data is generated, journaling, which delivers ultimate
granularity, considerably increases in RTO, the more data written, the longer
the recovery time objective (RTO) to consolidate. With MPITs there is also an
RTO impact, the impact is as big as the time to compute the delta difference
between the source parent and the desired MPIT to restore, unlike journaling,
most of the consolidation will be already done in the delta. As already stated,
the more MPITs in between the longer the RTO. Remember the data is still at a granular
RPO (VMware Cloud Director Availability supports 5min RPO), but the time to
restore (RTO) is longer as compute is needed in both journaling and MPITs to
recover the data.
A good rule of thumb for high data intensive workloads would be to consolidate regularly and keep the number of MPITs low so they are never too far away from the parent base disk. For low data intensive VMs you can significantly expand the number of MPITs and consolidate on rotation as required.
Another point to note about RPO and data generation. When data runs
though the IO Filter, it is directed to the VM datastore and (depending on the
technology you are using) to a lightweight delta vmdk or to a journal. The
route to that destination needs to be fast, especially for journaling and the
traffic volume will be high if every block of data change is being written
24/7. However, Delta instances only execute on the replication schedule, hence
traffic is low and volume of data is low, usually this radically simplifies
networking and makes for a much simpler deployment and operational solution to
Speed of recovery (RTO) and acceptable data loss (RPO) for a VM
Minimizing data loss (RPO), but increasing recovery time (RTO)
What about rehydration?
Just like animals need to rehydrate, so too data, which needs to be
accessible, needs to be decompressed to be useful. Some vendors point out that the
recovery rehydration process is ‘instant’, there is no ‘instant’ in
decompression, which uses compute; affects performance and disk storage space
utilization. When data is rehydrated the data blocks are rendered on the disk
in their original form. This can be ‘optimized’ to negate long term disk impact
by vendors supporting rehydrate on read functions, however this will still
affect the storage used whilst the data is rehydrated.
Journaling has a
rehydration process that is compute and storage intensive (imitates file system
and needs to convert data to vSphere format) so, in summary, there is an impact
on performance to use journaling with rehydration.
VMware Cloud Director Availability uses native vmdk, so there is no need to covert the data to a vmdk format, it is already in this format and hence is optimized out of the box.
look at the recovery point. Journaling tech offers granular flexibility in
recovery, as recovery can be instantiated from any point. Whereas PITs are set
at regular intervals.
Let’s take an example, your VM has a 4hr RPO, you discover ransomware on your VM at 12.30 post infestation (point B) on the graph below.
infection was delivered by a trojan horse at 9AM but did not activate until 13:00.
Therefore, the ransomware trojan horse was replicated in the lightweight delta
sync at Midday, you would need to go back to the previous delta 4hrs before at
8AM, meaning 4hrs of data is lost, but the ransomware would not be present. Is 4hr
data loss acceptable? This is the granularity impact and aligned to RPO.
However, the speed of recovery is important, in this case you have several restore points to choose from. You would start with a restore the previous delta, which in this case is a consolidated vmdk, not a delta, and hence the compute comparison against changed blocks on the consolidated delta would be nothing as there is nothing to compare. This would be a fast recovery, but the Trojan horse is still present. Going back to the next delta would be at 8:00AM and wouldn’t contain the trojan horse, this would need to be compared against the last deltas and consolidated source disk, but given the volume of delta changes is small, again this would be a fast RTO. Although you have some data loss you are up and running again fast.
As a rule, the granularity
RPO should be aligned to the criticality of the VM. VMware Cloud Director
Availability supports a 5min RPO, but the actual recovery time would be the key
difference and that would depend how far the delta is away from the source
consolidated disk. Keeping the number of recovery points low will ensure
a faster recovery as consolidation will happen faster and there is less data
drift between source and the last delta. Our recommendation is to find an
acceptable RPO and number of mPIT, then use backup to take a permanent copy.
Journaling on the other hand provides 100% granularity, with every block changed, replicated, there is potential for zero data loss, but this doesn’t allow for application consistency, as with lightweight delta, unless quiescing is used (and even then) there is no guarantee of success, particularly with memory resident applications. As I mentioned at the start of the blog, ‘always consider the ‘state’ of a replication’.
journaling you can rewind and recover from any point in time, if there was
malware detected in the chart above, you could ‘rewind’ to any point on the
blue line, before the event. You can also do file level restore, which is an
advantage over lightweight delta. As you can see the deltas only run on the 4hr
window and not consistently like journals, hence the closest delta would then
be measured against the source vmdk at the last consolidation. As with
everything there is a tradeoff; journaling can bring granularity, but at what
- Networking: Journaling requires fast
network speeds to copy each byte, this is often done locally to avoid this
overhead, also often journal data will be compressed at source; this is needed
to minimize bandwidth requirements. This gives the added ability to recover locally
also and faster, but this is dependent on the type of disaster you are
recovering from. Faster networks cost more and are more complex, this means you
can expect to spend more for this solution. A core difference with Lightweight
Deltas is that multiple writes to the same block within the RPO interval will
result in only the last value being sent. This reduces the network usage.
- Storage: whilst both use storage, the amount used will depend on the granularity of sync, the more granular the more data needs to be stored. In this case journaling will create the most storage overhead as at the target site the data disk + the journal would be kept, but the flip side is a more granular restore capability.
As the diagram above
shows, whilst the volume of data for journaling
and Delta sync processes is not really a storage factor (more a networking one),
the aggregated storage volume is. The further along the timeframe, the more
there is a difference between aggregated storage of journal vs delta.
Note that this
is not including any deduplication storage technologies at the target site but
is an accurate representation of volume.
Hopefully what I
have laid out is that there is no simple answer to this, much of the
performance depends on the data change rate you will have and the granularity
of restore you want. Don’t be mistaken. There is no clear winner, each approach
to replication and recovery has positive and negative points, and there is a
commercial cost that I have not explored here also.
I think the
major take home to make is that there is no real per second RPO (without synchronous
replication with expensive storage arrays and very costly networking), this
only exists if there is no data change, and then, that would perhaps apply to
all products. Would the fastest be the native VMDK solution that doesn’t need
conversion? It is a moot point as there is very little chance of zero data
change in reality.
Reality means large volumes of data, reality means unpredictable
scenarios, reality means communication ups and downs, and unexpected spikes in
utilization. Just like your cloud choices, it’s the right DR solution for the
workload, that is what is important. Remember
the key levers:
- Compute resources: Journaling will probably require more compute than deltas but this is dependent on consolidation schedules. Keeping smaller journals and less deltas will minimize compute overhead when comparison operations are required for restore.
- Storage: Again, the
more granularity, the more storage needed. Be sure to assign the correct level
of granularity to the VM criticality.
- Network: Journaling,
replicating off site consistently will need to have assigned bandwidth reserved
so that replication is not affected. Local replication will minimize WAN costs
but will not provide regional protection.
much can you afford to lose? Nothing, then journaling is your answer, 5 min or
more, then lightweight deltas will be a suitable solution. Again, be sure to
assign the correct level of granularity to the VM criticality.
- Time to
recover RTO: For Deltas, this will be dependent on data drift and the number
of deltas between the last one and the last consolidated vmdk. For journaling
this will again be dependent on the amount of journal data that needs to be
consolidated and converted to vmdk format.
As with many
technical solutions, there is no right and wrong about journaling or deltas,
these are just the typical DR replication types. There is however a cost
consideration, if you need the granularity of journaling, then there is likely
to be higher costs in storage and critically networking.