Types of Analysis for Synchronization
© 1998-2005, Mobiliti, Inc.
Tel: (732) 248-8300 Fax: (732) 248-8060
E-mail: info@mobiliti.com
Synchronization is the process of making two or more locations identical to each other. When synchronizing, we can either dump one side’s data onto the other side, or intelligently determine which data has been updated and transmit only the changed data. The second method is much more efficient in most cases. The process of finding what has changed is referred to in the rest of this document as the analysis process. There is no standard analysis process that works best under all circumstances. This document describes some common analysis methods for the case where the unit of synchronization is files and folders. Once the changed files and folders are determined, several differencing methods exist that synchronize the two sides by transmitting only the differences between them. This document does not deal with these differencing methods.
Source Computer (Source): The source computer is the computer on which the source file resides in a backup scenario. It can also be the side from which updates move to the other side in a multi-direction synchronization.
Destination Computer (Destination): The destination computer is the computer on which the destination file resides in a backup scenario. It can also be the side to which updates move in a multi-direction synchronization.
Synchronization: Synchronization is the process of overwriting the older version of the source or destination file with the latest version.
Source Network: The source network is the network in which the source computer resides.
Destination Network: The destination network is the network in which the destination computer resides.
Link: The link is the type of connection between the source and destination computers. It can be a LAN, WAN or wireless.
Source Process: The source process is the process running on the source computer which backs up or synchronizes the source file to the destination file.
Destination Process: The destination process is the process running on the destination computer which helps in backing up or synchronizing the source file and the destination file. The destination process is optional for many differencing techniques.
Transfer unit: Although it is assumed that files are to be synchronized on both sides, the units to be synchronized may instead be folders or a proprietary directory structure containing proprietary information. The term “transfer unit” is therefore more generic.
$N_F$: Number of files present in the two locations that are to be synchronized.
$N_C$: Number of files that have changed.
When two locations are to be synchronized, or one is to be backed up to the other, the data synchronization may happen either in real time or after a lag. Analysis is a more significant step in the second case.
$T_a$ = Time taken for analysis.
$T_c$ = Time taken to transfer the changes, as determined by the analysis.
$T_s$ = Time taken to transfer all the information from location A to location B in order to keep the two sides synchronized, without doing any analysis. In this case the whole of the data is transferred.
Analysis makes sense whenever $T_a + T_c < T_s$. In practice this is the case more often than not. Additionally, analysis is a powerful feature in its own right, as it helps with synchronization preview, conflict handling, change priority, and space and sanity checks.
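
The condition above can be written as a one-line check. The following is only a sketch: the function name and the timings are hypothetical values chosen for illustration, not measurements.

```python
def analysis_pays_off(t_a: float, t_c: float, t_s: float) -> bool:
    """Return True when analysing first and sending only the changes
    (t_a + t_c) is faster than sending everything (t_s)."""
    return t_a + t_c < t_s


# Hypothetical timings in seconds, for illustration only.
print(analysis_pays_off(t_a=30.0, t_c=60.0, t_s=600.0))   # True
print(analysis_pays_off(t_a=30.0, t_c=590.0, t_s=600.0))  # False
```

The second call shows the case where nearly all of the data has changed, so the analysis time is pure overhead.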
As mentioned earlier, there is no one way of doing analysis. A few different methods are explored below.
The traversal process recursively scans the locations to be synchronized and reads one or more attributes of every file or transfer unit that needs to be synchronized. Examples of these attributes are the archive attribute, time stamp, checksum and file size. In a simplistic situation, we can assume that the time taken to get the given attribute for a given transfer unit is a constant $C_T$ (the same for all transfer units or files). If $T_N$ is the total time to analyze, then
$T_N = C_T \cdot N_F$
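
A minimal sketch of traversal-based analysis follows. It assumes that the compared attributes are the modification time and the file size, and that both locations are reachable as local directory paths. The names `take_snapshot` and `changed_units` are illustrative and do not come from any product.

```python
import os
from typing import Dict, Set, Tuple

# Relative path -> (modification time, size in bytes)
Snapshot = Dict[str, Tuple[float, int]]


def take_snapshot(root: str) -> Snapshot:
    """Recursively record (mtime, size) for every file under root."""
    snapshot: Snapshot = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            info = os.stat(full)
            snapshot[os.path.relpath(full, root)] = (info.st_mtime, info.st_size)
    return snapshot


def changed_units(source: Snapshot, destination: Snapshot) -> Set[str]:
    """Paths missing on the destination or whose attributes differ."""
    return {path for path, attrs in source.items()
            if destination.get(path) != attrs}
```

Every file is visited once per scan, which is why the cost grows with $N_F$ regardless of how few files have actually changed.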
The change log analysis dynamically records the names of files in a log file as and when changes happen. During actual synchronization, this log file is consulted to determine the list of files or transfer units that have changed. If $T_R$ is the time taken to analyze using this method and, in a simplistic scenario, we assume that the time taken to get a single file from the change log is a constant $C_R$, then
$T_R = C_R \cdot N_C$
The change log needs to take care of the following:
1. Ignoring temporary files.
2. Ignoring multiple records for multiple changes to the same file.
So the above equation can be modified to:
$T_R = C_R \cdot N_C + \text{(time for the optional removal of duplicate and temporary files)}$
If we ensure at recording time itself that no duplicates or temporary files are stored, then the above equation becomes
$T_R = C_R \cdot N_C$
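
The recording-time filtering described above can be sketched as follows. This is an in-memory illustration only: the `ChangeLog` class and the list of temporary-file suffixes are assumptions, and a real change log would be persisted and fed by a file system monitor.

```python
from typing import Dict, List

# Assumed patterns for temporary files; adjust for the environment.
TEMP_SUFFIXES = (".tmp", ".swp", "~")


class ChangeLog:
    """Keeps one record per changed path and skips temporary files."""

    def __init__(self) -> None:
        # A dict preserves insertion order and rejects duplicate keys.
        self._entries: Dict[str, None] = {}

    def record(self, path: str) -> None:
        """Note a change; temporary files and repeats are ignored."""
        if path.endswith(TEMP_SUFFIXES):
            return
        self._entries[path] = None

    def changed_units(self) -> List[str]:
        """Return the changed paths in the order first recorded."""
        return list(self._entries)


log = ChangeLog()
for changed in ["a.doc", "~x.tmp", "a.doc", "b.xls"]:
    log.record(changed)
print(log.changed_units())  # ['a.doc', 'b.xls']
```

Because duplicates and temporary files never enter the log, reading it at synchronization time costs one lookup per changed unit, which matches the simpler form of the equation.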
The ratio between the time taken by traversal and by change log analysis is given by:
$\frac{T_N}{T_R} = \frac{C_T \cdot N_F}{C_R \cdot N_C}$
So if the number of changed files ($N_C$) is small compared to the total number of files to be analyzed ($N_F$), and the per-unit costs $C_T$ and $C_R$ are comparable, change log analysis works out better than the traversal mechanism. Its advantage shrinks as the fraction of changed files, $N_C / N_F$, increases. The traversal mechanism, on the other hand, scores on ease of implementation and robustness.
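
To make the ratio concrete, the sketch below evaluates it for hypothetical inputs. The per-unit costs (in milliseconds) and the file counts are invented for illustration and are not measurements.

```python
def traversal_to_change_log_ratio(c_t: float, n_f: int,
                                  c_r: float, n_c: int) -> float:
    """T_N / T_R = (C_T * N_F) / (C_R * N_C)."""
    return (c_t * n_f) / (c_r * n_c)


# Hypothetical: 1 ms per unit for both methods, 100,000 files in total.
print(traversal_to_change_log_ratio(1, 100_000, 1, 500))      # 200.0
print(traversal_to_change_log_ratio(1, 100_000, 1, 100_000))  # 1.0
```

With 500 changed files the traversal takes 200 times longer than the change log lookup. When every file has changed, the two methods cost the same under these assumptions.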
If we
know that changes are happening on only one side, either of these methods can
be improved to achieve better performance by ignoring analysis for the
locations that are not changing.
The methods of analysis may also be classified by how the execution of the analysis code is distributed. For our discussion, we can generalize them into two cases:
1. Single processor based analysis.
2. Multi processor based analysis.
In the single processor method, the analysis process runs on only one processor. As a result, if the destination (or other side) is in a different network location or on a different machine, the link speed between the two sides becomes an important factor.
If $N_F$ is the total number of files present on both sides, then on average there will be about $N_F / 2$ files on either side (in the case of backup, on the source and on the destination).
Therefore the time taken for analysis is, on average,
$T_{N1} = C_L \cdot N_F / 2 + C_{Net} \cdot N_F / 2$
where
$C_L$ = local access time to get the traversal attribute of a transfer unit or file
$C_{Net}$ = network access time to get the traversal attribute of a transfer unit or file
In multi processor analysis, for optimal performance, the processes responsible for the analysis are present on both sides of the network. As in the earlier case, if $N_F$ is the number of files on both sides of the network, there would on average be $N_F / 2$ files on either side.
Therefore the time taken for analysis would be:
$T_{N2} = \max(C_L \cdot N_F / 2,\ C_{Net1} \cdot N_F / 2)$
where
$C_L$ = local access time to get the traversal attribute of a transfer unit or file
$C_{Net1}$ = network access time to get the traversal attribute of a transfer unit or file from a processor in the same network
Since $C_{Net1} < C_{Net}$, and the total in the second case is not the sum but the larger of the two component analysis times, multi processor analysis is expected to perform better than single processor analysis.
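
The two estimates can be compared side by side. The sketch below follows the two formulas directly; the access times (in milliseconds) and the file count are hypothetical and serve only to show the shape of the comparison.

```python
def single_processor_time(c_l: float, c_net: float, n_f: int) -> float:
    """T_N1: one process reads half the units locally and half over the link."""
    return c_l * n_f / 2 + c_net * n_f / 2


def multi_processor_time(c_l: float, c_net1: float, n_f: int) -> float:
    """T_N2: both sides scan in parallel, so the slower side sets the total."""
    return max(c_l * n_f / 2, c_net1 * n_f / 2)


# Hypothetical per-unit access times in milliseconds, 10,000 files in total.
print(single_processor_time(c_l=1, c_net=20, n_f=10_000))  # 105000.0
print(multi_processor_time(c_l=1, c_net1=2, n_f=10_000))   # 10000.0
```

Under these assumed figures the multi processor estimate is roughly a tenth of the single processor one, because the slow link no longer sits on the path of every remote attribute read.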
In summary, when synchronizing or backing up data, analysis followed by synchronization is much more likely to yield better results than synchronizing all the data without analysis. Traversal based methods work better relative to change log methods when the ratio of changed transfer units to the total number of transfer units is high. It is sometimes advantageous to use multiple processors for faster analysis.