Types of Analysis for Synchronization

 

 

 

 

 

 

 

 

 

 

 

 

 

© 1998-2005, Mobiliti, Inc.
2025 Lincoln Highway, Suite 322
Edison, NJ 08817
Tel: (732) 248-8300 Fax: (732) 248-8060
E-mail: info@mobiliti.com
www.mobiliti.com

 

 

© 1998 - 2005 Mobiliti, Inc., Edison New Jersey. All rights reserved.

Network/Unplugged, the Mobiliti logo, and any other images associated with the software are trademarks of Mobiliti, Inc., Edison, New Jersey.

Microsoft Windows, Windows 95, Windows 98, Windows NT, Windows 2000, Windows XP and MS-DOS are registered trademarks of Microsoft Corporation in the United States and/or other countries. Other brands and product names are trademarks and registered trademarks of their respective holders. Patent pending.


TABLE OF CONTENTS

 

1      Abstract

2      Glossary

3      Why do analysis?

4      Traversal v/s change log

5      Single processor v/s Multi processor

Conclusion

 


 

 

1         Abstract

Synchronization is the process of making two or more locations identical to each other. When synchronizing, we can either copy one side’s data onto the other side in full, or intelligently determine which data has been updated and transmit only the changes. The second method is much more efficient in most cases. The process of finding what has changed is referred to in the rest of this document as the analysis process. No single analysis process works best under all circumstances. This document describes some common analysis methods for the case where the unit of synchronization is files and folders. Once the changed files and folders are determined, a number of differencing methods can synchronize the two sides by transmitting only the differences between them. This document does not deal with these differencing methods.

2         Glossary

Source Computer (Source): Source computer is the computer on which the source file resides in a backup scenario. It can also be the side from which updates move to the other side in a multi-directional synchronization.

Destination Computer (Destination): Destination computer is the computer on which the destination file resides in a backup scenario. It can also be the side to which updates move in a multi-directional synchronization.

Synchronization: Synchronization is the process of overwriting the older version of the source or destination file with the latest version.

Source Network: Source network is the network in which the source computer resides.

Destination Network: Destination Network is the network in which the destination computer resides.

Link: Link is the type of connection between the source and destination computers. It can be a LAN, a WAN or a wireless connection.

Source Process: Source process is the process running on the source computer that backs up or synchronizes the source file to the destination file.

Destination Process: Destination process is the process running on the destination computer that helps in backing up or synchronizing the source file and the destination file. The destination process is optional for many differencing techniques.

Transfer unit: Although this document assumes that files are synchronized between the two sides, the units to be synchronized may instead be folders, or a proprietary directory structure containing proprietary information. “Transfer unit” is therefore the more generic term.

NF: Number of files present in the two locations that are to be synchronized.

NC: Number of files that are changed.

 

 

 


 

3         Why do analysis?

When two locations are to be synchronized, or one is to be backed up to the other, the data synchronization may happen either in real time or with a lag. Analysis is a more significant step in the second case.

Ta = Time for analysis.

Tc = Time for transferring the changes, as determined by the analysis.

Ts = Time taken to transfer all the information from location A to location B in order to keep the two sides synchronized, without doing any analysis. In this case, the whole of the data is transferred.

Analysis makes sense whenever Ta + Tc < Ts. This is the case more often than not. Additionally, analysis is a powerful feature in its own right, as it helps with synchronization preview, conflict handling, change prioritization, and space and sanity checks.
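The condition above can be checked with simple arithmetic. The following Python sketch is illustrative only: the function name analysis_pays_off and the sample timings are assumptions made for this example, not measurements.

```python
def analysis_pays_off(ta: float, tc: float, ts: float) -> bool:
    """Return True when Ta + Tc < Ts, i.e. analysis beats a full transfer."""
    return ta + tc < ts


# Hypothetical timings in seconds, chosen only for illustration.
print(analysis_pays_off(ta=30.0, tc=60.0, ts=600.0))   # True: few changes
print(analysis_pays_off(ta=30.0, tc=590.0, ts=600.0))  # False: almost everything changed
```

The second call shows the less common case: when nearly all the data has changed, the analysis time is pure overhead.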

 

As mentioned earlier, there is no single way of doing analysis. A few different methods are explored below.

 


 

4         Traversal v/s change log

The traversal process recursively scans the locations to be synchronized, reading some attribute(s) of every file or transfer unit that needs to be synchronized. Examples of these attributes are the archive attribute, time stamp, checksum and file size. In a simplified model, we can assume that the time taken to get the given attribute for a given transfer unit is a constant, CT (the same for all transfer units or files). If TN is the total time to analyze, then

                         TN = CT * NF
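A traversal-based analysis might be sketched as follows in Python. This is a minimal illustration, not the method of any particular product: the function names traverse and changed_files are hypothetical, and comparing modification time and size is an assumption (the archive attribute or a checksum could be used instead).

```python
import os


def traverse(root: str) -> dict:
    """Recursively scan root and map each relative file path to (mtime, size)."""
    snapshot = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full_path = os.path.join(dirpath, name)
            info = os.stat(full_path)  # one attribute lookup per file: cost CT
            rel_path = os.path.relpath(full_path, root)
            snapshot[rel_path] = (info.st_mtime, info.st_size)
    return snapshot


def changed_files(source: dict, destination: dict) -> list:
    """Paths whose attributes differ, or that exist on one side only."""
    all_paths = set(source) | set(destination)
    return sorted(p for p in all_paths if source.get(p) != destination.get(p))
```

Every file is visited once on each side regardless of how many files changed, which is why the cost of this method grows with NF rather than with NC.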

The change log analysis dynamically records the names of files in a log file as and when changes happen. During actual synchronization, this log file is consulted to determine the list of files or transfer units that have changed. If TR is the time taken to analyze using this method and, in a simplified model, the time taken to get a single file from the monitor list is a constant CR, then

                              TR = CR * NC

The change log needs to take care of the following:

1.      Ignoring temporary files

2.      Ignoring multiple records for multiple changes to the same file.

So the above equation can be modified to:

                               TR = CR * NC + (time for the optional removal of duplicate and temporary file records)

If we ensure at recording time itself that no duplicates or temporary files are stored, the equation reduces to its original form:

                               TR = CR * NC
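The two requirements on the change log, handled at recording time, might look like the following Python sketch. The class name ChangeLog, its methods, and the list of temporary-file suffixes are all hypothetical and chosen for illustration; a real monitor would receive its paths from file system notifications, which are not shown here.

```python
# Hypothetical suffixes treated as temporary files (an assumption of this sketch).
TEMP_SUFFIXES = (".tmp", ".swp", "~")


class ChangeLog:
    """Records changed paths, skipping temporary files and duplicate entries."""

    def __init__(self):
        # A dict keeps insertion order and stores each path only once.
        self._changed = {}

    def record(self, path: str) -> None:
        """Note that path has changed, unless it looks like a temporary file."""
        if path.lower().endswith(TEMP_SUFFIXES):
            return
        self._changed[path] = True

    def changed_paths(self) -> list:
        """Return the changed paths in the order they were first recorded."""
        return list(self._changed)


log = ChangeLog()
for changed in ["report.doc", "report.doc", "~scratch.tmp", "notes.txt"]:
    log.record(changed)
print(log.changed_paths())  # ['report.doc', 'notes.txt']
```

Because filtering happens inside record, reading the log at synchronization time touches only NC entries and no separate clean-up pass is needed.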

The ratio between the time taken by traversal and by change log analysis is given by:

                       TN / TR = (CT * NF) / (CR * NC)

 

 

 

So if the number of changed files (NC) is much smaller than the total number of files to be analyzed (NF), and the per-file costs CT and CR are comparable, change log analysis works out better than the traversal mechanism. The advantage of change log analysis shrinks as the ratio NC / NF increases. The traversal mechanism, on the other hand, scores on ease of implementation and robustness.
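The ratio can be explored numerically. In the Python sketch below, the function name time_ratio and the per-file costs are assumptions picked only to make the arithmetic visible; they are not measured values.

```python
def time_ratio(ct: float, nf: int, cr: float, nc: int) -> float:
    """Return TN / TR = (CT * NF) / (CR * NC)."""
    return (ct * nf) / (cr * nc)


# Hypothetical per-file costs in seconds: a log lookup is assumed here to cost
# twice as much as reading one attribute during traversal.
CT = 0.001
CR = 0.002
NF = 100_000

for nc in (100, 10_000, 50_000, 100_000):
    ratio = time_ratio(CT, NF, CR, nc)
    better = "change log" if ratio > 1 else "traversal or tie"
    print(f"NC={nc:>7}  TN/TR={ratio:.1f}  faster: {better}")
```

A ratio above 1 means the change log is faster. Under these assumed costs the two methods break even when NC / NF equals CT / CR, that is, at NC = 50,000.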

If we know that changes are happening on only one side, either of these methods can be made faster by skipping analysis of the locations that are not changing.

 

5         Single processor v/s Multi processor  

 The methods of analysis may also be classified by how the execution of the analysis code is distributed. For our discussion, we can generalize them into two cases:

                             1. Single processor based analysis

                             2. Multi processor based analysis

In the single processor method, the analysis process runs on only one processor. As a result, if the destination (or other side) is in a different network location or on a different machine, the link speed between the two sides becomes an important factor.

If NF is the total number of files present on both sides, then on average we can expect NF / 2 files on each side (in the case of backup, the source and the destination).

Therefore the time taken for analysis on average is

   TN1 = CL * NF / 2 + CNet * NF / 2

where

  CL = Local access time to get the traversal attribute of a transfer unit or file

  CNet = Network access time to get the traversal attribute of a transfer unit or file

 

In multi processor analysis, for optimal performance, the processes responsible for computing the analysis are present on both sides of the network. As in the earlier case, if NF is the number of files on both sides of the network, then on average there would be NF / 2 files on each side.

Therefore the time taken for analysis would be:

   TN2 = Larger of (CL * NF / 2, CNet1 * NF / 2)

where

  CL = Local access time to get the traversal attribute of a transfer unit or file

  CNet1 = Network access time to get the traversal attribute of a transfer unit or file from a processor in the same network

 

Since CNet1 < CNet, and the second case takes not the sum but the larger of the two component analysis times, multi processor analysis is expected to perform better than single processor analysis.
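The two cost models can be compared side by side. The Python sketch below simply evaluates the formulas for TN1 and TN2; the function names and the access times are hypothetical values assumed for illustration.

```python
def single_processor_time(cl: float, cnet: float, nf: int) -> float:
    """TN1 = CL * NF / 2 + CNet * NF / 2 (local and remote scans run in sequence)."""
    return cl * nf / 2 + cnet * nf / 2


def multi_processor_time(cl: float, cnet1: float, nf: int) -> float:
    """TN2 = larger of (CL * NF / 2, CNet1 * NF / 2) (both sides scan in parallel)."""
    return max(cl * nf / 2, cnet1 * nf / 2)


# Hypothetical per-file access times in seconds, with CNet1 < CNet.
CL = 0.001
CNET = 0.010
CNET1 = 0.002
NF = 100_000

tn1 = single_processor_time(CL, CNET, NF)
tn2 = multi_processor_time(CL, CNET1, NF)
print(f"TN1 = {tn1:.1f} s, TN2 = {tn2:.1f} s")
```

With these assumed values the single processor estimate is dominated by the slow cross-network lookups, while the multi processor estimate is bounded by the slower of the two parallel scans.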

 

Conclusion

In summary, when synchronizing or backing up data, analysis followed by synchronization is much more likely to yield better results than synchronizing all the data without analysis. Traversal based methods compare better with change log methods when the ratio of changed transfer units to the total number of transfer units is high. It is sometimes advantageous to use multiple processors for faster analysis.