Filegear Replication and Deduplication

Introduction

Filegear supports automatic replication and deduplication of your files. When these features are enabled, Filegear finds each unique file and makes sure there is "exactly one copy on any one drive" and "at least n copies of each file across n drives". Together, these settings let you mirror exactly n copies of each file across your attached drives, so you're not wasting space and you still have copies in case of drive failure.

 

Physical vs. Logical Files

When Filegear indexes files from connected hard drives, NAS drives, and cloud drives, it creates a logical entry for each file it finds by extracting the name, size, timestamps, locations, and other metadata. These logical entries are what you see when navigating the Filegear interface. Behind the scenes, they point to the actual physical files on disk. Separating the "Physical" and "Logical" views lets us deliver many features that would be hard to build on a single "Physical" view: for example, searching over all the metadata, creating multiple albums from the same files, and creating stable sharing links that continue to work even if the file is moved.

When we talk about Replication and Deduplication features, it's important to note which kind of files we're talking about: "Physical" or "Logical".

Replication and deduplication features currently available in Filegear perform these functions on physical files, not on logical files. That means that even after deduplication completes in the background, there may still be duplicate file entries showing in the UI because you are looking at logical files, not physical ones.
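To make the distinction concrete, here is a minimal Python sketch of how logical entries might point to physical files. The class names, fields, and sample paths are hypothetical and chosen only for illustration; they are not Filegear's actual data model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PhysicalFile:
    """A file as it exists on a drive (hypothetical model)."""
    drive: str
    path: str
    content_hash: str


@dataclass
class LogicalEntry:
    """An entry shown in the interface, pointing at a physical file."""
    name: str
    folder: str
    physical: PhysicalFile


def logical_duplicates(entries):
    """Group logical entries whose physical files share a content hash."""
    groups = {}
    for entry in entries:
        groups.setdefault(entry.physical.content_hash, []).append(entry)
    return {h: group for h, group in groups.items() if len(group) > 1}


# After physical deduplication, both "cat.jpg" entries resolve to the same
# file on disk, yet they still appear as two entries in the interface.
kept = PhysicalFile(drive="Drive 1", path="/photos/cat.jpg", content_hash="abc123")
entries = [
    LogicalEntry(name="cat.jpg", folder="Photos", physical=kept),
    LogicalEntry(name="cat.jpg", folder="Backup/Photos", physical=kept),
    LogicalEntry(name="dog.jpg", folder="Photos",
                 physical=PhysicalFile("Drive 1", "/photos/dog.jpg", "def456")),
]
```

In this sketch, `logical_duplicates(entries)` reports the two "cat.jpg" entries as a group, which is the kind of leftover duplicate that physical deduplication alone does not remove from the interface.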

Filegear will soon be releasing additional tools to help manage logical duplicates as well. Stay tuned for future product updates to enjoy these features.

How Physical Deduplication and Replication Works

Imagine that you have two physical drives: Drive 1 and Drive 2. These two drives have the following content:

    Filegear:
        Drive 1:
            File A (copy 1)
            File A (copy 2)
            File B
        Drive 2:
            File A
            File C (copy 1)
            File C (copy 2)

When you enable "Physical Deduplication", we get the following result:

    Filegear:
        Drive 1:
            File A
            File B
        Drive 2:
            File A
            File C

When you enable "Physical Replication with 2 Copies", we get the following result (a "+" marks a newly created copy):

    Filegear:
        Drive 1:
            File A (copy 1)
            File A (copy 2)
            File B
            + File C
        Drive 2:
            File A
            + File B
            File C (copy 1) 
            File C (copy 2)
 

So finally, after enabling both "Physical Deduplication" and "Physical Replication with 2 Copies", we get the following result:

    Filegear:
        Drive 1:
            File A
            File B
            File C
        Drive 2:
            File A
            File B
            File C

 

Notice that the two drives are now mirror images of each other (in terms of their content).

How Deduplication Works

Duplicate files are files whose contents are identical, byte for byte. When Filegear looks for duplicates, it first searches for files that have the same hash. A hash is a cryptographic signature calculated from a file's contents by an algorithm that makes it extremely unlikely for two files to produce the same value if even one bit is different.
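As an illustration of hashing a file's contents, here is a short Python sketch. It assumes SHA-256 as the algorithm; the article does not say which hash Filegear uses, and the function name `file_hash` is made up for this example.

```python
import hashlib


def file_hash(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file's contents.

    SHA-256 is an assumption for this sketch. The file is read in chunks
    so that large files do not have to fit in memory.
    """
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

Two files with identical bytes always produce the same digest, so matching digests identify candidate duplicates that can then be compared directly.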

The deduplication process works like this: 

  1. For each unique HASH value, find all files with that hash on one drive
  2. For each duplicate file, double-check that the contents are the same
  3. Choose one file to keep (we choose the one with the shortest path)
  4. For all other files, move the files to a holding area for cleanup later
  5. For each moved file, create a small text file in its place that points to the file chosen in step 3.

Note: the files created in step 5 will have the same name as the original file, followed by the suffix ".dedup.txt".
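The five steps above can be sketched in Python as follows. This is a simplified illustration, not Filegear's implementation: the SHA-256 hash, the function names, the numbering of files in the holding area, and the text written into each ".dedup.txt" file are all assumptions. The holding directory is assumed to live outside the drive being scanned.

```python
import filecmp
import hashlib
import shutil
from collections import defaultdict
from pathlib import Path


def sha256_of(path):
    """Hash a file's contents in chunks (SHA-256 is an assumption)."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def dedupe_drive(drive_root, holding_dir):
    """Deduplicate one drive; return the original paths that were moved."""
    drive_root, holding_dir = Path(drive_root), Path(holding_dir)
    holding_dir.mkdir(parents=True, exist_ok=True)

    # Step 1: group the files on this drive by hash.
    by_hash = defaultdict(list)
    for path in sorted(drive_root.rglob("*")):
        if path.is_file() and not path.name.endswith(".dedup.txt"):
            by_hash[sha256_of(path)].append(path)

    moved = []
    for paths in by_hash.values():
        # Step 3: keep the file with the shortest path.
        keeper = min(paths, key=lambda p: (len(str(p)), str(p)))
        for path in paths:
            if path == keeper:
                continue
            # Step 2: double-check the contents byte for byte.
            if not filecmp.cmp(keeper, path, shallow=False):
                continue
            # Step 4: move the duplicate to the holding area.
            target = holding_dir / f"{len(moved)}_{path.name}"
            shutil.move(str(path), str(target))
            # Step 5: leave a small pointer file in its place.
            pointer = path.with_name(path.name + ".dedup.txt")
            pointer.write_text(f"Duplicate of: {keeper}\n")
            moved.append(path)
    return moved
```

Moving duplicates to a holding area instead of deleting them right away means a mistaken match can still be undone before cleanup runs.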

How Replication Works

Replication works by copying each unique file to other drives until the requested number of copies, n, exists across your drives.

The replication process works like this: 

  1. For each unique HASH value, find all files with that hash on one drive
  2. Check to see if there are at least "n - 1" copies of a file with the same HASH on other drives.
  3. If insufficient copies exist, then try to find a drive without that file and with sufficient space to make the extra copy.
  4. If another drive is found, make a copy of the file on that drive.
  5. Repeat until a sufficient number of copies exist, or no more drives are available.
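The replication steps above can be modeled with a small in-memory Python sketch. The data layout (a dictionary of drives, each with a free-space counter and a map from hash to file size) and the rule of preferring the drive with the most free space are assumptions made for illustration; the article does not say how Filegear picks a target drive.

```python
def replicate(drives, copies):
    """Plan and apply replication on an in-memory model of the drives.

    drives: dict of drive name -> {"free": int, "files": {hash: size}}
    copies: desired number of drives holding each file (the "n" above)
    Returns a list of (hash, source_drive, target_drive) actions.
    """
    actions = []
    for source in sorted(drives):
        # Step 1: walk each unique hash found on this drive.
        for file_hash, size in sorted(drives[source]["files"].items()):
            # Step 2: count the drives that already hold this hash.
            holders = [n for n in drives if file_hash in drives[n]["files"]]
            # Step 5: repeat until enough copies exist or no drive fits.
            while len(holders) < copies:
                # Step 3: find drives without the file and with enough space.
                candidates = [
                    n for n in sorted(drives)
                    if n not in holders and drives[n]["free"] >= size
                ]
                if not candidates:
                    break
                # Assumed policy: prefer the drive with the most free space.
                target = max(candidates, key=lambda n: drives[n]["free"])
                # Step 4: make the copy on that drive.
                drives[target]["files"][file_hash] = size
                drives[target]["free"] -= size
                holders.append(target)
                actions.append((file_hash, source, target))
    return actions


# The two-drive example from earlier, after deduplication.
drives = {
    "Drive 1": {"free": 100, "files": {"A": 10, "B": 10}},
    "Drive 2": {"free": 100, "files": {"A": 10, "C": 10}},
}
plan = replicate(drives, copies=2)
```

With two copies requested, the plan copies File B to Drive 2 and File C to Drive 1, leaving both drives holding A, B, and C.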

What's Next

We're not done yet. Physical deduplication and replication are great behind the scenes, but there are still pesky duplicate files in the user interface.

Filegear has already released a feature called "Merge Folders". This feature allows you to take two folders and merge all the content into one folder, removing duplicates in the process. This is really handy for those cases when you have two folders on two separate disk drives that got imported, leaving you with two nearly identical entries in the UI. The best way to experience this feature is to go to Photos / Albums in the web interface. When you see two folders with the same name, select the two folders, right-click, then select "Merge Folders".

We're also working on a "global deduplicate" feature which will go through your entire collection and provide a list of files that can be deduplicated. You'll have the final say on whether to continue the operation.

Conclusion

Filegear Physical Deduplication and Replication work together to help you manage the physical files located on your hard drives, NAS drives, and cloud drives. With Filegear you can eliminate wasted space by removing duplicate copies on a single drive, while also protecting yourself in case any one drive or service becomes unavailable.

1 comment

Jan 12, 2019 • Posted by Dave Haynie

I’d like to see automatic cloud replication. I have recently set up two Filegear devices and two drives at different physical locations. Right now these show up as two separate drives, but there would be a huge value in allowing the storage across both devices (or at least some of it) to exist as a single logical volume. That would mean redundant access to my media, more reliable synchronization from synchronized devices, and when Godzilla attacks, he’d have to stop two separate locations to wipe out all my data.
