What's the Best Way to Deduplicate & Organize Files/Folders on a 200 TB NAS?

HelloWill

We have a large FreeNAS server that is loaded with files. I am looking for advice on the best way to get things cleaned up, and I know there's tons of duplicates.

File Types:

Images
Text
Videos

File Counts:

10,000,000+ Files
200+ TB

I've tried several duplicate scanners, but none has worked well: they crash when their logs get too big, it's hard to get context on the results, and a scan takes days even without checksums (checksumming the files with MD5 takes a really long time). To top it off, they only run on one PC, so I can't even enlist the rest of the team to help clean up.

I need a way to easily scan files, identify duplicates, and ideally save scan results and checksums so that we don't need to keep re-scanning the same files again and again. I like Beyond Compare, but it only helps after the duplicates have been identified.

What do you guys do to scan this much data and make sense of it / organize it?

NashBrydges

Saw your post on the last evening.

Issue #1: using FreeNAS at all for production storage
Issue #2: using FreeNAS for such a LARGE production storage

How have you been running the dedupe scanners? Via a PC connected to the shares on the FreeNAS server? I'm not familiar enough with FreeBSD to know if there are commands that can be run from shell to check for dupes.

scottalanmiller

I've never used Duff, but it should run there. But the scale might be problematic.

scottalanmiller

At that size, there is no simple way to handle this. The file comparison workload is enormous. You need a checksum for each of over 10,000,000 files, which alone is no small task, and then you need to compare every file to every other file; done naively, that's on the order of 1x10^14 MD5 comparisons.

If you can find any ways to limit these comparisons, that might help. But the number of them is so high that probably no normal tools will tackle it.
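One common way to limit the comparisons, sketched here as an assumption rather than something tested on this NAS, is to group files by size first: two files with different byte counts cannot be identical, so only groups with two or more members ever need a checksum. The paths and sizes below are invented for illustration. The sketch also counts the distinct pairs in a naive all-against-all comparison of 10,000,000 files, which lands in the same ballpark as the figure above.

```python
from collections import defaultdict
from math import comb


def candidate_groups(sizes_by_path):
    """Group paths by byte size; only groups with 2+ members can hold duplicates."""
    by_size = defaultdict(list)
    for path, size in sizes_by_path.items():
        by_size[size].append(path)
    return [sorted(paths) for paths in by_size.values() if len(paths) > 1]


def pairwise_comparisons(n):
    """Number of distinct pairs among n items: n * (n - 1) / 2."""
    return comb(n, 2)


# Hypothetical sample: these paths and sizes are made up for the example.
sample = {
    "/mnt/tank/a.jpg": 1024,
    "/mnt/tank/b.jpg": 1024,
    "/mnt/tank/c.txt": 17,
    "/mnt/tank/d.mov": 900_000,
}

groups = candidate_groups(sample)
print(groups)  # only a.jpg and b.jpg share a size, so only they need hashing
print(pairwise_comparisons(10_000_000))  # distinct pairs in a naive comparison
```

Collecting sizes only needs file metadata, so this pass is far cheaper than reading 200 TB to checksum everything.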

scottalanmiller

What might work is making a database (stored elsewhere) that holds all of the MD5s and sorts them alphabetically. Then you only need to use the database's own duplicate checking and/or compare each value against its neighbours. Then you'll know where duplicates are possible.
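A minimal sketch of that database idea, assuming SQLite and a made-up `files` table (the table name, columns, and sample paths are all hypothetical). An index on the checksum column lets the database do the sorting, and a `GROUP BY` finds checksums that occur more than once, so no file is compared against every other file.

```python
import hashlib
import sqlite3

# In-memory for the example; a real run would use a database file kept off the NAS.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE files (path TEXT PRIMARY KEY, size INTEGER, md5 TEXT)")
conn.execute("CREATE INDEX idx_files_md5 ON files (md5)")

# Hypothetical sample content standing in for real files.
sample = {
    "/mnt/tank/a.jpg": b"same bytes",
    "/mnt/tank/copy/a.jpg": b"same bytes",
    "/mnt/tank/notes.txt": b"different bytes",
}
rows = [(path, len(data), hashlib.md5(data).hexdigest()) for path, data in sample.items()]
conn.executemany("INSERT INTO files VALUES (?, ?, ?)", rows)


def duplicate_sets(connection):
    """Return {md5: [paths]} for every checksum stored more than once."""
    query = """
        SELECT md5, path FROM files
        WHERE md5 IN (SELECT md5 FROM files GROUP BY md5 HAVING COUNT(*) > 1)
        ORDER BY md5, path
    """
    found = {}
    for md5, path in connection.execute(query):
        found.setdefault(md5, []).append(path)
    return found


dupes = duplicate_sets(conn)
for md5, paths in dupes.items():
    print(md5, paths)
```

A matching MD5 only marks a likely duplicate; a byte-for-byte check on each set before deleting anything is the cautious follow-up.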

StrongBad

Wow that is a lot of files. That's going to take forever.

You need to run this from a PC, not from the server? Is this a SAN then, not a NAS?

HelloWill

I can run any software from either a workstation or the server. However, running things directly on FreeNAS makes me nervous because I'm not sure how it will react.

The files are shared as a NAS, although we could connect via iSCSI or similar.

scottalanmiller

@hellowill said in What's the Best Way to Deduplicate & Organize Files/Folders on a 200 TB NAS?:

I can run any software from either a workstation or the server. However, running things directly on FreeNAS makes me nervous because I'm not sure how it will react.

The files are shared as a NAS, although we could connect via iSCSI or similar.

That would corrupt the data. If it is shared as a NAS, then you need to run everything from the server. That rules out iSCSI. iSCSI would corrupt or just delete all of your data, since it would need to format the space as a new drive before mounting it.

dbeato

I am sure you have checked ZFS deduplication, correct? http://www.freenas.org/blog/freenas-worst-practices/

The way this is set up, the data should be spread out rather than sitting on just one NAS device.

scottalanmiller

@dbeato would need 256GB of RAM to attempt that with ZFS. That's a lot of RAM on a NAS.

dbeato

@scottalanmiller said in What's the Best Way to Deduplicate & Organize Files/Folders on a 200 TB NAS?:

@dbeato would need 256GB of RAM to attempt that with ZFS. That's a lot of RAM on a NAS.

I know. I was just saying, in other words, that FreeNAS and deduplication don't work well together.

scottalanmiller

@dbeato said in What's the Best Way to Deduplicate & Organize Files/Folders on a 200 TB NAS?:

I know. I was just saying, in other words, that FreeNAS and deduplication don't work well together.

I see, yes, it's a bit of a dilemma. In reality, nothing works great with dedupe; it's a difficult thing to do at large scale.

Obsolesce

@scottalanmiller said in What's the Best Way to Deduplicate & Organize Files/Folders on a 200 TB NAS?:

@dbeato would need 256GB of RAM to attempt that with ZFS. That's a lot of RAM on a NAS.

How did you get 256GB of RAM needed?

That FreeNAS article recommends 5GB RAM per 1 TB of deduped data...
Considering he has 200TB of data he'd want to dedup, that's at least 1TB of RAM to start.

This is because dedup on ZFS/FreeNAS is much more RAM intensive than all other file systems. (and also because 200TB is a ton of data)
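As a quick check of that arithmetic, here is a tiny sketch that applies both rules of thumb mentioned in this thread to 200 TB. The function name is made up, and the two ratios are simply the ones quoted here, not figures verified against current ZFS documentation.

```python
def estimated_ram_gb(data_tb, gb_per_tb):
    """RAM estimate from a flat GB-per-TB rule of thumb."""
    return data_tb * gb_per_tb


DATA_TB = 200  # size stated in the original post

with_dedup = estimated_ram_gb(DATA_TB, 5)     # 5 GB per TB rule for dedup
without_dedup = estimated_ram_gb(DATA_TB, 1)  # 1 GB per TB rule for plain ZFS

print(with_dedup)     # 1000 GB, i.e. about 1 TB of RAM
print(without_dedup)  # 200 GB
```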

scottalanmiller

@tim_g said in What's the Best Way to Deduplicate & Organize Files/Folders on a 200 TB NAS?:

How did you get 256GB of RAM needed?

That FreeNAS article recommends 5GB RAM per 1 TB of deduped data...
Considering he has 200TB of data he'd want to dedup, that's at least 1TB of RAM to start.

This is because dedup on ZFS/FreeNAS is much more RAM intensive than all other file systems. (and also because 200TB is a ton of data)

What caused it to balloon so much recently? Traditionally it has been 1GB per 1TB.

https://serverfault.com/questions/569354/freenas-do-i-need-1gb-per-tb-of-usable-storage-or-1gb-of-memory-per-tb-of-phys

Obsolesce

Would you consider looking for duplicate files from the server directory by directory, rather than everything all at once?

Maybe scan in 500,000 file chunks and start reducing it little by little manually.
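A rough sketch of what chunked scanning could look like, assuming a plain directory walk run wherever the files are mounted. The helper names are hypothetical, and the example uses a throwaway temp directory with a batch size of 2 instead of 500,000 so it can run anywhere.

```python
import os
import tempfile
from itertools import islice


def iter_files(root):
    """Yield file paths under root lazily, one directory at a time."""
    for dirpath, dirnames, filenames in os.walk(root):
        dirnames.sort()  # deterministic traversal order
        for name in sorted(filenames):
            yield os.path.join(dirpath, name)


def batches(iterable, batch_size):
    """Yield lists of at most batch_size items without loading everything at once."""
    iterator = iter(iterable)
    while True:
        chunk = list(islice(iterator, batch_size))
        if not chunk:
            return
        yield chunk


# Example on a temporary tree of five empty files.
with tempfile.TemporaryDirectory() as root:
    for index in range(5):
        open(os.path.join(root, f"file{index}.txt"), "w").close()
    chunk_sizes = [len(chunk) for chunk in batches(iter_files(root), 2)]

print(chunk_sizes)  # → [2, 2, 1]
```

Because both helpers are generators, memory use stays tied to the batch size rather than to the 10,000,000 total.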

Obsolesce

@scottalanmiller said in What's the Best Way to Deduplicate & Organize Files/Folders on a 200 TB NAS?:

What caused it to balloon so much recently? Traditionally it has been 1GB per 1TB.

https://serverfault.com/questions/569354/freenas-do-i-need-1gb-per-tb-of-usable-storage-or-1gb-of-memory-per-tb-of-phys

That's just for the ZFS file system itself.

If using deduplication, then 5 GB per TB. Dedup has its own requirements.

scottalanmiller

@tim_g said in What's the Best Way to Deduplicate & Organize Files/Folders on a 200 TB NAS?:

That's just for the ZFS file system itself.

If using deduplication, then 5 GB per TB. Dedup has its own requirements.

Oh right, poop. Yeah that's a lot of RAM needed.

DustinB3403

@tim_g said in What's the Best Way to Deduplicate & Organize Files/Folders on a 200 TB NAS?:

Would you consider looking for duplicate files from the server directory by directory, rather than everything all at once?

Maybe scan in 500,000 file chunks and start reducing it little by little manually.

This would likely be the only way to do it.

I've used a few different tools (Windows ones) that could scan directories and compare for hash matches. I'm sure there is a better Linux alternative.

HelloWill

I was hoping there was some type of server migration software or enterprise deduplication software that would be able to crawl all our data, store the results in some type of database and then allow us to parse the results.

When you throw 10MM files at traditional duplicate cleaners, they tend to blow up. Then, after you clean some parts up, guess what... you have to rescan and wait.
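To make the rescan problem concrete, here is a sketch of the kind of checksum cache I have in mind. It is only an assumption about how such a tool could work, not something an existing product is known to do: remember each file's size and modification time next to its MD5, and only re-read the file when either has changed. The function names are made up, and the cache is a plain dict here where a real tool would persist it in a database.

```python
import hashlib
import os
import tempfile


def md5_of(path, block_size=1 << 20):
    """Stream the file in 1 MiB blocks so large videos never sit fully in RAM."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for block in iter(lambda: handle.read(block_size), b""):
            digest.update(block)
    return digest.hexdigest()


def cached_md5(path, cache):
    """Return (md5, was_rehashed); rehash only if size or mtime changed."""
    stat = os.stat(path)
    fingerprint = (stat.st_size, stat.st_mtime_ns)
    entry = cache.get(path)
    if entry is not None and entry[0] == fingerprint:
        return entry[1], False
    checksum = md5_of(path)
    cache[path] = (fingerprint, checksum)
    return checksum, True


# Example on a temporary file: hash once, reuse, then rehash after a change.
with tempfile.TemporaryDirectory() as tmp:
    target = os.path.join(tmp, "report.txt")
    with open(target, "wb") as handle:
        handle.write(b"hello")
    cache = {}
    first = cached_md5(target, cache)   # file is read and hashed
    second = cached_md5(target, cache)  # served from the cache
    with open(target, "wb") as handle:
        handle.write(b"hello, world")   # size changes, so the entry is stale
    third = cached_md5(target, cache)   # file is read and hashed again

print(first[1], second[1], third[1])
```

After a cleanup pass, only files that moved or changed would need to be read again.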

There has to be a better way. Block-level deduplication solves part of the storage size equation, but it doesn't address the root cause of the problem, which is poor data governance. The challenge is going from messy to organized in an efficient manner.

Has anybody used this, or know of something similar?
http://www.valiancepartners.com/data-migration-tools/trucompare-data-migration-testing/

DustinB3403

@hellowill the biggest issue is that you have way too many files and not enough resources to scan and dedup the system live.

Your only reasonable approach is to do this in smaller chunks, since we can reasonably assume you don't have a TB+ of RAM to throw at this job, nor anywhere to store the updated files.