Checking multiple Directories to confirm all files are identical
-
@eddiejennings said in Checking multiple Directories to confirm all files are identical:
To make sure I'm understanding what you want to do: let's say you have dir1 with files a, b, and c, and dir2 with files d, e, and f. You're wanting to do the following check for duplicates: is a a duplicate of b, c, d, e, and f? Is b a duplicate of c, d, e, and f? And so on, correct?
Yes and no, I want to make sure that dir2 is an exact copy of dir1 (and lastly compare dir3 to dir1 and dir2).
-
Also, all of these directories (dir2 and dir3) are on remote servers, so I'd have to do this over a UNC share:
D:\dir1
\\srv2\dir2
\\srv3\dir3
-
While I'm almost positive the PowerShell above would work, I suspect it would only work on much smaller directories.
Each directory that I'm trying to compare is over 10 TB in capacity.
-
@dustinb3403 said in Checking multiple Directories to confirm all files are identical:
@eddiejennings said in Checking multiple Directories to confirm all files are identical:
To make sure I'm understanding what you want to do: let's say you have dir1 with files a, b, and c, and dir2 with files d, e, and f. You're wanting to do the following check for duplicates: is a a duplicate of b, c, d, e, and f? Is b a duplicate of c, d, e, and f? And so on, correct?
Yes and no, I want to make sure that dir2 is an exact copy of dir1 (and lastly compare dir3 to dir1 and dir2).
The doing part can be easily done with robocopy /MIR. Of course, it'll take a while given the number of files. The reporting part is the challenge. You might want to look into using Get-FileHash.
That's how I typically compare files, but I've never done a comparison at that scale before.
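For the reporting part, robocopy's list-only mode might help: a minimal sketch, assuming the example paths above, that reports what /MIR would change without actually copying anything:
# /L = list only (no copying), /NJH and /NJS drop the job header and summary,
# /NP suppresses the progress counter, /FP logs full paths.
robocopy D:\dir1 \\srv2\dir2 /MIR /L /NJH /NJS /NP /FP /LOG:C:\dir1-vs-dir2.txt
The log then contains only the new, changed, and extra files, one per line.
-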
@eddiejennings Yeah, I was thinking of the same solution as well. My trouble is how I would get the system to not try and store everything in memory first and then write to file...
Some of these customer requests are insane...
-
@dustinb3403 said in Checking multiple Directories to confirm all files are identical:
@flaxking said in Checking multiple Directories to confirm all files are identical:
I would think you would be able to use robocopy to do a diff
Probably, but the issue still comes down to system resources.
Anything that is storing in memory will quickly consume the available resources.
Maybe if I pipe the output to a file it won't be so bad...
It's bound to be a lot more efficient than your PowerShell.
-
@dustinb3403 said in Checking multiple Directories to confirm all files are identical:
@eddiejennings Yeah, I was thinking of the same solution as well. My trouble is how I would get the system to not try and store everything in memory first and then write to file...
Some of these customer requests are insane...
If you do the equivalent of md5sum with subdirectories you will get MD5 sums of all files. A diff will produce the different files.
File size or directory size will not matter at all for this operation. Get-FileHash seems to output multiple lines per file, which is not good for this.
If you don't need hashes to compare and just want to check filenames, file sizes, and dates, maybe you should just do a directory listing for each tree and compare them. That would be very fast.
You could get dir to provide a one-file-per-line output, with the proper options.
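A minimal sketch of that listing idea in PowerShell (the paths and delimiter are assumptions): one line per file with relative path, size, and last-write time, ready to diff against the same listing produced on another server:
# Stream every file; emit "relative-path|size|timestamp" one per line.
Get-ChildItem -Path D:\dir1 -Recurse -File |
    ForEach-Object { '{0}|{1}|{2:o}' -f $_.FullName.Substring('D:\dir1\'.Length), $_.Length, $_.LastWriteTimeUtc } |
    Set-Content C:\dir1-listing.txt
-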
@pete-s said in Checking multiple Directories to confirm all files are identical:
If you don't need hash to compare and just wanted to check filenames, file sizes and dates, maybe you should just do a directory listing for each tree and compare them. That would be very fast.
While I don't need the hashes of the files, I was hoping for some automated way of saying "these files aren't in dir#".
But I can't for the life of me think of a good way to do that without eating up all of the RAM in the world...
-
@flaxking said in Checking multiple Directories to confirm all files are identical:
@dustinb3403 said in Checking multiple Directories to confirm all files are identical:
@flaxking said in Checking multiple Directories to confirm all files are identical:
I would think you would be able to use robocopy to do a diff
Probably, but the issue still comes down to system resources.
Anything that is storing in memory will quickly consume the available resources.
Maybe if I pipe the output to a file it won't be so bad...
It's bound to be a lot more efficient than your PowerShell.
It's still going to consume more RAM than any host in the environment has to process the job. Just between any two directories there are over 20 million files.
-
I know WinMerge has a folder comparison feature, but I'm not sure it can handle your file count.
-
@danp said in Checking multiple Directories to confirm all files are identical:
I know WinMerge has a folder comparison feature, but I'm not sure it can handle your file count.
It might be worth a try, I hadn't thought of it.
-
@dustinb3403 That's what I was thinking.
You'll still be in the shape of how do you compare two stupidly large files, though.
-
@dafyre said in Checking multiple Directories to confirm all files are identical:
@dustinb3403 That's what I was thinking.
You'll still be in the shape of how do you compare two stupidly large files, though.
Yeah, while that is certainly a part of the challenge, the larger portion is just checking to see if the bulk is all aligned and matching...
If any tooling had some way to "skip large files" and just jot down their names, then a simple stare-and-compare might work in that case.
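A hedged aside: robocopy can roughly do that with its size filters (the threshold and paths here are assumptions):
# /MAX:n excludes files bigger than n bytes, so this list-only diff covers the small files;
# a second pass with /MIN:n would jot down just the names of the large ones.
robocopy D:\dir1 \\srv2\dir2 /MIR /L /MAX:1073741824 /NJH /NJS /NP /FP /LOG:C:\small-files-diff.txt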
-
@dustinb3403 said in Checking multiple Directories to confirm all files are identical:
@flaxking said in Checking multiple Directories to confirm all files are identical:
@dustinb3403 said in Checking multiple Directories to confirm all files are identical:
@flaxking said in Checking multiple Directories to confirm all files are identical:
I would think you would be able to use robocopy to do a diff
Probably, but the issue still comes down to system resources.
Anything that is storing in memory will quickly consume the available resources.
Maybe if I pipe the output to a file it won't be so bad...
It's bound to be a lot more efficient than your PowerShell.
It's still going to consume more RAM than any host in the environment has to process the job. Just between any two directories there are over 20 million files.
I don't know how it's implemented, so I can't say. Just create a new PowerShell script that doesn't store as much in memory. I think if you pipe to ForEach-Object, it actually starts operating before Get-ChildItem gets all the objects; if you then don't store those objects in a variable, garbage collection may start before you are done.
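A minimal sketch of that streaming approach (paths are assumptions): each file flows through the pipeline one at a time, and nothing is accumulated in a variable, so memory use stays roughly flat regardless of file count:
# Hash each file as it streams by and write one "hash  path" line per file.
Get-ChildItem -Path D:\dir1 -Recurse -File |
    ForEach-Object { Get-FileHash -Path $_.FullName -Algorithm MD5 } |
    ForEach-Object { '{0}  {1}' -f $_.Hash, $_.Path } |
    Out-File C:\dir1-hashes.txt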
-
@dustinb3403 said in Checking multiple Directories to confirm all files are identical:
@dafyre said in Checking multiple Directories to confirm all files are identical:
@dustinb3403 That's what I was thinking.
You'll still be in the shape of how do you compare two stupidly large files, though.
Yeah, while that is certainly a part of the challenge, the larger portion is just checking to see if the bulk is all aligned and matching...
If any tooling had some way to "skip large files" and just jot down their names, then a simple stare-and-compare might work in that case.
So are you looking to compare bit for bit, or just file name and size?
-
@dafyre said in Checking multiple Directories to confirm all files are identical:
@dustinb3403 said in Checking multiple Directories to confirm all files are identical:
@dafyre said in Checking multiple Directories to confirm all files are identical:
@dustinb3403 That's what I was thinking.
You'll still be in the shape of how do you compare two stupidly large files, though.
Yeah, while that is certainly a part of the challenge, the larger portion is just checking to see if the bulk is all aligned and matching...
If any tooling had some way to "skip large files" and just jot down their names, then a simple stare-and-compare might work in that case.
So are you looking to compare bit for bit, or just file name and size?
Name, size, and date ideally. Bit-for-bit is overkill, and I can't imagine the client would want to wait who knows how long to get an answer for this.
-
One thought: run an MD5 hash and output filename, date, and hash to a file, then compare the contents of the files between the servers.
You could run the job individually on each server so all three-plus devices could run at once, assuming they're not all on the same VM host.
-
@dashrender said in Checking multiple Directories to confirm all files are identical:
One thought: run an MD5 hash and output filename, date, and hash to a file, then compare the contents of the files between the servers.
You could run the job individually on each server so all three-plus devices could run at once, assuming they're not all on the same VM host.
That actually isn't a bad idea. Still time-consuming, but it would probably be way more lightweight than trying to perform a live comparison between the systems.
Just use something like Meld to compare the text files after the fact.
-
@dashrender said in Checking multiple Directories to confirm all files are identical:
One thought: run an MD5 hash and output filename, date, and hash to a file, then compare the contents of the files between the servers.
You could run the job individually on each server so all three-plus devices could run at once, assuming they're not all on the same VM host.
This appears to be working, thanks for the idea!
dir D:\Files -Recurse | Get-FileHash -ea Continue > C:\D-Files.txt
Dumps out one large file; I could then use a file comparison tool to quickly check these outputs.
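A small refinement of that command (the output formatting is an assumption): selecting just the hash and path gives one line per file, which sidesteps the multi-line table output mentioned earlier in the thread:
# -File skips directories (which Get-FileHash would error on);
# emit one "hash  path" line per file instead of the default wrapped table.
dir D:\Files -Recurse -File |
    Get-FileHash -ErrorAction Continue |
    ForEach-Object { '{0}  {1}' -f $_.Hash, $_.Path } |
    Set-Content C:\D-Files.txt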
-
@dustinb3403 You'll likely have to do some type of line fixup, i.e. if a file is missing, then every line after that would be a mismatch...
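One way around that line-shift problem (a hedged sketch; the second file name is hypothetical, and the server-specific path prefixes would need to be stripped from both listings first): Compare-Object matches lines as sets rather than by position, so a single missing file shows up as one difference instead of misaligning every line after it:
# Lines present in only one listing are reported with a side indicator (<= / =>).
Compare-Object (Get-Content C:\D-Files.txt) (Get-Content C:\srv2-Files.txt)
Note that this reads both listings into memory; with ~20 million lines apiece, sorting both files and doing a line-by-line diff may be the safer route.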