Massive Searchable Document/File Repository
-
The short of it is this: I have roughly 30 TB of data. A lot of it is irrelevant, but what I need is to be able to search inside the text, Office, and PDF files. Anything textual needs to be searchable (globally).
I'm trying Nextcloud, but I'm not intelligent enough to make it work. Does anyone have experience with such an endeavor?
-
Also, there are lots of PDFs here, so I will have to OCR as well. If we have a Linux guru on here who has some time to help out, I'm sure Nextcloud can do it. I'm just not able to.
-
https://nextcloud.com/industries/legal/
This is essentially what I'd love to have happen.
Is Windows Server File Indexing an option? If so, what about OCR en masse?
-
I can't say for sure, but I think what you're looking for surpasses Nextcloud's feature set.
With 30 TB you're getting into serious DM space. I'm not sure if it'll suit your needs, but Alfresco might be worth a look.
-
@hubtechagain said in Massive Searchable Document/File Repository:
Is Windows Server File Indexing an option? If so, what about OCR en masse?
Yes, Windows file indexing will index all the document types you mentioned except PDFs. You need the Adobe PDF iFilter installed (free) to index text PDF files. If your PDF documents are scanned images, then you'd need Adobe Acrobat to OCR them en masse, but after that the PDFs will be indexable.
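If you'd rather script the "OCR en masse" step than click through Acrobat, here is a rough sketch of one way it could look. It swaps in the open-source OCRmyPDF command-line tool as an assumption (it is not what's described above and I haven't run it at this scale), and the function names and folder layout are made up for illustration. `--skip-text` leaves pages that already have a text layer alone, and `--sidecar` writes the recognized text to a separate .txt file.

```python
import subprocess
from pathlib import Path


def find_pdfs(root):
    """Return every PDF under root, sorted, matching the extension case-insensitively."""
    return sorted(
        p for p in Path(root).rglob("*")
        if p.is_file() and p.suffix.lower() == ".pdf"
    )


def build_ocr_command(src, dst):
    """Build the OCRmyPDF call for one file (assumes `ocrmypdf` is on PATH)."""
    dst = Path(dst)
    return [
        "ocrmypdf",
        "--skip-text",                              # don't re-OCR pages that already have text
        "--sidecar", str(dst.with_suffix(".txt")),  # also dump the recognized text
        str(src),
        str(dst),
    ]


def ocr_tree(src_root, dst_root, dry_run=True):
    """Mirror src_root into dst_root, OCRing each PDF; returns the commands built."""
    commands = []
    for pdf in find_pdfs(src_root):
        out = Path(dst_root) / pdf.relative_to(src_root)
        cmd = build_ocr_command(pdf, out)
        commands.append(cmd)
        if not dry_run:
            out.parent.mkdir(parents=True, exist_ok=True)
            # check=False so one bad file doesn't stop the whole batch
            subprocess.run(cmd, check=False)
    return commands
```

With `dry_run=True` (the default) nothing is executed; you just get the list of commands back, which is a cheap way to sanity-check paths before letting it loose on the real share.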
-
@notverypunny Well, not nearly all of it is actual text. A lot of it is computer images etc. Really I just need the OCR full-text search to work so we can dig through the readable data.
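Since only part of the 30 TB is readable text, a first pass could simply count what is there before any OCR or indexing runs. A minimal sketch in Python, standard library only; the extension lists and bucket names are assumptions to adjust for whatever is actually on the share:

```python
import os
from collections import defaultdict
from pathlib import Path

# Hypothetical buckets: tune these sets to the real data.
TEXT_EXTS = {".txt", ".csv", ".rtf", ".doc", ".docx", ".xls", ".xlsx", ".ppt", ".pptx"}
OCR_EXTS = {".pdf", ".tif", ".tiff"}  # may or may not need OCR; can't tell from the name


def bucket_for(name):
    """Classify a file by extension only (case-insensitive)."""
    ext = Path(name).suffix.lower()
    if ext in TEXT_EXTS:
        return "text"
    if ext in OCR_EXTS:
        return "maybe_ocr"
    return "other"


def inventory(root):
    """Walk root and return {bucket: {"files": count, "bytes": total_size}}."""
    totals = defaultdict(lambda: {"files": 0, "bytes": 0})
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            full = os.path.join(dirpath, name)
            try:
                size = os.path.getsize(full)
            except OSError:
                continue  # unreadable or vanished file: skip it
            entry = totals[bucket_for(name)]
            entry["files"] += 1
            entry["bytes"] += size
    return dict(totals)
```

Running `inventory()` on the top-level folder gives a rough idea of how much of the data is worth feeding to OCR and a search index at all, versus how much falls into "other" and can be left out.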
-
@notverypunny What's your experience with Alfresco?
-
@hubtechagain Minimal, to be honest. I'd looked at it a couple of times for replacing file servers, but the combination of inertia and overall lack of buy-in from the stakeholders meant that I never really got past the testing / demo / proof-of-concept phase.
-
Actually, MayanEDMS might be what you're looking for. It does OCR and indexing. I have a running instance, but I haven't used it at all yet.
-
@marcinozga said in Massive Searchable Document/File Repository:
Actually, MayanEDMS might be what you're looking for. It does OCR and indexing. I have a running instance, but I haven't used it at all yet.
This looks interesting. I wonder how well it can catalog other digital assets (images, video, etc.).