OCR documents scanned in to folder

Mike Davis

10 or 15 years ago I set up OmniPage to watch a folder: if a .pdf was dropped in, it would OCR it and drop the result (either an OCR'd .pdf or a Word document) into another folder. It's starting to have issues, and I wondered what others are doing.

Does anyone know of a program that can be set to monitor a folder and OCR the documents that drop in there without user interaction?

The application is that the company has lots of scanners around the office and when they walk up, they just hit the scan template that says "OCR pdf" and hit the green button. They go back to their desk and the output file is sitting on a network share they have mapped.

Dashrender

What's the issue you are having?

Mike Davis

For some reason, the queue seems to get jammed up and it stops processing documents. They'll clear them all out and restart the server, and then it will sometimes work again. I haven't looked at it myself to troubleshoot it better.

Dashrender

It's been working for 10-15 years. Why is it suddenly not? A conflict with a system update? I'd definitely look into why it's failing before looking to just replace it.

Mike Davis

I have to wonder if they upped the resolution of the scan template or something. That's what I would check if they let me check it out before messing with it.
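
One quick way to test the resolution theory before changing anything would be to list what is sitting in the watched folder, with file size and age. This is only a sketch: the function names are made up for illustration, and the folder path is whatever the scanners actually write to. Unusually large files, or files that have been waiting a long time, would point at where the queue stalls.

```python
import time
from pathlib import Path


def queue_report(folder, now=None):
    """Return (name, size_mb, age_minutes) for each PDF, largest first."""
    now = time.time() if now is None else now
    rows = []
    for p in Path(folder).glob("*.pdf"):
        st = p.stat()
        size_mb = round(st.st_size / 1_000_000, 1)
        age_min = round((now - st.st_mtime) / 60, 1)
        rows.append((p.name, size_mb, age_min))
    # Largest files first, since oversized scans are the suspect here.
    return sorted(rows, key=lambda r: r[1], reverse=True)


def print_report(folder):
    """Print one line per queued PDF: size, age, name."""
    for name, size_mb, age_min in queue_report(folder):
        print(f"{size_mb:8.1f} MB  {age_min:8.1f} min  {name}")
```

Calling `print_report` on the input folder while the queue is jammed should show whether a handful of very large scans are sitting at the front of it.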

travisdh1

Network scanning has always been a pain in my neck. Wondering how others have automated this.

On a linux box I'd just be running a small script that would use tesseract-ocr, and have it run from cron every 15 seconds. Something like that anyway.
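
A rough sketch of what such a script could look like, with some assumptions stated up front. Tesseract on its own does not take a PDF as input, so this assumes the `ocrmypdf` command-line tool, which wraps Tesseract and writes a searchable PDF. Cron's smallest interval is one minute, so a 15-second cycle would need a loop with a sleep (or a service) rather than a plain crontab entry. The folder layout and all function names here are hypothetical.

```python
import shutil
import subprocess
import time
from pathlib import Path


def pending_pdfs(inbox, min_age_seconds=5.0):
    """Return PDFs in inbox that have not been modified in the last few seconds.

    Skipping very new files avoids grabbing a scan that is still being written.
    Note that glob("*.pdf") is case-sensitive on Linux.
    """
    now = time.time()
    return sorted(
        p for p in Path(inbox).glob("*.pdf")
        if p.is_file() and now - p.stat().st_mtime >= min_age_seconds
    )


def ocr_command(src, outbox):
    """Build the command line for one file (assumes ocrmypdf is installed)."""
    return ["ocrmypdf", str(src), str(Path(outbox) / Path(src).name)]


def run_ocr(src, outbox):
    """Run the OCR tool on one file; True means it exited successfully."""
    return subprocess.run(ocr_command(src, outbox)).returncode == 0


def process_once(inbox, outbox, done, ocr=run_ocr):
    """OCR every pending PDF once and return how many succeeded.

    Originals that succeed are moved to `done`; failures stay in the inbox
    and are retried on the next pass.
    """
    count = 0
    for src in pending_pdfs(inbox):
        if ocr(src, outbox):
            shutil.move(str(src), str(Path(done) / src.name))
            count += 1
    return count


def watch(inbox, outbox, done, interval=15):
    """Poll forever, sleeping `interval` seconds between passes."""
    while True:
        process_once(inbox, outbox, done)
        time.sleep(interval)
```

Something like `watch("/srv/scans/in", "/srv/scans/out", "/srv/scans/done")` would start it, with those paths standing in for the real share. Because failed files are retried on every pass, a file that never converts would be worth moving aside after a few attempts so it cannot hold up the rest.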

Dashrender

@travisdh1 said in OCR documents scanned in to folder:

Network scanning has always been a pain in my neck. Wondering how others have automated this.

On a linux box I'd just be running a small script that would use tesseract-ocr, and have it run from cron every 15 seconds. Something like that anyway.

Well, it's on github, is it even available anymore?

travisdh1

@Dashrender said in OCR documents scanned in to folder:

@travisdh1 said in OCR documents scanned in to folder:

Network scanning has always been a pain in my neck. Wondering how others have automated this.

On a linux box I'd just be running a small script that would use tesseract-ocr, and have it run from cron every 15 seconds. Something like that anyway.

Well, it's on github, is it even available anymore?

Eich, hopefully. Should be available via repositories if not directly from github anymore.

Mike Davis

For this project I can't really consider linux because I can't really support linux.

scottalanmiller

@Dashrender said in OCR documents scanned in to folder:

@travisdh1 said in OCR documents scanned in to folder:

Network scanning has always been a pain in my neck. Wondering how others have automated this.

On a linux box I'd just be running a small script that would use tesseract-ocr, and have it run from cron every 15 seconds. Something like that anyway.

Well, it's on github, is it even available anymore?

I think you are confusing GitLab with GitHub unless you know something that I don't.

Dashrender

@scottalanmiller said in OCR documents scanned in to folder:

@Dashrender said in OCR documents scanned in to folder:

@travisdh1 said in OCR documents scanned in to folder:

Network scanning has always been a pain in my neck. Wondering how others have automated this.

On a linux box I'd just be running a small script that would use tesseract-ocr, and have it run from cron every 15 seconds. Something like that anyway.

Well, it's on github, is it even available anymore?

I think you are confusing GitLab with GitHub unless you know something that I don't.

Did you really just ask that?

scottalanmiller

@Mike-Davis said in OCR documents scanned in to folder:

For this project I can't really consider linux because I can't really support linux.

But isn't it Windows that is unsupportable here?

Having to use special software on Windows vs. a basic script on Linux seems like a support nightmare compared to something that should "just work."

scottalanmiller

@Dashrender said in OCR documents scanned in to folder:

@scottalanmiller said in OCR documents scanned in to folder:

@Dashrender said in OCR documents scanned in to folder:

@travisdh1 said in OCR documents scanned in to folder:

Network scanning has always been a pain in my neck. Wondering how others have automated this.

On a linux box I'd just be running a small script that would use tesseract-ocr, and have it run from cron every 15 seconds. Something like that anyway.

Well, it's on github, is it even available anymore?

I think you are confusing GitLab with GitHub unless you know something that I don't.

Did you really just ask that?

So GitHub is fine as usual?

Mike Davis

@scottalanmiller said in OCR documents scanned in to folder:

Having to use special software on Windows vs. a basic script on Linux seems like a support nightmare compared to something that should "just work."

I've never set up a cron job in linux. Their onsite IT has never logged in to linux. I don't think it would be a good idea to put something like that in my customer's environment.

scottalanmiller

@Mike-Davis said in OCR documents scanned in to folder:

@scottalanmiller said in OCR documents scanned in to folder:

Having to use special software on Windows vs. a basic script on Linux seems like a support nightmare compared to something that should "just work."

I've never set up a cron job in linux. Their onsite IT has never logged in to linux. I don't think it would be a good idea to put something like that in my customer's environment.

You've never used whatever new, untested thing you'd use on Windows either. The differences would be:

One is free, one is costly.
One is enterprise battle tested, one... who knows.
One is industry standard and can be supported by anyone, the other... who knows.
One will keep itself fully updated and patched for a decade or more and can be trivially updated beyond that.

That their onsite IT isn't prepared for simple tasks should not necessarily imply that we don't provide good solutions. It just means that their IT is not prepared to support anything. It is what it is. If having never used Linux is a reason to not consider Linux, then surely that logic applies to getting a new product to support as well.

In many ways, the logic you use to rule out Linux would also rule it in.

Mike Davis

Can you give me an estimate in number of hours to build a linux box and configure that package?

scottalanmiller

@Mike-Davis said in OCR documents scanned in to folder:

Can you give me an estimate in number of hours to build a linux box and configure that package?

I don't know anything about the OCR piece. But time to build a box is normally about five minutes for me. The script, maybe ten to fifteen. The real issues will be time to download the ISO for them and questions about their environment. The Linux and cron pieces are essentially zero effort items. All of the factors that might create effort are the parts we don't know about.

JaredBusch

@scottalanmiller said in OCR documents scanned in to folder:

@Mike-Davis said in OCR documents scanned in to folder:

Can you give me an estimate in number of hours to build a linux box and configure that package?

I don't know anything about the OCR piece. But time to build a box is normally about five minutes for me. The script, maybe ten to fifteen. The real issues will be time to download the ISO for them and questions about their environment. The Linux and cron pieces are essentially zero effort items. All of the factors that might create effort are the parts we don't know about.

Hello, real world calling.

Time to build a box != 5 minutes ever. Time for you to spin up a VM from a template and configure the basics, I would accept.

Even assuming that the latest CentOS 7 release ISO was on his client's infrastructure and ready to attach, it would take more time than that to configure the new VM, boot, install, reboot, update, and configure.

scottalanmiller

@JaredBusch said in OCR documents scanned in to folder:

@scottalanmiller said in OCR documents scanned in to folder:

@Mike-Davis said in OCR documents scanned in to folder:

Can you give me an estimate in number of hours to build a linux box and configure that package?

I don't know anything about the OCR piece. But time to build a box is normally about five minutes for me. The script, maybe ten to fifteen. The real issues will be time to download the ISO for them and questions about their environment. The Linux and cron pieces are essentially zero effort items. All of the factors that might create effort are the parts we don't know about.

Hello, real world calling.

Time to build a box != 5 minutes ever. Time for you to spin up a VM from a template and configure the basics, I would accept.

Even assuming that the latest CentOS 7 release ISO was on his client's infrastructure and ready to attach, it would take more time than that to configure the new VM, boot, install, reboot, update, and configure.

That's why the environment matters. I can build a VM locally from existing templates and ship it digitally, ready to go. It just needs the latest updates (normally two minutes), an IP address, and a hostname. The time to transfer the file isn't part of the five minutes, but it doesn't take labour time, either.