OCR incoming fax PDFs in OSCAR


You can use ocrmypdf, an open source command-line tool built on the Tesseract OCR engine, to OCR PDFs from a Bash script. OCR (Optical Character Recognition) makes PDFs searchable, so OCRing incoming faxes and documents makes it easier to find information in them.

https://ocrmypdf.readthedocs.io/en/latest/
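As a minimal local sketch of the command described above, the helper below OCRs one PDF and also writes the recognized text to a separate file using ocrmypdf's `--sidecar` option. The function name `ocr_one`, the `OCRMYPDF` override variable, and the `txt` subfolder layout are assumptions for illustration, not part of the original script (which uses pdftotext for the text step instead).

```shell
# Command to run; OCRMYPDF is a hypothetical override hook so the call can be
# stubbed out (for example with "echo") for a dry run.
OCRMYPDF="${OCRMYPDF:-ocrmypdf}"

# ocr_one INPUT.pdf OUTDIR
# Writes OUTDIR/<name>.pdf (searchable) and OUTDIR/txt/<name>.txt (recognized text).
ocr_one() {
    local in="$1" outdir="$2" base
    base="$(basename "$in" .pdf)"
    mkdir -p "$outdir/txt"
    # --sidecar FILE asks ocrmypdf to save the recognized text to FILE
    "$OCRMYPDF" --sidecar "$outdir/txt/$base.txt" "$in" "$outdir/$base.pdf"
}

# Example call (paths are placeholders):
#   ocr_one "/path/to/incoming/fax 001.pdf" /path/to/ocr
```

Quoting every path keeps filenames with spaces intact when the command runs locally.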

Take the folder of incoming fax PDFs, OCR each file, and put the results into a separate OCR'ed folder, which you can still dump to your inbox or wherever you want. You can also run the OCR processing on a backup server so that it does not load your main server.
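The folder-to-folder idea can be sketched as a small helper that lists only the PDFs not yet present in the OCR'ed folder, so a repeated run does not redo finished files. The name `pending_pdfs` and the rule "same filename in the output folder means done" are assumptions for illustration.

```shell
# pending_pdfs SRC_DIR DST_DIR
# Prints one line per PDF in SRC_DIR that has no same-named file in DST_DIR.
pending_pdfs() {
    local src="$1" dst="$2" f
    for f in "$src"/*.pdf; do
        [ -e "$f" ] || continue    # the glob matched nothing
        [ -e "$dst/$(basename "$f")" ] || printf '%s\n' "$f"
    done
}

# Example use (paths are placeholders):
#   pending_pdfs /path/to/incoming /path/to/ocr | while IFS= read -r f; do
#       ocrmypdf "$f" "/path/to/ocr/$(basename "$f")"
#   done
```

Reading the list line by line with `IFS= read -r` keeps filenames with spaces in one piece.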

The script makes an OCRed PDF and a text file extracted from it. You can then run regular expressions over the text file to find the OHIP number and other keywords and classify the document.
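A rough sketch of that regex step follows. It assumes an OHIP number appears as 10 digits, optionally grouped 4-3-3 with spaces or dashes and optionally followed by a one- or two-letter version code. The function names, keyword lists, and category labels are hypothetical. Any ungrouped 10-digit number (a phone number, for instance) will also match, and OCR noise can break a real number, so treat the result as a hint rather than a confirmed match.

```shell
# find_ohip TEXTFILE
# Prints the first OHIP-like number in the file with separators removed,
# or nothing if no candidate is found.
find_ohip() {
    grep -Eo -m1 '\b[0-9]{4}[ -]?[0-9]{3}[ -]?[0-9]{3}([ -]?[A-Z]{1,2})?\b' "$1" \
        | head -n1 | tr -d ' -'
}

# classify_fax TEXTFILE
# Prints a hypothetical category label based on simple keyword matches.
classify_fax() {
    if grep -Eqi 'laborator|specimen|reference range' "$1"; then
        echo lab
    elif grep -Eqi 'consult|referral' "$1"; then
        echo consult
    else
        echo unclassified
    fi
}

# Example use (path is a placeholder):
#   find_ohip "/path/to/ocr/txt/fax001.pdf.txt"
#   classify_fax "/path/to/ocr/txt/fax001.pdf.txt"
```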

Shared by Ian Pun.

#!/bin/bash
# ocrmydocumentbuffer.sh
#
# By Ian Pun, March 2022. Runs ocrmypdf, then pdftotext, on every PDF in the local FILES directory.
#
# Each PDF is copied to the directory ~/OCRfile on the remote server.
# ocrmypdf and pdftotext must be installed on the remote server.
# Both commands process the file on the remote server,
# and the results are copied back to the local OCR_FILES directory. The subdirectory OCR_FILES/txt stores the text extracted from each PDF.
#
# forceocrflag - if set, forces OCR by re-rasterizing the PDF
# Known issue: filenames with spaces
#
# The local and remote servers must already be set up for key-based SSH, with the private/public keys stored in ~/.ssh

REMOTESERVERPORT=remoteSSHPORT
REMOTEUSER="OSCAR@BACKUPIP"

OSCARDOCUSER="[email protected]"
SQLPASSWORD="OSCARPASSWORD"
SQLDB="oscar_15"


#forceocrflag="--force-ocr"

forceocrflag=""

FILES="PUT YOUR file TO BE OCRed here"
OCR_FILES="THE OCR files are here"


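# Quote the path so directories with spaces work; stop if the directory is missing
cd "$FILES" || exit 1

# Exit if there is nothing to process
[ "$(ls "$FILES")" ] && echo "Not Empty" || exit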


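# Make sure the local folder for the extracted text exists
mkdir -p "${OCR_FILES}/txt"

for f in *.pdf; do

    # When no PDF is present the glob stays as the literal "*.pdf"; skip it
    [ -e "$f" ] || continue

    echo "$f"

    # Copy the PDF to the remote server
    scp -P "${REMOTESERVERPORT}" "${f}" "${REMOTEUSER}:~/OCRfile"

    # OCR it on the remote server, then copy the OCRed PDF back
    # The inner single quotes protect spaces in the filename from the remote shell.
    # Newer OpenSSH versions (9.0 and later) run scp over SFTP by default and may take
    # those quotes literally; if the copy back fails, try adding -O to the scp commands.
    ssh -p "${REMOTESERVERPORT}" "${REMOTEUSER}" "ocrmypdf ${forceocrflag} ~/OCRfile/'${f}' ~/OCRfile/'OCR_${f}'"
    scp -P "${REMOTESERVERPORT}" "${REMOTEUSER}:~/OCRfile/'OCR_${f}'" "${OCR_FILES}/${f}"

    # Extract the text from the OCRed PDF on the remote server, then copy it back
    ssh -p "${REMOTESERVERPORT}" "${REMOTEUSER}" "pdftotext ~/OCRfile/'OCR_${f}' ~/OCRfile/'OCR_${f}.txt'"
    scp -P "${REMOTESERVERPORT}" "${REMOTEUSER}:~/OCRfile/'OCR_${f}.txt'" "${OCR_FILES}/txt/${f}.txt"

done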

cd