OCR Document Detection Script
=============================

Purpose
-------
This script automatically scans images synchronized via Syncthing, detects
whether an image contains a document or a normal photo, and moves detected
documents into a Paperless consume folder.

The system runs completely locally and does not require cloud services.

General Workflow
----------------
1. Syncthing synchronizes images from the phone to the server.
2. The script runs periodically (Task Scheduler).
3. Each new image is processed once:
   - OCR is executed using Tesseract
   - Text quality is evaluated
   - The image is classified as DOCUMENT or PHOTO
4. Documents are renamed based on OCR text.
5. Documents are moved into the consume folder.
6. Processed files are recorded in processed.txt to avoid reprocessing.

Directory Structure
-------------------
C:\Script\
    ocr-scan.ps1     Main script
    config.txt       Configuration file
    processed.txt    List of already processed files
    README.txt       Documentation

Configuration
-------------
All configurable parameters are stored in config.txt.

Examples:
- sourcePath    Folder monitored for new images
- targetPath    Paperless consume folder
- OCR language
- detection thresholds

Important Design Decisions
--------------------------
- Files are never modified after syncing.
- No renaming in source folder (prevents Syncthing loops).
- Processing state is stored locally only.
- No database required.
- OCR runs only once per file.

Detection Logic
---------------
A document is detected when:
- OCR finds enough alphabetic characters
- Several long words exist in the recognized text

This prevents normal photos from being classified as documents, even when
OCR produces noise.

Extending the Script
--------------------
Future improvements should follow these principles:
- Add new functions instead of modifying existing ones.
- Keep OCR, classification, and file operations separated.

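As an illustration of keeping classification separate from OCR and file
operations, the rule described under Detection Logic could live in a function
of its own. The following is only a minimal sketch: the function name
Test-IsDocument, its parameter names, and the default threshold values are
assumptions made for this example and are not taken from ocr-scan.ps1 or
config.txt.

```powershell
# Hypothetical helper: decides whether OCR output looks like a document.
# Name, parameters, and default thresholds are illustrative assumptions.
function Test-IsDocument {
    param(
        [string]$Text,
        [int]$MinLetters = 40,      # assumed: minimum number of alphabetic characters
        [int]$MinLongWords = 3,     # assumed: how many long words count as "several"
        [int]$LongWordLength = 5    # assumed: minimum length of a "long" word
    )

    # Count alphabetic characters (Unicode letters, so umlauts are included).
    $letterCount = [regex]::Matches($Text, '\p{L}').Count

    # Count runs of letters that are at least $LongWordLength characters long.
    $longWordCount = [regex]::Matches($Text, "\p{L}{$LongWordLength,}").Count

    # Both conditions must hold; OCR noise from photos rarely satisfies both.
    return ($letterCount -ge $MinLetters) -and ($longWordCount -ge $MinLongWords)
}
```

The function takes only text and returns a boolean, so it can be tested
without running Tesseract or touching any files. In a real setup the
thresholds would presumably be read from config.txt rather than hard-coded
as defaults.
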
Typical future extensions:
- Document type detection (receipt, invoice, shipping label)
- Improved filename generation
- Logging to file
- Module-based classification rules

Requirements
------------
- Windows Server / Windows
- PowerShell
- Tesseract OCR
- ImageMagick
- Syncthing (optional)

Notes
-----
This script is designed for stability and simplicity.
Avoid adding complex AI or external services unless required.