Paperless-ngx Stack

A document management system that transforms your physical documents into a searchable online archive. Scan, index, and archive all your documents with powerful OCR and AI-powered organization.

Services Overview

webserver: Main Paperless-ngx application with web interface and API
db: PostgreSQL database for document metadata and full-text search
broker: Redis message broker for background task processing
gotenberg: Document conversion service for Office files and web pages
tika: Text extraction service for various file formats
backup-files: Automated file backups using resticprofile with AWS S3
backup-database: Automated PostgreSQL database dumps

Key Features

OCR Processing: Automatic text extraction from scanned documents
AI Tagging: Machine learning-powered document classification and tagging
Full-Text Search: Fast searching across all document contents
Document Types: Support for PDF, images, Office documents, emails
Web Interface: Modern, responsive web UI for document management
REST API: Full API for integration with other applications
Barcode Support: QR code and barcode recognition for automated filing
Email Integration: Import documents via email
Multi-user: User management with permission controls
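
As a rough illustration of the REST API feature, the sketch below builds (but does not send) a full-text search request. The base URL and token are placeholders. The `/api/documents/` endpoint with a `query` parameter and the `Authorization: Token ...` header are assumptions based on the Paperless-ngx API documentation, so check them against the version you run.

```python
from urllib.parse import urlencode
from urllib.request import Request


def build_search_request(base_url: str, token: str, query: str) -> Request:
    """Build a GET request for a full-text document search (not sent here)."""
    url = f"{base_url.rstrip('/')}/api/documents/?{urlencode({'query': query})}"
    return Request(url, headers={"Authorization": f"Token {token}"})


# Placeholder values; replace with your own domain and API token.
req = build_search_request("https://paperless.example.com", "YOUR_TOKEN", "invoice 2024")
print(req.get_method(), req.full_url)
```

Passing `req` to `urllib.request.urlopen` would send the request; that step is left out so the example runs without a live instance.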

Links & Documentation

Official Website: https://paperless-ngx.com/
GitHub Repository: https://github.com/paperless-ngx/paperless-ngx
Documentation: https://docs.paperless-ngx.com/
Docker Hub: https://hub.docker.com/r/paperlessngx/paperless-ngx
Demo: https://demo.paperless-ngx.com/ (demo/demo)
Community: https://github.com/paperless-ngx/paperless-ngx/discussions

Configuration

Environment Variables

Copy stack.env to stack.env.real and configure:

PAPERLESS_*: Application-specific settings (database, OCR languages, secret key)
TZ: Timezone
TRAEFIK_DOMAIN: Domain for web access
CONSUME_PATH: Directory for automatic document consumption
AWS_*: AWS S3 credentials for backups
SERVICE_DATA_ROOT_PATH: Base path for service data
USERMAP_UID/USERMAP_GID: User/group IDs for file permissions
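
Before deploying, it can help to check that stack.env.real actually defines everything listed here. The following is a minimal sketch, assuming the file uses plain `KEY=VALUE` lines; the helper names `parse_env` and `missing_settings` are hypothetical and not part of the stack.

```python
from pathlib import Path

# Names taken from the variable list in this section. PAPERLESS_* and AWS_*
# are prefixes, so they are checked as "at least one variable with this prefix".
REQUIRED_KEYS = ["TZ", "TRAEFIK_DOMAIN", "CONSUME_PATH",
                 "SERVICE_DATA_ROOT_PATH", "USERMAP_UID", "USERMAP_GID"]
REQUIRED_PREFIXES = ["PAPERLESS_", "AWS_"]


def parse_env(text: str) -> dict[str, str]:
    """Parse simple KEY=VALUE lines, skipping blanks and # comments."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip()
    return values


def missing_settings(values: dict[str, str]) -> list[str]:
    """Return required keys or prefixes that are absent or empty."""
    missing = [k for k in REQUIRED_KEYS if not values.get(k)]
    for prefix in REQUIRED_PREFIXES:
        if not any(k.startswith(prefix) and v for k, v in values.items()):
            missing.append(prefix + "*")
    return missing


if __name__ == "__main__":
    env_file = Path("stack.env.real")
    if env_file.exists():
        print(missing_settings(parse_env(env_file.read_text())) or "all set")
```

The check only confirms that values are present; it does not validate them.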

OCR Languages

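Set PAPERLESS_OCR_LANGUAGE to the language or languages Tesseract should use for recognition (join several codes with +, for example deu+eng). Set PAPERLESS_OCR_LANGUAGES to a space-separated list of additional Tesseract language packs to install in the container when a needed language is not bundled with the image.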
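
The two variables use different separators, which is easy to get wrong. The sketch below assumes, per the Paperless-ngx configuration documentation, that PAPERLESS_OCR_LANGUAGE takes Tesseract codes joined with `+` and PAPERLESS_OCR_LANGUAGES takes a space-separated list of extra language packs to install. The helper `ocr_env` is hypothetical.

```python
def ocr_env(recognition_languages: list[str], extra_packs: list[str]) -> dict[str, str]:
    """Format the two OCR variables from lists of Tesseract language codes."""
    return {
        # Languages used during recognition, joined with "+".
        "PAPERLESS_OCR_LANGUAGE": "+".join(recognition_languages),
        # Extra language packs to install, separated by spaces.
        "PAPERLESS_OCR_LANGUAGES": " ".join(extra_packs),
    }


for key, value in ocr_env(["deu", "eng"], ["tur", "ces"]).items():
    print(f"{key}={value}")
```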

Network Access

Web Interface: Accessible through Traefik at the domain set in TRAEFIK_DOMAIN
Document Consumption: Place documents in the consume directory for automatic processing
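
Feeding the consume directory from a script can be as simple as copying a file into it. This is a minimal sketch: `drop_into_consume` is a hypothetical helper, and the directory passed in is assumed to be the host path configured as CONSUME_PATH.

```python
import shutil
from pathlib import Path


def drop_into_consume(source: Path, consume_dir: Path) -> Path:
    """Copy a document into the consume directory and return the new path."""
    if not consume_dir.is_dir():
        raise FileNotFoundError(f"consume directory not found: {consume_dir}")
    target = consume_dir / source.name
    shutil.copy2(source, target)  # copies contents and file metadata
    return target


# Example call (paths are placeholders):
# drop_into_consume(Path("scan.pdf"), Path("/srv/paperless/consume"))
```

The copied file must be readable by the user given in USERMAP_UID/USERMAP_GID, otherwise the application may not be able to pick it up.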

Document Processing Pipeline

Intake: Documents added via web upload, email, or consume folder
OCR: Text extraction using Tesseract with configured languages
Text Extraction: Additional text processing via Tika for office documents
PDF Generation: Gotenberg converts office documents to searchable PDFs
Classification: AI-powered tagging and document type detection
Storage: Organized storage with full-text search indexing
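
To make the routing between services easier to picture, the sketch below maps a file name to the stages it would pass through. It is a simplified model for illustration, not how Paperless-ngx is implemented: the extension sets and the `pipeline_steps` helper are assumptions, and the real application decides by detected file type.

```python
from pathlib import Path

# Hypothetical extension groups, for illustration only.
OFFICE_SUFFIXES = {".docx", ".xlsx", ".pptx", ".odt", ".ods", ".odp"}
OCR_SUFFIXES = {".pdf", ".png", ".jpg", ".jpeg", ".tiff"}


def pipeline_steps(filename: str) -> list[str]:
    """Return the ordered stages a document passes through in this model."""
    suffix = Path(filename).suffix.lower()
    steps = ["intake"]
    if suffix in OFFICE_SUFFIXES:
        # Office files: Tika extracts the text, Gotenberg renders a PDF.
        steps += ["tika text extraction", "gotenberg pdf generation"]
    elif suffix in OCR_SUFFIXES:
        # Scans and PDFs: Tesseract performs OCR.
        steps.append("tesseract ocr")
    else:
        raise ValueError(f"unsupported file type in this sketch: {suffix!r}")
    steps += ["classification", "storage and indexing"]
    return steps


print(pipeline_steps("scan.pdf"))
print(pipeline_steps("Report.DOCX"))
```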

Backup Strategy

Database: Hourly PostgreSQL dumps with 2-hour retention

Files: Automated S3 backups of documents and media using resticprofile
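
The 2-hour retention rule for database dumps can be expressed as a small pruning check. This sketch only illustrates the rule; the backup-database service handles retention itself, and the dump file names and the `expired_dumps` helper are hypothetical.

```python
from datetime import datetime, timedelta

RETENTION = timedelta(hours=2)


def expired_dumps(dump_times: dict[str, datetime], now: datetime,
                  retention: timedelta = RETENTION) -> list[str]:
    """Return names of dumps older than the retention window, oldest first."""
    cutoff = now - retention
    old = [(ts, name) for name, ts in dump_times.items() if ts < cutoff]
    return [name for _, name in sorted(old)]


now = datetime(2024, 1, 1, 12, 0)
dumps = {
    "dump-0900.sql": datetime(2024, 1, 1, 9, 0),
    "dump-1000.sql": datetime(2024, 1, 1, 10, 0),
    "dump-1100.sql": datetime(2024, 1, 1, 11, 0),
}
print(expired_dumps(dumps, now))
```

With hourly dumps and a 2-hour window, only the two or three most recent dumps exist at any time, so the S3 file backup is what provides longer-term history.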

Dependencies

External Traefik reverse proxy network
AWS S3 bucket for backups
Consume directory for document intake
