# Paperless-ngx Stack

A document management system that transforms your physical documents into a searchable online archive. Scan, index, and archive all your documents with powerful OCR and AI-powered organization.

## Services Overview

- **webserver**: Main Paperless-ngx application with web interface and API
- **db**: PostgreSQL database for document metadata
- **broker**: Redis message broker for background task processing
- **gotenberg**: Document conversion service for Office files and web pages
- **tika**: Text extraction service for various file formats
- **backup-files**: Automated file backups to AWS S3 using resticprofile
- **backup-database**: Automated PostgreSQL database dumps

## Key Features

- **OCR Processing**: Automatic text extraction from scanned documents
- **AI Tagging**: Machine learning-powered document classification and tagging
- **Full-Text Search**: Fast searching across all document contents
- **Document Types**: Support for PDF, images, Office documents, emails
- **Web Interface**: Modern, responsive web UI for document management
- **REST API**: Full API for integration with other applications
- **Barcode Support**: QR code and barcode recognition for automated filing
- **Email Integration**: Import documents via email
- **Multi-user**: User management with permission controls

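As an illustration of the REST API feature, the sketch below uploads one file to the `post_document` endpoint using token authentication. `API_BASE_URL` and `API_TOKEN` are hypothetical variable names chosen for this example, not variables defined by this stack; the token is assumed to belong to an existing Paperless-ngx user.

```shell
# Upload one file through the Paperless-ngx REST API.
# API_BASE_URL and API_TOKEN are example names (assumptions), e.g.
#   API_BASE_URL=https://paperless.example.com
upload_document() {
    file=$1
    # --fail makes curl return a non-zero status on HTTP errors.
    curl --fail --silent --show-error \
        -H "Authorization: Token ${API_TOKEN:?API_TOKEN is not set}" \
        -F "document=@${file}" \
        "${API_BASE_URL:?API_BASE_URL is not set}/api/documents/post_document/"
}
```

After exporting both variables, call it as `upload_document ./invoice.pdf`. The uploaded file then goes through the same processing as a document dropped into the consume folder.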
## Links & Documentation

- **Official Website**: https://paperless-ngx.com/
- **GitHub Repository**: https://github.com/paperless-ngx/paperless-ngx
- **Documentation**: https://docs.paperless-ngx.com/
- **Docker Hub**: https://hub.docker.com/r/paperlessngx/paperless-ngx
- **Demo**: https://demo.paperless-ngx.com/ (login: demo/demo)
- **Community**: https://github.com/paperless-ngx/paperless-ngx/discussions

## Configuration

### Environment Variables
Copy `stack.env` to `stack.env.real` and configure:

- `PAPERLESS_*`: Application-specific settings (database, OCR languages, secret key)
- `TZ`: Timezone
- `TRAEFIK_DOMAIN`: Domain for web access
- `CONSUME_PATH`: Directory for automatic document consumption
- `AWS_*`: AWS S3 credentials for backups
- `SERVICE_DATA_ROOT_PATH`: Base path for service data
- `USERMAP_UID` / `USERMAP_GID`: User and group IDs for file permissions

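A filled-in `stack.env.real` might look roughly like the fragment below. Every value is a placeholder. The specific `PAPERLESS_*` and `AWS_*` names shown are assumptions based on common Paperless-ngx and AWS naming; the authoritative list of names is whatever `stack.env` in this stack contains.

```shell
# Created with: cp stack.env stack.env.real
# All values below are placeholders; replace them before deploying.

# Application settings (names assumed; check stack.env for the real list)
PAPERLESS_SECRET_KEY=change-me-to-a-long-random-string
PAPERLESS_DBHOST=db
PAPERLESS_DBPASS=change-me

# Timezone and public hostname
TZ=Europe/Berlin
TRAEFIK_DOMAIN=paperless.example.com

# Host paths
CONSUME_PATH=/srv/paperless/consume
SERVICE_DATA_ROOT_PATH=/srv/docker-data

# S3 credentials for the backup services (standard AWS names, assumed)
AWS_ACCESS_KEY_ID=change-me
AWS_SECRET_ACCESS_KEY=change-me

# Owner of files written by the container
USERMAP_UID=1000
USERMAP_GID=1000
```

Values are left unquoted and free of spaces on purpose, since env-file parsers differ in how they treat quotes.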
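### OCR Languages
Two variables control multi-language OCR. `PAPERLESS_OCR_LANGUAGES` lists additional Tesseract language packs for the container to install, and `PAPERLESS_OCR_LANGUAGE` selects the language or languages actually used when running OCR (join several with `+`).
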
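For example, to add Dutch alongside English, the two settings could be combined as follows. The language choice is purely illustrative; codes are Tesseract language codes.

```shell
# Extra Tesseract language pack for the container to install (example: Dutch).
PAPERLESS_OCR_LANGUAGES=nld
# Languages used for OCR, joined with "+"; each must be installed.
PAPERLESS_OCR_LANGUAGE=nld+eng
```

Any language named in `PAPERLESS_OCR_LANGUAGE` that is not bundled with the image has to appear in `PAPERLESS_OCR_LANGUAGES` as well.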
### Network Access
- **Web Interface**: Served through Traefik at the domain set in `TRAEFIK_DOMAIN`
- **Document Consumption**: Place documents in the consume directory (`CONSUME_PATH`) for automatic processing

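Feeding the consume directory from a script can be as simple as a copy. The helper below is a minimal sketch; `consume_document` is a hypothetical name, and it assumes `CONSUME_PATH` in the calling shell holds the same host path as in `stack.env.real`.

```shell
# Copy one document into the consume directory watched by Paperless-ngx.
consume_document() {
    src=$1
    dest_dir=${CONSUME_PATH:?set CONSUME_PATH to the consume directory}
    # Refuse anything that is not a regular file.
    [ -f "$src" ] || { echo "not a file: $src" >&2; return 1; }
    cp -- "$src" "$dest_dir/"
}
```

Usage: `CONSUME_PATH=/srv/paperless/consume consume_document ./scan.pdf`. The copied file must be readable by the user given in `USERMAP_UID` / `USERMAP_GID`.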
## Document Processing Pipeline

1. **Intake**: Documents arrive via web upload, email, or the consume folder
2. **OCR**: Tesseract extracts text from scanned PDFs and images using the configured languages
3. **Text Extraction**: Tika extracts text from Office documents
4. **PDF Generation**: Gotenberg converts Office documents to searchable PDFs
5. **Classification**: AI-powered tagging and document type detection
6. **Storage**: Organized storage with full-text search indexing

## Backup Strategy

**Database**: Hourly PostgreSQL dumps with 2-hour retention

**Files**: Automated S3 backups of documents and media using resticprofile

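The dump-and-prune idea behind the database backup can be sketched as below. This is not the `backup-database` service's actual implementation, which this README does not show. The function names and the `paperless-*.dump` file naming are hypothetical, `pg_dump` is assumed to pick up its connection from the standard `PGHOST`, `PGUSER`, `PGPASSWORD`, and `PGDATABASE` variables, and `-mmin` / `-delete` are GNU and BSD `find` extensions.

```shell
# Write one timestamped dump into the given directory.
dump_database() {
    dump_dir=$1
    stamp=$(date -u +%Y%m%dT%H%M%SZ)
    pg_dump --format=custom --file="$dump_dir/paperless-$stamp.dump"
}

# Delete dumps older than the retention window (minutes, default 120).
prune_dumps() {
    dump_dir=$1
    retention_minutes=${2:-120}
    find "$dump_dir" -type f -name 'paperless-*.dump' \
        -mmin "+$retention_minutes" -delete
}
```

Run hourly, `dump_database /backups && prune_dumps /backups 120` keeps only the dumps from the last two hours, which is why longer-term history depends on the S3 file backups.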
## Dependencies

- External Traefik reverse proxy network
- AWS S3 bucket for backups
- Consume directory for document intake
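
Before the first deployment, the external network can be created idempotently with a small helper like the one below. The network name `traefik` used in the usage line is an assumption; use whatever name the stack's compose file and your Traefik instance actually share.

```shell
# Create the external Docker network only if it does not exist yet.
ensure_network() {
    name=$1
    docker network inspect "$name" >/dev/null 2>&1 ||
        docker network create "$name"
}
```

Usage: `ensure_network traefik`. Running it again is harmless, because the `create` step is skipped once `inspect` succeeds.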