OCR Performance Root Cause Analysis
Executive Summary
The OCR processing in the RailSeek system is not inherently slow - individual OCR operations take approximately 0.22 seconds per image when using GPU acceleration. However, the overall document processing pipeline suffers from architectural bottlenecks that cause 504 Gateway Timeout errors and perceived slowness.
Key Findings
1. OCR Speed is Actually Good
- PaddleOCR with GPU: ~0.22s per image (tested with simple_ocr_processor.py)
- Batch processing potential: Multiple images can be processed in parallel
- GPU utilization: Properly configured with CUDA 11.8
2. Root Causes of Performance Issues
A. Lock Contention in PGDocStatusStorage
- Issue: Database lock contention during document status updates
- Impact: Sequential processing of multi-page documents causes bottlenecks
- Evidence: Logs show 504 timeouts during concurrent document processing
B. Lack of Batch Processing
- Issue: Images are processed sequentially even when multiple images are available
- Impact: Missed opportunity for parallel GPU utilization
- Current: Each image processed individually with model reload overhead
- Optimal: Batch processing of 4-8 images simultaneously
C. Sequential Processing Pipeline
- Issue: Document processing follows a linear pipeline:
  1. PDF extraction → 2. Image conversion → 3. OCR processing → 4. Text processing
- Impact: No parallelism between stages
- Solution: Async pipeline with parallel stages
D. Model Loading Overhead
- Issue: PaddleOCR model loaded per subprocess
- Impact: ~2-3 seconds overhead per document
- Solution: Shared model instance across processes
E. Storage Layer Bottlenecks
- Issue: Multiple storage systems (PostgreSQL, Redis, Qdrant, Neo4j) with synchronous writes
- Impact: Each OCR result triggers multiple synchronous database writes
- Evidence: Logs show sequential storage initialization taking 20+ seconds
Performance Metrics
Individual OCR Timing (from tests)
Single image OCR: 0.22s (GPU)
PDF extraction: 1.5s per page
Image classification: 0.8s per image
Storage writes: 0.5s per chunk
Total per page: ~3.0s
System Bottlenecks
Document processing pipeline: 30+ seconds for 10-page document
504 Timeout threshold: 60 seconds
Lock contention delay: 5-10 seconds per concurrent operation
Optimization Recommendations
1. Immediate Fixes (High Impact)
A. Implement Batch OCR Processing
```python
# Current: sequential processing, one model call per image
for image in images:
    result = ocr.ocr(image)

# Optimized: batch processing (ocr_batch is an illustrative batched API)
batch_results = ocr.ocr_batch(images, batch_size=4)
```
B. Optimize Lock Usage in PGDocStatusStorage
- Implement read-write locks instead of exclusive locks
- Use copy-on-write pattern for status updates
- Add lock timeout with retry mechanism
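The timeout-plus-retry idea can be sketched in-process; the same pattern maps to PostgreSQL's `lock_timeout` setting at the PGDocStatusStorage layer. The names here (`acquire_with_retry`, `status_lock`) are illustrative, not existing code:

```python
import time
import threading

def acquire_with_retry(lock, retries=3, timeout=0.5):
    """Try to take a lock with a bounded wait, retrying with exponential
    backoff instead of blocking indefinitely. The equivalent at the
    database layer is SET lock_timeout plus a retry loop."""
    for attempt in range(retries):
        if lock.acquire(timeout=timeout):
            return True
        time.sleep(0.05 * 2 ** attempt)  # back off before retrying
    return False

# Usage: status updates contend on a per-document lock
status_lock = threading.Lock()
if acquire_with_retry(status_lock):
    try:
        pass  # perform the status update here
    finally:
        status_lock.release()
```

The bounded wait is what prevents one slow writer from stalling every concurrent document past the 60-second gateway timeout.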
C. Async Processing Pipeline
```python
import asyncio

async def process_document_async(pdf_path):
    # Kick off OCR for each page as soon as it is extracted,
    # assuming extract_images is an async generator yielding page images
    ocr_tasks = []
    async for image in extract_images(pdf_path):
        ocr_tasks.append(asyncio.create_task(process_image_async(image)))
    return await asyncio.gather(*ocr_tasks)
```
2. Medium-term Improvements
A. Shared Model Instance
- Create singleton PaddleOCR instance
- Warm up model on server startup
- Implement model pooling for multiple workers
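A minimal sketch of the shared instance, with the loader injectable so the caching behavior is testable in isolation; the `PaddleOCR(...)` constructor arguments are illustrative and version-dependent:

```python
import functools

def _load_model():
    # The heavy load happens here exactly once per process.
    # Constructor arguments are assumptions; check the installed version.
    from paddleocr import PaddleOCR
    return PaddleOCR(use_angle_cls=True, lang="en")

@functools.lru_cache(maxsize=1)
def get_ocr_model(loader=_load_model):
    """Return the process-wide OCR model, loading it on first use."""
    return loader()

def warm_up():
    # Call at server startup so the first request does not pay the load cost
    get_ocr_model()
```

`lru_cache(maxsize=1)` gives the singleton behavior without global mutable state; model pooling for multiple workers would wrap this in a per-worker registry.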
B. Storage Optimization
- Batch database writes
- Implement write-behind caching
- Use connection pooling with increased limits
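Write batching can be sketched as a small buffer that flushes in one transaction. sqlite3 stands in here for the real PostgreSQL connection (psycopg2 uses `%s` placeholders instead of `?`); the `chunks` table and `BatchedWriter` name are hypothetical:

```python
import sqlite3

class BatchedWriter:
    """Buffer OCR chunk writes and flush them in one transaction,
    replacing one round-trip per chunk with one per batch."""

    def __init__(self, conn, batch_size=50):
        self.conn = conn
        self.batch_size = batch_size
        self.buffer = []

    def add(self, doc_id, chunk_text):
        self.buffer.append((doc_id, chunk_text))
        if len(self.buffer) >= self.batch_size:
            self.flush()

    def flush(self):
        if not self.buffer:
            return
        with self.conn:  # single transaction for the whole batch
            self.conn.executemany(
                "INSERT INTO chunks (doc_id, text) VALUES (?, ?)",
                self.buffer,
            )
        self.buffer.clear()
```

A write-behind cache would call `flush()` from a background task on a timer rather than synchronously at the batch boundary.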
C. GPU Memory Management
- Monitor GPU memory usage
- Implement automatic batch size adjustment
- Add GPU memory cleanup between documents
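Automatic batch-size adjustment can be sketched framework-agnostically: halve the batch on an out-of-memory error and retry the same images. The real exception type depends on the framework; `RuntimeError` here is a placeholder, and `run_batch` is an assumed callable wrapping the batched OCR call:

```python
def ocr_with_adaptive_batch(images, run_batch, start_batch_size=8, min_batch_size=1):
    """Run OCR in batches, halving the batch size whenever the GPU
    runs out of memory, until min_batch_size is reached."""
    results = []
    batch_size = start_batch_size
    i = 0
    while i < len(images):
        batch = images[i:i + batch_size]
        try:
            results.extend(run_batch(batch))
            i += len(batch)
        except RuntimeError:
            if batch_size <= min_batch_size:
                raise  # even a single image does not fit; give up
            batch_size //= 2  # retry the same images with a smaller batch
    return results
```

Starting high and backing off keeps GPU utilization near its ceiling for small images while still completing documents with unusually large pages.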
3. Long-term Architecture Changes
A. Microservices Architecture
```
OCR Service (GPU optimized)
├── Batch processing endpoint
├── Async result streaming
└── Health monitoring

Document Processing Service
├── Pipeline orchestration
├── Error handling
└── Progress tracking
```
B. Queue-based Processing
- Use Redis or RabbitMQ for job queues
- Implement worker pools with GPU affinity
- Add priority queues for different document types
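The priority-queue idea can be sketched in-process with `heapq`; a Redis sorted set (`ZADD`/`ZPOPMIN`) gives equivalent semantics across distributed workers. The priority values and the `DocumentQueue` name are illustrative:

```python
import heapq
import itertools

class DocumentQueue:
    """Priority job queue: lower priority number is served first
    (e.g. 0 = interactive upload, 5 = bulk re-index)."""

    def __init__(self):
        self._heap = []
        self._counter = itertools.count()  # FIFO tie-break within a priority

    def push(self, doc_id, priority):
        heapq.heappush(self._heap, (priority, next(self._counter), doc_id))

    def pop(self):
        priority, _, doc_id = heapq.heappop(self._heap)
        return doc_id

    def __len__(self):
        return len(self._heap)
```

The monotonic counter keeps documents of equal priority in arrival order, so bulk jobs cannot starve each other even while interactive uploads jump the queue.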
C. Monitoring and Alerting
- Real-time performance metrics
- Automatic scaling based on queue length
- Alerting for GPU memory issues
Implementation Priority
Phase 1 (Week 1)
- Implement batch OCR processing in optimized_document_processor.py
- Add async/await to I/O operations
- Fix lock contention in PGDocStatusStorage
Phase 2 (Week 2)
- Create shared model instance
- Implement connection pooling
- Add performance monitoring
Phase 3 (Week 3)
- Implement queue-based processing
- Add automatic scaling
- Comprehensive testing and optimization
Expected Performance Improvements
| Optimization | Expected Improvement | Impact |
|---|---|---|
| Batch processing | 4x faster (4 images/batch) | High |
| Async pipeline | 2x faster (parallel stages) | High |
| Lock optimization | 50% reduction in contention | Medium |
| Shared model | 30% reduction in overhead | Medium |
| Storage batching | 40% faster writes | Medium |
Total expected improvement: 6-8x faster document processing
Monitoring Metrics to Track
- OCR Processing Time: Target < 0.5s per image
- Document Processing Time: Target < 10s for 10-page document
- GPU Utilization: Target > 70% during processing
- Lock Wait Time: Target < 100ms
- Queue Length: Target < 5 pending documents
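These targets can be checked mechanically against collected metrics. A minimal sketch (metric names are illustrative; GPU utilization is omitted since it is a floor rather than a ceiling):

```python
# Target ceilings from the list above; keys are assumed metric names
TARGETS = {
    "ocr_seconds_per_image": 0.5,
    "doc_seconds_10_pages": 10.0,
    "lock_wait_ms": 100.0,
    "queue_length": 5,
}

def breached_targets(metrics):
    """Return the subset of metrics exceeding their target ceiling,
    suitable for feeding into an alerting rule."""
    return {name: value for name, value in metrics.items()
            if name in TARGETS and value > TARGETS[name]}
```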
Conclusion
The OCR processing itself is not the bottleneck - it's the system architecture around it. By implementing batch processing, async operations, and optimizing storage layers, the system can achieve 6-8x performance improvement while maintaining accuracy and reliability.
The key insight is that individual OCR operations are fast, but the sequential processing pipeline and lock contention create the perception of slowness and cause 504 timeout errors.