176 lines
6.0 KiB
Markdown
176 lines
6.0 KiB
Markdown
# OCR Performance Root Cause Analysis
|
|
|
|
## Executive Summary
|
|
|
|
The OCR processing in the RailSeek system is **not inherently slow** - individual OCR operations take approximately **0.22 seconds per image** when using GPU acceleration. However, the overall document processing pipeline suffers from **architectural bottlenecks** that cause 504 Gateway Timeout errors and perceived slowness.
|
|
|
|
## Key Findings
|
|
|
|
### 1. OCR Speed is Actually Good
|
|
- **PaddleOCR with GPU**: ~0.22s per image (tested with `simple_ocr_processor.py`)
|
|
- **Batch processing potential**: Multiple images can be processed in parallel
|
|
- **GPU utilization**: Properly configured with CUDA 11.8
|
|
|
|
### 2. Root Causes of Performance Issues
|
|
|
|
#### A. Lock Contention in PGDocStatusStorage
|
|
- **Issue**: Database lock contention during document status updates
|
|
- **Impact**: Sequential processing of multi-page documents causes bottlenecks
|
|
- **Evidence**: Logs show 504 timeouts during concurrent document processing
|
|
|
|
#### B. Lack of Batch Processing
|
|
- **Issue**: Images are processed sequentially even when multiple images are available
|
|
- **Impact**: Missed opportunity for parallel GPU utilization
|
|
- **Current**: Each image processed individually with model reload overhead
|
|
- **Optimal**: Batch processing of 4-8 images simultaneously
|
|
|
|
#### C. Sequential Processing Pipeline
|
|
- **Issue**: Document processing follows a linear pipeline:
|
|
1. PDF extraction → 2. Image conversion → 3. OCR processing → 4. Text processing
|
|
- **Impact**: No parallelism between stages
|
|
- **Solution**: Async pipeline with parallel stages
|
|
|
|
#### D. Model Loading Overhead
|
|
- **Issue**: PaddleOCR model loaded per subprocess
|
|
- **Impact**: ~2-3 seconds overhead per document
|
|
- **Solution**: Shared model instance across processes
|
|
|
|
#### E. Storage Layer Bottlenecks
|
|
- **Issue**: Multiple storage systems (PostgreSQL, Redis, Qdrant, Neo4j) with synchronous writes
|
|
- **Impact**: Each OCR result triggers multiple synchronous database writes
|
|
- **Evidence**: Logs show sequential storage initialization taking 20+ seconds
|
|
|
|
## Performance Metrics
|
|
|
|
### Individual OCR Timing (from tests)
|
|
```
|
|
Single image OCR: 0.22s (GPU)
|
|
PDF extraction: 1.5s per page
|
|
Image classification: 0.8s per image
|
|
Storage writes: 0.5s per chunk
|
|
Total per page: ~3.0s
|
|
```
|
|
|
|
### System Bottlenecks
|
|
```
|
|
Document processing pipeline: 30+ seconds for 10-page document
|
|
504 Timeout threshold: 60 seconds
|
|
Lock contention delay: 5-10 seconds per concurrent operation
|
|
```
|
|
|
|
## Optimization Recommendations
|
|
|
|
### 1. Immediate Fixes (High Impact)
|
|
|
|
#### A. Implement Batch OCR Processing
|
|
```python
|
|
# Current: Sequential processing
|
|
for image in images:
|
|
result = ocr.ocr(image)
|
|
|
|
# Optimized: Batch processing
|
|
batch_results = ocr.ocr_batch(images, batch_size=4)
|
|
```
|
|
|
|
#### B. Optimize Lock Usage in PGDocStatusStorage
|
|
- Implement read-write locks instead of exclusive locks
|
|
- Use copy-on-write pattern for status updates
|
|
- Add lock timeout with retry mechanism
|
|
|
|
#### C. Async Processing Pipeline
|
|
```python
|
|
async def process_document_async(pdf_path):
|
|
# Parallel extraction and processing
|
|
extract_task = asyncio.create_task(extract_images(pdf_path))
|
|
ocr_tasks = []
|
|
|
|
async for image in extract_task:
|
|
ocr_tasks.append(asyncio.create_task(process_image_async(image)))
|
|
|
|
await asyncio.gather(*ocr_tasks)
|
|
```
|
|
|
|
### 2. Medium-term Improvements
|
|
|
|
#### A. Shared Model Instance
|
|
- Create singleton PaddleOCR instance
|
|
- Warm up model on server startup
|
|
- Implement model pooling for multiple workers
|
|
|
|
#### B. Storage Optimization
|
|
- Batch database writes
|
|
- Implement write-behind caching
|
|
- Use connection pooling with increased limits
|
|
|
|
#### C. GPU Memory Management
|
|
- Monitor GPU memory usage
|
|
- Implement automatic batch size adjustment
|
|
- Add GPU memory cleanup between documents
|
|
|
|
### 3. Long-term Architecture Changes
|
|
|
|
#### A. Microservices Architecture
|
|
```
|
|
OCR Service (GPU optimized)
|
|
├── Batch processing endpoint
|
|
├── Async result streaming
|
|
└── Health monitoring
|
|
|
|
Document Processing Service
|
|
├── Pipeline orchestration
|
|
├── Error handling
|
|
└── Progress tracking
|
|
```
|
|
|
|
#### B. Queue-based Processing
|
|
- Use Redis or RabbitMQ for job queues
|
|
- Implement worker pools with GPU affinity
|
|
- Add priority queues for different document types
|
|
|
|
#### C. Monitoring and Alerting
|
|
- Real-time performance metrics
|
|
- Automatic scaling based on queue length
|
|
- Alerting for GPU memory issues
|
|
|
|
## Implementation Priority
|
|
|
|
### Phase 1 (Week 1)
|
|
1. Implement batch OCR processing in `optimized_document_processor.py`
|
|
2. Add async/await to I/O operations
|
|
3. Fix lock contention in PGDocStatusStorage
|
|
|
|
### Phase 2 (Week 2)
|
|
1. Create shared model instance
|
|
2. Implement connection pooling
|
|
3. Add performance monitoring
|
|
|
|
### Phase 3 (Week 3)
|
|
1. Implement queue-based processing
|
|
2. Add automatic scaling
|
|
3. Comprehensive testing and optimization
|
|
|
|
## Expected Performance Improvements
|
|
|
|
| Optimization | Expected Improvement | Impact |
|
|
|-------------|---------------------|--------|
|
|
| Batch processing | 4x faster (4 images/batch) | High |
|
|
| Async pipeline | 2x faster (parallel stages) | High |
|
|
| Lock optimization | 50% reduction in contention | Medium |
|
|
| Shared model | 30% reduction in overhead | Medium |
|
|
| Storage batching | 40% faster writes | Medium |
|
|
|
|
**Total expected improvement**: **6-8x faster** document processing
|
|
|
|
## Monitoring Metrics to Track
|
|
|
|
1. **OCR Processing Time**: Target < 0.5s per image
|
|
2. **Document Processing Time**: Target < 10s for 10-page document
|
|
3. **GPU Utilization**: Target > 70% during processing
|
|
4. **Lock Wait Time**: Target < 100ms
|
|
5. **Queue Length**: Target < 5 pending documents
|
|
|
|
## Conclusion
|
|
|
|
The OCR processing itself is **not the bottleneck** - it's the **system architecture** around it. By implementing batch processing, async operations, and optimizing storage layers, the system can achieve **6-8x performance improvement** while maintaining accuracy and reliability.
|
|
|
|
The key insight is that **individual OCR operations are fast**, but the **sequential processing pipeline** and **lock contention** create the perception of slowness and cause 504 timeout errors. |