# OpenCLIP GPU Performance Analysis & Optimization

## Performance Issue Analysis

The slow classification performance with OpenCLIP GPU is likely due to several compounding factors:

### **1. Subprocess Overhead**
- **Issue**: Each image classification starts a new Python subprocess
- **Impact**: High process-creation overhead (100-500 ms per image)
- **Current Architecture**: One subprocess per image, used for dependency isolation

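The bare interpreter-startup share of this overhead (before imports and model loading, which add several seconds more) can be measured with a small stdlib sketch; `measure_startup_overhead` is an illustrative name, not an existing function:

```python
import subprocess
import sys
import time

def measure_startup_overhead(runs=5):
    """Average seconds to start and immediately exit a Python subprocess."""
    start = time.perf_counter()
    for _ in range(runs):
        # A no-op child process: measures pure spawn + interpreter startup cost
        subprocess.run([sys.executable, "-c", "pass"], check=True)
    return (time.perf_counter() - start) / runs

overhead = measure_startup_overhead()
print(f"~{overhead * 1000:.0f} ms per bare interpreter start")
```

Any imports the child performs (torch, open_clip) add on top of this baseline.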
### **2. Model Loading Time**
- **Issue**: The OpenCLIP model loads from scratch in each subprocess
- **Model**: ViT-B-32 (~151M parameters)
- **Load Time**: 2-5 seconds per subprocess
- **Impact**: Under the subprocess-per-image design, every image pays this load cost

### **3. Small Batch Sizes**
- **Issue**: Images are processed one at a time
- **GPU Utilization**: Low, since single-image batches leave most of the GPU idle
- **Memory Transfer**: Frequent CPU↔GPU transfers

### **4. Text Encoding Overhead**
- **Issue**: Text labels are re-encoded for every image
- **Labels**: 27 text prompts encoded each time, even though they never change
- **Impact**: Unnecessary repeated computation

## Performance Optimization Strategies

### **1. Batch Processing**
```python
# Current: one image per subprocess
results = classifier.classify_image(image_path)

# Optimized: multiple images per subprocess
results = classifier.classify_images_batch([image_path1, image_path2, ...])
```

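A batched entry point still needs to split a long path list into GPU-sized batches; a minimal stdlib helper (`chunked` is a name introduced here, not an existing API):

```python
def chunked(items, batch_size):
    """Split items into consecutive batches of at most batch_size elements."""
    if batch_size < 1:
        raise ValueError("batch_size must be >= 1")
    return [items[i:i + batch_size] for i in range(0, len(items), batch_size)]
```

For example, `chunked(image_paths, 32)` yields lists of at most 32 paths, each of which can be stacked into one GPU batch.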
### **2. Persistent Subprocess Service**
```python
import open_clip
import torch
from PIL import Image

# Long-running classification service: load the model once, reuse it
class ClassificationService:
    def __init__(self):
        self.model = None
        self.processor = None
        self.text_features = None  # Precomputed once at startup

    def start_service(self):
        # Load model once, keep it resident in GPU memory
        self.model, _, self.processor = open_clip.create_model_and_transforms(...)
        self.model = self.model.cuda().eval()

        # Precompute (L2-normalized) text features once
        self.text_features = self.encode_text_labels()

    def classify_batch(self, image_paths):
        # Preprocess and stack images, then run a single GPU forward pass
        batch_tensors = [self.processor(Image.open(path)) for path in image_paths]
        batch_tensor = torch.stack(batch_tensors).cuda()

        with torch.no_grad():
            image_features = self.model.encode_image(batch_tensor)
            image_features /= image_features.norm(dim=-1, keepdim=True)
            # Reuse the precomputed text_features
            similarities = (100.0 * image_features @ self.text_features.T).softmax(dim=-1)
        return similarities
```

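The service above still needs a transport between the host process and the long-lived worker. A minimal stdlib-only sketch using JSON lines over stdin/stdout; `PersistentWorker` and the `"stub"` response are illustrative, and a real worker would load the OpenCLIP model once before its request loop:

```python
import json
import subprocess
import sys

# Hypothetical worker script: in a real service the model load goes here,
# once, before the request loop
WORKER_SRC = r'''
import json, sys
for line in sys.stdin:
    req = json.loads(line)
    resp = {"path": req["path"], "label": "stub"}
    sys.stdout.write(json.dumps(resp) + "\n")
    sys.stdout.flush()
'''

class PersistentWorker:
    """One long-lived worker process instead of one subprocess per image."""

    def __init__(self):
        self.proc = subprocess.Popen(
            [sys.executable, "-c", WORKER_SRC],
            stdin=subprocess.PIPE,
            stdout=subprocess.PIPE,
            text=True,
        )

    def classify(self, path):
        # One JSON request line in, one JSON response line out
        self.proc.stdin.write(json.dumps({"path": path}) + "\n")
        self.proc.stdin.flush()
        return json.loads(self.proc.stdout.readline())

    def close(self):
        self.proc.stdin.close()
        self.proc.wait()
```

This keeps the dependency isolation of the current subprocess design while paying the process-creation and model-load costs only once.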
### **3. Model Optimization**
- **Model Choice**: ViT-B-32 is already the faster ViT-B variant; ViT-B-16 trades speed for accuracy (4x more image tokens per image)
- **Half Precision**: Run inference in FP16 for faster inference and lower memory use
- **Model Caching**: Keep the model loaded in GPU memory between requests

### **4. Asynchronous Processing**
```python
import asyncio

async def process_images_async(image_paths):
    # Process multiple images concurrently; classify_image_async must offload
    # its blocking work (e.g., via loop.run_in_executor), otherwise nothing
    # actually overlaps
    tasks = [classifier.classify_image_async(path) for path in image_paths]
    results = await asyncio.gather(*tasks)
    return results
```

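A runnable stdlib-only version of this pattern, with a stand-in `classify_image` where the real (blocking) OpenCLIP call would go:

```python
import asyncio
from concurrent.futures import ThreadPoolExecutor

def classify_image(path):
    # Stand-in for the real blocking OpenCLIP call
    return {"path": path, "label": "stub"}

async def classify_image_async(executor, path):
    loop = asyncio.get_running_loop()
    # Run the blocking classifier in a worker thread so the event loop stays free
    return await loop.run_in_executor(executor, classify_image, path)

async def process_paths_async(paths):
    with ThreadPoolExecutor(max_workers=4) as pool:
        return await asyncio.gather(
            *(classify_image_async(pool, p) for p in paths)
        )

results = asyncio.run(process_paths_async(["a.jpg", "b.jpg"]))
```

`asyncio.gather` preserves input order, so results line up with the path list.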
## Immediate Quick Wins

### **1. Reduce Text Labels**
```python
# Current: 27 labels
text_labels = [
    "a photo of a bee", "a photo of a flower", "a photo of a person",
    "a photo of a document", "a photo of a chart", "a photo of a diagram",
    # ... 21 more labels
]

# Optimized: 10 most relevant labels
text_labels = [
    "a photo of a bee", "a photo of a flower", "a photo of a document",
    "a photo of a chart", "a photo of a diagram", "a photo of a table",
    "a photo of a graph", "a photo of a screenshot", "a photo of a logo",
    "a photo of text"
]
```

### **2. Verify Model Choice**

Note: the suffix in these names is the patch size, not the model size. Both are ViT-B models with a similar parameter count, but ViT-B-16 processes 4x more image tokens per image than ViT-B-32 and is therefore slower, not faster.

```python
# Faster: ViT-B-32 (32x32 patches -> 49 image tokens per 224px image)
model, _, processor = open_clip.create_model_and_transforms(
    model_name="ViT-B-32",
    pretrained="laion2b_s34b_b79k"
)

# Slower but more accurate: ViT-B-16 (16x16 patches -> 196 image tokens);
# check open_clip.list_pretrained() for the tags available per model
model, _, processor = open_clip.create_model_and_transforms(
    model_name="ViT-B-16",
    pretrained="laion2b_s34b_b88k"
)
```

### **3. Enable FP16**
```python
model = model.half().cuda()  # Run the whole model in half precision
image_tensor = processor(image).unsqueeze(0).half().cuda()  # Inputs must match the model dtype
```

## Implementation Priority

### **Phase 1 (Quick Wins - ~80% latency reduction)**
1. Reduce text labels from 27 to 10
2. Confirm the faster ViT-B-32 variant is in use (not ViT-B-16 or larger)
3. Enable half-precision (FP16)

### **Phase 2 (Architecture - 10x improvement)**
1. Implement batch processing
2. Create a persistent classification service
3. Precompute text features

### **Phase 3 (Advanced - 20x improvement)**
1. Asynchronous processing
2. Model quantization
3. GPU memory optimization

## Expected Performance Gains

| Optimization | Time Before | Time After | Cumulative Improvement |
|--------------|-------------|------------|------------------------|
| Baseline | 3-5 seconds/image | 3-5 seconds/image | 1x |
| Reduced Labels | 3-5 seconds | 2-3 seconds | ~1.5x |
| FP16 | 2-3 seconds | 1-1.5 seconds | ~3x |
| Batch Processing | 1-1.5 seconds | 0.1-0.2 seconds | ~20x |

These figures are rough estimates; measure on your own hardware.

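To verify the estimates above, a small stdlib timing harness can compare per-image latency before and after each change; `time_per_item` is an illustrative helper, not an existing API:

```python
import time

def time_per_item(fn, items, warmup=1):
    """Average wall-clock seconds per item for fn(item) over items."""
    for item in items[:warmup]:
        fn(item)  # Warm-up pass (model load, CUDA init) excluded from timing
    start = time.perf_counter()
    for item in items:
        fn(item)
    return (time.perf_counter() - start) / len(items)

# Example: speedup = time_per_item(old_classify, paths) / time_per_item(new_classify, paths)
```

The warm-up pass matters here: without it, the one-time model-load cost would be charged to the first timed image.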
## Recommended Implementation

Start with the **Phase 1 optimizations**, which can be implemented quickly:

1. **Update [`isolated_image_classifier.py`](isolated_image_classifier.py:108)**: Reduce text labels to the 10 most relevant
2. **Verify the model**: Keep ViT-B-32, the faster patch-32 variant
3. **Enable FP16**: Use half-precision for faster inference

These changes should provide a roughly 3-4x performance improvement with minimal code changes while maintaining good accuracy for document processing.