OpenCLIP GPU Performance Analysis & Optimization
Performance Issue Analysis
The slow classification performance with OpenCLIP GPU is likely due to several factors:
1. Subprocess Overhead
- Issue: Each image classification starts a new Python subprocess
- Impact: High process creation overhead (100-500ms per image)
- Current Architecture: Subprocess per image for dependency isolation
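That floor can be measured directly. A minimal sketch (bare interpreter only; a real worker also pays for importing torch and initializing CUDA, which is where most of the 100-500ms goes):

```python
import subprocess
import sys
import time

def measure_spawn_overhead(runs: int = 5) -> float:
    # Average wall-clock cost of starting a bare Python subprocess
    start = time.perf_counter()
    for _ in range(runs):
        subprocess.run([sys.executable, "-c", "pass"], check=True)
    return (time.perf_counter() - start) / runs

print(f"average subprocess startup: {measure_spawn_overhead() * 1000:.1f} ms")
```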
2. Model Loading Time
- Issue: OpenCLIP model loads from scratch for each subprocess
- Model: ViT-B-32 (151M parameters)
- Load Time: 2-5 seconds per subprocess
- Impact: Significant delay for first image in each batch
3. Small Batch Sizes
- Issue: Processing images one at a time
- GPU Utilization: Low due to small batches
- Memory Transfer: Frequent CPU↔GPU transfers
4. Text Encoding Overhead
- Issue: Text labels re-encoded for each image
- Labels: 27 text labels encoded every time
- Impact: Unnecessary computation repetition
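The waste scales with the image count. A toy model (encode_text below is a dummy stand-in for the real CLIP text encoder) counts encoder calls with and without caching:

```python
# Dummy stand-in for the CLIP text encoder; counts how often it runs.
ENCODE_CALLS = 0

def encode_text(label):
    global ENCODE_CALLS
    ENCODE_CALLS += 1
    return [float(ord(c)) for c in label[:4]]  # fake feature vector

labels = [f"a photo of class {i}" for i in range(27)]

# Current: re-encode every label for every image -> 27 * 100 calls
for _ in range(100):
    features = [encode_text(label) for label in labels]
per_image_calls = ENCODE_CALLS

# Optimized: encode once, reuse the cached features -> 27 calls
ENCODE_CALLS = 0
cached = [encode_text(label) for label in labels]
for _ in range(100):
    features = cached

print(per_image_calls, ENCODE_CALLS)  # 2700 vs 27
```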
Performance Optimization Strategies
1. Batch Processing
```python
# Current: one image per subprocess call
results = classifier.classify_image(image_path)

# Optimized: multiple images per subprocess call
results = classifier.classify_images_batch([image_path1, image_path2, ...])
```
2. Persistent Subprocess Service
```python
import open_clip
import torch
from PIL import Image

# Long-running classification service: load the model once, reuse it
class ClassificationService:
    def __init__(self):
        self.model = None
        self.text_features = None  # precomputed once at startup

    def start_service(self):
        # Load model once, keep in memory
        self.model, _, self.processor = open_clip.create_model_and_transforms(...)
        self.model = self.model.cuda()
        # Precompute text features once
        self.text_features = self.encode_text_labels()

    def classify_batch(self, image_paths):
        # Process multiple images in one GPU call
        batch_tensors = [self.processor(Image.open(path)) for path in image_paths]
        batch_tensor = torch.stack(batch_tensors).cuda()
        with torch.no_grad():
            image_features = self.model.encode_image(batch_tensor)
            # CLIP similarity assumes L2-normalized features
            image_features = image_features / image_features.norm(dim=-1, keepdim=True)
            # Use precomputed text_features
            similarities = (100.0 * image_features @ self.text_features.T).softmax(dim=-1)
        return similarities
```
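The similarity step above can be sketched in pure Python: scaled cosine similarity between an image feature and each precomputed text feature, softmaxed over the labels. The feature values below are made up for illustration; in the service they come from the CLIP encoders.

```python
import math

def normalize(v):
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

# Made-up feature vectors standing in for CLIP encoder outputs
image_feature = normalize([0.2, 0.9, 0.1])
text_features = [normalize(t) for t in ([0.1, 1.0, 0.0], [1.0, 0.0, 0.3])]

# 100.0 is CLIP's usual logit scale
logits = [100.0 * sum(a * b for a, b in zip(image_feature, t))
          for t in text_features]
probs = softmax(logits)
print(probs)  # probabilities over the text labels, summing to 1
```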
3. Model Optimization
- Model Choice: note that ViT-B-32 (32px patches, 49 image tokens) is already faster than ViT-B-16 (16px patches, 196 tokens); switching to ViT-B-16 trades speed for accuracy, not the reverse
- Quantization: Use half-precision (FP16) for faster inference
- Model Caching: Keep model loaded in GPU memory
4. Asynchronous Processing
```python
import asyncio

async def process_images_async(image_paths):
    # Process multiple images concurrently
    # (assumes an async classify_image_async method is added to the classifier)
    tasks = [classifier.classify_image_async(path) for path in image_paths]
    results = await asyncio.gather(*tasks)
    return results
```
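A toy demonstration of the concurrency win when the per-image call is awaitable (fake_classify below is a hypothetical stand-in that sleeps instead of classifying): three 0.1s calls complete in roughly 0.1s of wall time instead of 0.3s.

```python
import asyncio
import time

# Hypothetical stand-in for an async classify call: sleeps instead of
# doing real preprocessing/GPU work.
async def fake_classify(path):
    await asyncio.sleep(0.1)
    return f"label-for-{path}"

async def main():
    start = time.perf_counter()
    results = await asyncio.gather(
        *(fake_classify(p) for p in ["a.jpg", "b.jpg", "c.jpg"]))
    return results, time.perf_counter() - start

results, elapsed = asyncio.run(main())
print(results, f"{elapsed:.2f}s")
```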
Immediate Quick Wins
1. Reduce Text Labels
```python
# Current: 27 labels
text_labels = [
    "a photo of a bee", "a photo of a flower", "a photo of a person",
    "a photo of a document", "a photo of a chart", "a photo of a diagram",
    # ... 21 more labels
]

# Optimized: 10 most relevant labels
text_labels = [
    "a photo of a bee", "a photo of a flower", "a photo of a document",
    "a photo of a chart", "a photo of a diagram", "a photo of a table",
    "a photo of a graph", "a photo of a screenshot", "a photo of a logo",
    "a photo of text"
]
```
2. Model Choice Caveat
Switching to ViT-B-16 was considered, but it would be slower, not faster: its 16px patches produce 196 image tokens per 224px image versus 49 for ViT-B-32, roughly 4x the vision-transformer compute. (The ~86M-parameter figure counts only the vision tower, which is similar for both.) Keep ViT-B-32:

```python
# ViT-B-32: 32px patches, 49 tokens per 224px image -- the faster variant
model, _, processor = open_clip.create_model_and_transforms(
    model_name="ViT-B-32",
    pretrained="laion2b_s34b_b79k"
)
```
3. Enable FP16
```python
model = model.half().cuda()  # half-precision weights
image_tensor = processor(image).unsqueeze(0).half().cuda()
# Note: any precomputed text features must also be .half() so dtypes match
```
Implementation Priority
Phase 1 (Quick Wins - 80% improvement)
- Reduce text labels from 27 to 10
- Keep ViT-B-32 rather than switching to ViT-B-16, which is slower
- Enable half-precision (FP16)
Phase 2 (Architecture - 10x improvement)
- Implement batch processing
- Create persistent classification service
- Precompute text features
Phase 3 (Advanced - 20x improvement)
- Asynchronous processing
- Model quantization
- GPU memory optimization
Expected Performance Gains
| Optimization (cumulative) | Est. Time per Image | Speedup vs Baseline |
|---|---|---|
| Baseline (subprocess per image) | 3-5 s | 1x |
| Reduced labels | 2-3 s | ~1.5x |
| + FP16 | 1-2 s | ~3x |
| + Batch processing, persistent service | 0.1-0.2 s | ~20x |
Recommended Implementation
Start with the Phase 1 optimizations, which can be implemented quickly:
- Update isolated_image_classifier.py: reduce the text labels to the 10 most relevant
- Keep ViT-B-32: switching to ViT-B-16 would slow inference (4x more image tokens), not speed it up
- Enable FP16: use half-precision for faster inference
These changes should provide roughly a 3x performance improvement with minimal code changes while maintaining good accuracy for document processing.