
OpenCLIP GPU Performance Analysis & Optimization

Performance Issue Analysis

The slow classification performance with OpenCLIP on GPU is likely due to several factors:

1. Subprocess Overhead

  • Issue: Each image classification starts a new Python subprocess
  • Impact: High process creation overhead (100-500ms per image)
  • Current Architecture: Subprocess per image for dependency isolation
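
The interpreter start-up cost alone can be measured directly. This stdlib-only sketch times a bare Python subprocess that does nothing, which isolates process-creation overhead before any model import even begins:

```python
import subprocess
import sys
import time

# Time a child Python that does nothing: this isolates pure
# process-creation + interpreter start-up cost
start = time.perf_counter()
result = subprocess.run([sys.executable, "-c", "pass"])
elapsed = time.perf_counter() - start
print(f"bare interpreter start-up: {elapsed * 1000:.0f} ms")
```

In the real pipeline, importing torch and open_clip inside the child adds seconds on top of this baseline.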

2. Model Loading Time

  • Issue: OpenCLIP model loads from scratch for each subprocess
  • Model: ViT-B-32 (151M parameters)
  • Load Time: 2-5 seconds per subprocess
  • Impact: Significant delay for first image in each batch

3. Small Batch Sizes

  • Issue: Processing images one at a time
  • GPU Utilization: Low due to small batches
  • Memory Transfer: Frequent CPU↔GPU transfers
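
Back-of-envelope arithmetic shows why batching matters. The numbers below are illustrative assumptions, not measurements:

```python
# Illustrative assumptions (not measurements): a fixed per-call overhead
# (transfers, launch, Python dispatch) vs per-image compute on the GPU
OVERHEAD_S = 0.2
COMPUTE_S = 0.02

def time_per_image(batch_size):
    # The fixed overhead is paid once and amortized over the batch
    return (OVERHEAD_S + batch_size * COMPUTE_S) / batch_size

single = time_per_image(1)    # overhead dominates the total
batched = time_per_image(32)  # overhead amortized across 32 images
```

Under these assumptions a batch of 32 cuts per-image time by roughly 8x, even before the GPU's better utilization on large matmuls is counted.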

4. Text Encoding Overhead

  • Issue: Text labels re-encoded for each image
  • Labels: 27 text labels encoded every time
  • Impact: Unnecessary computation repetition
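
The fix is to encode the label set once and reuse the result. A minimal stdlib sketch of the caching pattern (the embedding here is a stand-in; real code would call model.encode_text on the tokenized labels):

```python
from functools import lru_cache

calls = 0

@lru_cache(maxsize=None)
def encode_text_labels(labels):
    # Stand-in for model.encode_text(tokenizer(labels)); note labels
    # must be a tuple (hashable) for lru_cache to work
    global calls
    calls += 1
    return tuple(len(label) for label in labels)

labels = ("a photo of a bee", "a photo of a flower")
encode_text_labels(labels)
encode_text_labels(labels)  # second call is served from the cache
```

With 27 labels and thousands of images, this turns 27 x N text encodings into 27.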

Performance Optimization Strategies

1. Batch Processing

# Current: One image per subprocess
results = classifier.classify_image(image_path)

# Optimized: Multiple images per subprocess
results = classifier.classify_images_batch([image_path1, image_path2, ...])
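
Batching also needs a way to split a large set of paths into GPU-sized chunks. A small hypothetical helper (the batch size of 32 is an assumption to tune against available GPU memory):

```python
def batched(paths, batch_size=32):
    # Yield consecutive slices of at most batch_size paths
    for i in range(0, len(paths), batch_size):
        yield paths[i:i + batch_size]

groups = list(batched([f"img_{i}.jpg" for i in range(70)], batch_size=32))
```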

2. Persistent Subprocess Service

import torch
import open_clip
from PIL import Image

# Long-running classification service: load the model once, reuse it
class ClassificationService:
    def __init__(self):
        self.model = None
        self.processor = None
        self.text_features = None  # Precomputed once at startup

    def start_service(self):
        # Load model once, keep it resident in GPU memory
        self.model, _, self.processor = open_clip.create_model_and_transforms(
            model_name="ViT-B-32",
            pretrained="laion2b_s34b_b79k",
        )
        self.model = self.model.cuda().eval()

        # Precompute (normalized) text features once
        self.text_features = self.encode_text_labels()

    def classify_batch(self, image_paths):
        # Preprocess and stack all images into a single batch tensor
        batch_tensors = [self.processor(Image.open(p).convert("RGB"))
                         for p in image_paths]
        batch_tensor = torch.stack(batch_tensors).cuda()

        with torch.no_grad():
            image_features = self.model.encode_image(batch_tensor)
            image_features /= image_features.norm(dim=-1, keepdim=True)
            # Reuse precomputed text_features instead of re-encoding per image
            similarities = (100.0 * image_features @ self.text_features.T).softmax(dim=-1)
        return similarities
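
The 100.0 factor in the similarity computation above is CLIP's logit scale: cosine similarities are tightly packed, and without scaling the softmax is nearly uniform. A pure-Python illustration (the similarity values are made up):

```python
import math

def softmax(xs):
    # Numerically stable softmax
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

sims = [0.31, 0.28, 0.22]                   # typical tightly-packed cosines
flat = softmax(sims)                        # nearly uniform probabilities
sharp = softmax([100.0 * s for s in sims])  # scaled: a clear winner emerges
```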

3. Model Optimization

  • Choose the Right Model: ViT-B-32 is already the fastest ViT-B variant; ViT-B-16 uses smaller patches (~4x more image tokens per encode) and is slower, though more accurate
  • Quantization: Use half-precision (FP16) for faster inference
  • Model Caching: Keep model loaded in GPU memory

4. Asynchronous Processing

import asyncio

async def process_images_async(image_paths):
    # Fan out classification tasks and gather results concurrently;
    # this only helps if classify_image_async awaits real I/O or an executor
    tasks = [classifier.classify_image_async(path) for path in image_paths]
    results = await asyncio.gather(*tasks)
    return results
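
A self-contained version with a stubbed classifier (the sleep and result shape are assumptions standing in for real work). Note that asyncio only overlaps work that actually awaits, so GPU-bound calls should be pushed into an executor via loop.run_in_executor:

```python
import asyncio

async def classify_image_async(path):
    # Stand-in for real classification; a real version would await
    # loop.run_in_executor(...) so GPU-bound work doesn't block the loop
    await asyncio.sleep(0.01)
    return {"path": path, "label": "document"}

async def process_images_async(image_paths):
    tasks = [classify_image_async(p) for p in image_paths]
    return await asyncio.gather(*tasks)

results = asyncio.run(process_images_async(["a.jpg", "b.jpg", "c.jpg"]))
```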

Immediate Quick Wins

1. Reduce Text Labels

# Current: 27 labels
text_labels = [
    "a photo of a bee", "a photo of a flower", "a photo of a person",
    "a photo of a document", "a photo of a chart", "a photo of a diagram",
    # ... 21 more labels
]

# Optimized: 10 most relevant labels
text_labels = [
    "a photo of a bee", "a photo of a flower", "a photo of a document",
    "a photo of a chart", "a photo of a diagram", "a photo of a table",
    "a photo of a graph", "a photo of a screenshot", "a photo of a logo",
    "a photo of text"
]

2. Verify the Model Choice

# ViT-B-32 (151M total parameters, patch size 32) is already the
# fastest ViT-B variant: at 224x224 it encodes only 49 image patches
model, _, processor = open_clip.create_model_and_transforms(
    model_name="ViT-B-32",
    pretrained="laion2b_s34b_b79k"
)

# ViT-B-16 encodes 196 patches per image (~4x the compute): more
# accurate, but slower. Avoid switching to it if speed is the priority.

3. Enable FP16

model = model.half().cuda()  # convert weights to half precision
image_tensor = processor(image).unsqueeze(0).half().cuda()  # inputs must match the model dtype
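
Halving precision also halves weight memory; quick arithmetic using the 151M-parameter figure cited above:

```python
params = 151_000_000
fp32_mb = params * 4 / 1e6  # 4 bytes per float32 weight
fp16_mb = params * 2 / 1e6  # 2 bytes per float16 weight
print(fp32_mb, fp16_mb)
```

On modern NVIDIA GPUs the larger win is usually tensor-core throughput on FP16 matmuls, not the memory saving itself.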

Implementation Priority

Phase 1 (Quick Wins - 80% improvement)

  1. Reduce text labels from 27 to 10
  2. Confirm ViT-B-32 is in use (the fastest ViT-B variant)
  3. Enable half-precision (FP16)

Phase 2 (Architecture - 10x improvement)

  1. Implement batch processing
  2. Create persistent classification service
  3. Precompute text features

Phase 3 (Advanced - 20x improvement)

  1. Asynchronous processing
  2. Model quantization
  3. GPU memory optimization

Expected Performance Gains

| Optimization | Time per Image (before) | Time per Image (after) | Cumulative Speedup |
|---|---|---|---|
| Baseline | 3-5 s | 3-5 s | 1x |
| Reduced labels (27 → 10) | 3-5 s | 2-3 s | ~1.5x |
| FP16 | 2-3 s | 1-1.5 s | ~3x |
| Batch processing | 1-1.5 s | 0.1-0.2 s | ~20x |

Start with Phase 1 optimizations which can be implemented quickly:

  1. Update isolated_image_classifier.py: Reduce text labels to 10 most relevant
  2. Confirm the model: Keep ViT-B-32 (the fastest ViT-B variant)
  3. Enable FP16: Use half-precision for faster inference

These changes should provide roughly a 3x speedup with minimal code changes while maintaining good accuracy for document processing.