LightRAG Production System
A production-grade RAG system using LightRAG with GPU-accelerated OCR, multi-database storage, and advanced document processing.
System Overview
This system provides:
- GPU-accelerated OCR using PaddleOCR with RTX 4070 Super
- Multi-database storage (Redis, Neo4j, Qdrant, PostgreSQL)
- Advanced document processing for PDF, images, Office documents
- Snowflake Arctic Embed for embeddings
- DeepSeek API for LLM functionality
- Jina Reranker for search optimization
Quick Start
Prerequisites
- Install Ollama and pull required models:
ollama pull snowflake-arctic-embed2:latest
ollama pull jina-reranker-v2:latest
ollama pull mistral-nemo:latest
- Ensure databases are running:
- Redis (Memurai) on localhost:6379
- Neo4j on bolt://localhost:7687 (user: neo4j, password: jleu1212)
- Qdrant on http://localhost:6333
- PostgreSQL on localhost:5432 (database: rag_anything, user: jleu3482, password: jleu1212)
- Install Python dependencies:
pip install -r requirements.txt
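Before starting the server it can help to confirm that each database port accepts connections. The following is a minimal sketch using only the standard library; the `SERVICES` table and the helper names are illustrative and not part of this repository, and a successful TCP connection only shows that something is listening, not that the credentials are valid.

```python
import socket
from urllib.parse import urlparse

# Endpoints taken from the prerequisites list; the tcp:// scheme is only a
# placeholder so that urlparse can split host and port.
SERVICES = {
    "Redis": "tcp://localhost:6379",
    "Neo4j": "bolt://localhost:7687",
    "Qdrant": "http://localhost:6333",
    "PostgreSQL": "tcp://localhost:5432",
}


def host_port(url):
    """Split a URL into (hostname, port)."""
    parsed = urlparse(url)
    return parsed.hostname, parsed.port


def is_reachable(url, timeout=2.0):
    """Return True if a TCP connection to the endpoint succeeds."""
    host, port = host_port(url)
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


if __name__ == "__main__":
    for name, url in SERVICES.items():
        status = "ok" if is_reachable(url) else "UNREACHABLE"
        print(f"{name:<12}{url:<28}{status}")
```

Run it once after starting the databases; any line marked UNREACHABLE points at a service to check first.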
Running the System
Method 1: Using zrun.bat (Windows)
zrun.bat
Method 2: Manual Start
cd LightRAG-main
python -m lightrag.api.lightrag_server --port 3015 --host 0.0.0.0 --working-dir rag_storage --input-dir ../inputs --key jleu1212 --auto-scan-at-startup --llm-binding openai --embedding-binding ollama --rerank-binding jina
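If you prefer to launch the server from Python rather than a batch file, the same command can be assembled as an argument list. This is a sketch only: the flags are copied from the command above, while `server_command`, `start_server`, and the `LIGHTRAG_API_KEY` environment variable are hypothetical names introduced for illustration.

```python
import os
import subprocess
import sys


def server_command(api_key, port=3015, host="0.0.0.0",
                   working_dir="rag_storage", input_dir="../inputs"):
    """Build the argument list for the manual start command."""
    return [
        sys.executable, "-m", "lightrag.api.lightrag_server",
        "--port", str(port),
        "--host", host,
        "--working-dir", working_dir,
        "--input-dir", input_dir,
        "--key", api_key,
        "--auto-scan-at-startup",
        "--llm-binding", "openai",
        "--embedding-binding", "ollama",
        "--rerank-binding", "jina",
    ]


def start_server():
    """Run the server from the LightRAG-main directory (blocks until exit)."""
    # LIGHTRAG_API_KEY is an assumed variable name, not one the project defines.
    api_key = os.environ["LIGHTRAG_API_KEY"]
    subprocess.run(server_command(api_key), cwd="LightRAG-main", check=True)
```

Passing the arguments as a list avoids shell quoting problems and keeps the key out of the script itself.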
Access the Web UI
Open your browser to: http://localhost:3015/webui/
Login Credentials:
- Username: jleu3482
- Password: jleu1212
Configuration
Environment Variables (.env)
Key configurations:
- LLM: DeepSeek API with GPU acceleration
- Embeddings: Snowflake Arctic Embed 2 (1024 dimensions)
- Reranker: Jina Reranker v2
- Storage: Redis + Neo4j + Qdrant + PostgreSQL
- OCR: PaddleOCR with GPU acceleration
Performance Settings
- Chunk Size: 1200 tokens
- Chunk Overlap: 100 tokens
- Max Parallel Insert: 4 operations
- Max Graph Nodes: 1000 nodes
- GPU Acceleration: Enabled for OCR and embeddings
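To make the chunk size and overlap settings concrete, here is a minimal sliding-window sketch. It operates on a pre-tokenized list and is not LightRAG's own chunker, which may differ in tokenizer and boundary handling; `chunk_tokens` is an illustrative name.

```python
def chunk_tokens(tokens, chunk_size=1200, overlap=100):
    """Split a token list into windows of chunk_size that share overlap tokens."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in [0, chunk_size)")

    stride = chunk_size - overlap  # 1100 tokens with the settings above
    chunks = []
    start = 0
    while start < len(tokens):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the last window already reaches the end of the input
        start += stride
    return chunks
```

With these defaults each new chunk starts 1100 tokens after the previous one, so a 2500-token document yields three chunks whose neighbours share 100 tokens.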
Document Processing
Enhanced Processing Pipeline with Dependency Isolation
The document processing pipeline fully isolates the dependencies of its OCR and image classification components:
Processing Flow:
- Text-First Extraction: All file types attempt text extraction first
- Image Extraction: Extract images from documents (PDF, DOCX, images)
- OCR Processing: PaddleOCR with GPU acceleration for text extraction from images
- Image Classification: OpenCLIP in isolated environment for image content classification
- Metadata Integration: Classification results included in searchable content
- Vector Embedding: Snowflake Arctic Embed 2
- Multi-Database Storage: Neo4j for knowledge graphs + Qdrant for vector search
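The first five steps of the flow can be sketched as a single function. This is an illustration under assumptions, not the repository's implementation: the four callables are hypothetical stand-ins for the text extractor, image extractor, PaddleOCR, and the OpenCLIP classifier, and the bracketed label format is invented for the example.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ProcessedDocument:
    text: str
    image_labels: List[str] = field(default_factory=list)

    def searchable_content(self) -> str:
        """Text plus classification labels, ready for embedding and indexing."""
        labels = "\n".join(
            f"[Image classification: {label}]" for label in self.image_labels
        )
        return f"{self.text}\n{labels}".strip()


def process_document(path, extract_text, extract_images, run_ocr, classify):
    """Run extraction, OCR, and classification in the order listed above."""
    text = extract_text(path)                        # 1. text-first extraction
    images = extract_images(path)                    # 2. image extraction
    ocr_text = [run_ocr(image) for image in images]  # 3. OCR on each image
    labels = [classify(image) for image in images]   # 4. image classification
    parts = [text] + ocr_text
    merged = "\n".join(part for part in parts if part)
    return ProcessedDocument(merged, labels)         # 5. metadata integration
```

The result of `searchable_content()` is what would then be embedded and written to the vector and graph stores.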
Dependency Isolation Architecture:
- PaddleOCR: Runs in main environment with GPU-only operation
- OpenCLIP: Runs in isolated virtual environment (openclip_gpu_env) to avoid dependency conflicts
- Persistent Classifier: Fast GPU classifier with batch processing (16.6x performance improvement)
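One common way to get this kind of isolation is to run the classifier under a different Python interpreter and exchange JSON over stdin and stdout. The sketch below shows that pattern only; it is not the project's classifier. `WORKER_SOURCE` is a stand-in that returns a fixed label, and `python_exe` defaults to the current interpreter so the example runs anywhere. In practice it would point at the interpreter inside openclip_gpu_env, whose exact path is an assumption you would need to supply.

```python
import json
import subprocess
import sys

# Stand-in worker. A real worker would import OpenCLIP here, which is why it
# must run under the isolated environment's interpreter.
WORKER_SOURCE = """
import json, sys
request = json.load(sys.stdin)
json.dump({"labels": ["unclassified"] * len(request["images"])}, sys.stdout)
"""


def classify_isolated(image_paths, python_exe=sys.executable, timeout=120):
    """Send a batch of image paths to a worker process and return its labels."""
    proc = subprocess.run(
        [python_exe, "-c", WORKER_SOURCE],
        input=json.dumps({"images": list(image_paths)}),
        capture_output=True,
        text=True,
        timeout=timeout,
        check=True,
    )
    return json.loads(proc.stdout)["labels"]
```

This version starts one process per batch, so model loading cost is paid once per call. A persistent classifier, as described above, would instead keep the worker alive between batches.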
Supported File Types
- PDF (text-based and scanned): PyMuPDF + PaddleOCR
- Images: JPEG, PNG, BMP, TIFF, GIF, WebP
- Office Documents: DOC, DOCX, PPT, PPTX, XLS, XLSX
- Text Files: TXT, MD
- HTML: BeautifulSoup4
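Routing a file to the right processor usually comes down to a lookup on its extension. The table below is a sketch built from the list above; the handler names and the extra spellings (.jpg, .tif, .htm) are assumptions for illustration.

```python
from pathlib import Path

# Handler name -> file extensions it accepts (lower case, with leading dot).
HANDLERS = {
    "pdf": {".pdf"},
    "image": {".jpg", ".jpeg", ".png", ".bmp", ".tif", ".tiff", ".gif", ".webp"},
    "office": {".doc", ".docx", ".ppt", ".pptx", ".xls", ".xlsx"},
    "text": {".txt", ".md"},
    "html": {".html", ".htm"},
}


def handler_for(filename):
    """Return the handler name for a file, or None if the type is unsupported."""
    suffix = Path(filename).suffix.lower()
    for name, suffixes in HANDLERS.items():
        if suffix in suffixes:
            return name
    return None
```

Returning None for unknown types lets the caller decide whether to skip the file or report it.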
Image Classification Features
- Object Detection: Identifies objects in images (e.g., "a photo of a bee")
- Content Understanding: Classifies document elements (screenshots, charts, diagrams)
- Search Integration: Classification metadata included in indexed content
- High Accuracy: The bee image in the test document is detected with 100% confidence
Test Document
Use test.docx containing a bee image to verify the complete classification workflow:
python test.py
API Endpoints
Health Check
curl http://localhost:3015/api/health
System Status
curl http://localhost:3015/health
Document Upload
curl -X POST -F "file=@document.pdf" http://localhost:3015/api/upload -H "X-API-Key: jleu1212"
Search
curl -X POST http://localhost:3015/api/search -H "Content-Type: application/json" -H "X-API-Key: jleu1212" -d '{"query": "your search query"}'
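The search call can also be made from Python with the standard library. This sketch mirrors the curl command above; the function names are illustrative, and treating the response body as JSON is an assumption about the server.

```python
import json
import urllib.request


def build_search_request(query, api_key, base_url="http://localhost:3015"):
    """Build the POST request for /api/search without sending it."""
    body = json.dumps({"query": query}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/api/search",
        data=body,
        headers={"Content-Type": "application/json", "X-API-Key": api_key},
        method="POST",
    )


def search(query, api_key, base_url="http://localhost:3015", timeout=30):
    """Send the search request and decode the response (assumed to be JSON)."""
    request = build_search_request(query, api_key, base_url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

For example, `search("bee", api_key)` sends the same request as the curl command with the query set to "bee".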
GPU Configuration
The system is optimized for NVIDIA RTX 4070 Super with:
- CUDA 11.8
- cuDNN 8.6.0
- PaddlePaddle GPU version
- PyTorch with CUDA support
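A quick way to see whether both frameworks can use the GPU is to ask each one directly. The sketch below tolerates either package being absent; `gpu_report` is an illustrative name. Note that the PaddlePaddle call reports whether the installed build was compiled with CUDA, not whether a device is currently usable.

```python
def gpu_report():
    """Report CUDA support per framework; None means the package is not installed."""
    report = {"torch_cuda": None, "paddle_cuda": None}

    try:
        import torch
        report["torch_cuda"] = torch.cuda.is_available()
    except ImportError:
        pass

    try:
        import paddle
        report["paddle_cuda"] = paddle.device.is_compiled_with_cuda()
    except ImportError:
        pass

    return report


if __name__ == "__main__":
    for name, value in gpu_report().items():
        print(f"{name}: {value}")
```

If `paddle_cuda` is True but OCR still falls back to CPU, PaddlePaddle's own `paddle.utils.run_check()` gives a more thorough installation check.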
Troubleshooting
Common Issues
- OCR not working: Check CUDA installation and PaddleOCR GPU support
- Database connection errors: Verify all databases are running
- Model loading errors: Ensure Ollama models are downloaded
- GPU memory issues: Reduce batch sizes in configuration
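For model loading errors, one check is to compare the models Ollama reports against the ones this system pulls. The sketch below reads Ollama's `/api/tags` endpoint; it assumes Ollama is listening on its default address, http://localhost:11434, which this document does not state, and the helper names are illustrative.

```python
import json
import urllib.request

# Model names from the prerequisites section.
REQUIRED_MODELS = [
    "snowflake-arctic-embed2:latest",
    "jina-reranker-v2:latest",
    "mistral-nemo:latest",
]


def missing_models(tags_payload, required=REQUIRED_MODELS):
    """Return the required model names absent from an /api/tags payload."""
    installed = {model["name"] for model in tags_payload.get("models", [])}
    return [name for name in required if name not in installed]


def fetch_tags(base_url="http://localhost:11434", timeout=5):
    """Fetch the list of locally available models from Ollama."""
    with urllib.request.urlopen(f"{base_url}/api/tags", timeout=timeout) as response:
        return json.load(response)
```

Calling `missing_models(fetch_tags())` returns an empty list when all three models are present; anything else names the models still to pull.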
Logs
Check lightrag.log for detailed processing logs and error information.
Production Validation
The system has been validated with comprehensive document processing:
Core Functionality
- ✅ OCR PDF upload and indexing
- ✅ Multi-database storage (Redis, Neo4j, Qdrant, PostgreSQL)
- ✅ GPU-accelerated PaddleOCR with RTX 4070 Super
- ✅ DeepSeek API integration with regional restriction fix
- ✅ Snowflake Arctic Embed embeddings (1024 dimensions)
- ✅ Web UI authentication and document management
- ✅ Search functionality with OCR content
Advanced Features
- ✅ Complete dependency isolation between PaddleOCR and OpenCLIP
- ✅ Image classification with OpenCLIP in isolated environment
- ✅ Bee image detection at 100% confidence
- ✅ Classification metadata included in searchable content
- ✅ Persistent GPU classifier with batch processing
- ✅ Text-first extraction for all file types
Test Documents
- Use ocr.pdf to test OCR functionality
- Use test.docx containing bee image to test complete classification workflow:
python test.py
Bee Search Verification
The system successfully indexes and searches for image classification content:
- Search for "bee" returns documents with bee images
- Search for "photo of a bee" returns relevant content
- Image classification metadata is searchable in LightRAG
Performance Optimization
- Enable GPU acceleration for OCR and embeddings
- Use parallel processing for document ingestion
- Optimize chunk sizes for your document types
- Monitor database performance and connection pools
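Parallel ingestion with a bounded number of workers can be sketched with a thread pool. This is an illustration of the idea rather than how LightRAG schedules inserts internally; `ingest_all` and the `ingest_one` callback are hypothetical, and the default of 4 matches the Max Parallel Insert setting.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def ingest_all(paths, ingest_one, max_parallel=4):
    """Ingest documents concurrently; return (results, errors) keyed by path."""
    results, errors = {}, {}
    with ThreadPoolExecutor(max_workers=max_parallel) as pool:
        futures = {pool.submit(ingest_one, path): path for path in paths}
        for future in as_completed(futures):
            path = futures[future]
            try:
                results[path] = future.result()
            except Exception as exc:  # keep going if one document fails
                errors[path] = exc
    return results, errors
```

Collecting failures separately means one unreadable file does not stop the rest of the batch. Threads suit this workload when most of the time is spent waiting on the databases or the GPU.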
Security
- API key authentication for upload and search
- Secure database connections
- Input validation and sanitization
- Rate limiting on API endpoints
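As one example of the API key check, a constant-time comparison avoids leaking how many leading characters of a presented key were correct. This is a generic sketch and not the server's actual authentication code; `api_key_is_valid` is an illustrative name.

```python
import hmac


def api_key_is_valid(presented, expected):
    """Compare the X-API-Key header value to the configured key in constant time."""
    if not presented or not expected:
        return False  # a missing header or an unset key never authenticates
    return hmac.compare_digest(
        presented.encode("utf-8"), expected.encode("utf-8")
    )
```

Rejecting empty values first ensures that a server started without a key does not accept requests that omit the header.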