# LightRAG Production System

A production-grade RAG system using LightRAG with GPU-accelerated OCR, multi-database storage, and advanced document processing.

## System Overview

This system provides:
- **GPU-accelerated OCR** using PaddleOCR with RTX 4070 Super
- **Multi-database storage** (Redis, Neo4j, Qdrant, PostgreSQL)
- **Advanced document processing** for PDF, images, Office documents
- **Snowflake Arctic Embed** for embeddings
- **DeepSeek API** for LLM functionality
- **Jina Reranker** for search optimization

## Quick Start

### Prerequisites

1. **Install Ollama** and pull required models:
```bash
ollama pull snowflake-arctic-embed2:latest
ollama pull jina-reranker-v2:latest
ollama pull mistral-nemo:latest
```

2. **Ensure databases are running**:
- Redis (Memurai) on `localhost:6379`
- Neo4j on `bolt://localhost:7687` (user: `neo4j`, password: `jleu1212`)
- Qdrant on `http://localhost:6333`
- PostgreSQL on `localhost:5432` (database: `rag_anything`, user: `jleu3482`, password: `jleu1212`)

3. **Install Python dependencies**:
```bash
pip install -r requirements.txt
```

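
Before starting the server, it can help to confirm that the four database ports from step 2 accept connections. The script below is a minimal sketch, not part of this repository: it only tests TCP reachability on the hosts and ports listed above and does not check credentials. The names `SERVICES`, `is_port_open`, and `check_services` are illustrative.

```python
import socket

# Hosts and ports copied from the prerequisites list; adjust if yours differ.
SERVICES = {
    "Redis": ("localhost", 6379),
    "Neo4j": ("localhost", 7687),
    "Qdrant": ("localhost", 6333),
    "PostgreSQL": ("localhost", 5432),
}


def is_port_open(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name-resolution failures.
        return False


def check_services(services=SERVICES, timeout: float = 1.0) -> dict:
    """Map each service name to True (reachable) or False (unreachable)."""
    return {
        name: is_port_open(host, port, timeout)
        for name, (host, port) in services.items()
    }


if __name__ == "__main__":
    for name, ok in check_services().items():
        print(f"{name:<12} {'reachable' if ok else 'NOT reachable'}")
```

An open port only shows that something is listening. Authentication problems will still surface later in the server logs.
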
### Running the System

#### Method 1: Using zrun.bat (Windows)
```bash
zrun.bat
```

#### Method 2: Manual Start
```bash
cd LightRAG-main
python -m lightrag.api.lightrag_server --port 3015 --host 0.0.0.0 --working-dir rag_storage --input-dir ../inputs --key jleu1212 --auto-scan-at-startup --llm-binding openai --embedding-binding ollama --rerank-binding jina
```

### Access the Web UI

Open your browser to: http://localhost:3015/webui/

**Login Credentials:**
- Username: `jleu3482`
- Password: `jleu1212`

## Configuration

### Environment Variables (`.env`)

Key configurations:
- **LLM**: DeepSeek API with GPU acceleration
- **Embeddings**: Snowflake Arctic Embed 2 (1024 dimensions)
- **Reranker**: Jina Reranker v2
- **Storage**: Redis + Neo4j + Qdrant + PostgreSQL
- **OCR**: PaddleOCR with GPU acceleration

### Performance Settings
- **Chunk Size**: 1200 tokens
- **Chunk Overlap**: 100 tokens
- **Max Parallel Insert**: 4 operations
- **Max Graph Nodes**: 1000 nodes
- **GPU Acceleration**: Enabled for OCR and embeddings

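
To illustrate how the chunk size and overlap interact, here is a minimal sketch of sliding-window chunking over a token list. It is not LightRAG's own chunker: the function name `chunk_tokens` is hypothetical, and the input is assumed to be already tokenized. With a size of 1200 and an overlap of 100, each window starts 1100 tokens after the previous one.

```python
from typing import List

CHUNK_SIZE = 1200    # tokens per chunk (from the settings above)
CHUNK_OVERLAP = 100  # tokens shared between neighbouring chunks


def chunk_tokens(
    tokens: List[str],
    size: int = CHUNK_SIZE,
    overlap: int = CHUNK_OVERLAP,
) -> List[List[str]]:
    """Split tokens into windows of at most `size` tokens.

    Each window starts `size - overlap` tokens after the previous one.
    """
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than size")
    step = size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # this window already reaches the end of the input
    return chunks


if __name__ == "__main__":
    tokens = [f"t{i}" for i in range(3000)]
    print([len(c) for c in chunk_tokens(tokens)])  # prints [1200, 1200, 800]
```

A larger overlap preserves more context across chunk boundaries at the cost of more chunks to embed and store.
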
## Document Processing

### Enhanced Processing Pipeline with Dependency Isolation

The document processing pipeline runs OCR and image classification with complete dependency isolation between the two components.

#### Processing Flow
1. **Text-First Extraction**: All file types attempt text extraction first
2. **Image Extraction**: Extract images from documents (PDF, DOCX, images)
3. **OCR Processing**: PaddleOCR with GPU acceleration for text extraction from images
4. **Image Classification**: OpenCLIP in isolated environment for image content classification
5. **Metadata Integration**: Classification results included in searchable content
6. **Vector Embedding**: Snowflake Arctic Embed 2
7. **Multi-Database Storage**: Neo4j for knowledge graphs + Qdrant for vector search

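
Steps 1 to 5 of the flow can be sketched as a single function that takes the extractors as arguments. This is an illustrative outline only: `ProcessedDocument`, `process_document`, the callback names, and the `[Image: ...]` metadata format are all assumptions, not the project's actual classes. Embedding and storage (steps 6 and 7) are left out.

```python
from dataclasses import dataclass, field
from typing import Callable, Iterable, List


@dataclass
class ProcessedDocument:
    text: str = ""
    image_labels: List[str] = field(default_factory=list)

    def searchable_content(self) -> str:
        """Step 5: append classification labels to the extracted text."""
        if not self.image_labels:
            return self.text
        labels = "\n".join(f"[Image: {label}]" for label in self.image_labels)
        return f"{self.text}\n{labels}".strip()


def process_document(
    path: str,
    extract_text: Callable[[str], str],
    extract_images: Callable[[str], Iterable[bytes]],
    run_ocr: Callable[[bytes], str],
    classify_image: Callable[[bytes], str],
) -> ProcessedDocument:
    doc = ProcessedDocument()
    doc.text = extract_text(path)                        # step 1: text first
    for image in extract_images(path):                   # step 2: embedded images
        ocr_text = run_ocr(image)                        # step 3: OCR on each image
        if ocr_text:
            doc.text = f"{doc.text}\n{ocr_text}".strip()
        doc.image_labels.append(classify_image(image))   # step 4: classification
    return doc


if __name__ == "__main__":
    # Stub callbacks stand in for PyMuPDF, PaddleOCR, and OpenCLIP.
    doc = process_document(
        "test.docx",
        extract_text=lambda p: "Hello",
        extract_images=lambda p: [b"fake-image-bytes"],
        run_ocr=lambda img: "",
        classify_image=lambda img: "a photo of a bee",
    )
    print(doc.searchable_content())
```

Passing the extractors in as callables keeps the flow testable without a GPU, since each stage can be replaced by a stub.
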
#### Dependency Isolation Architecture
- **PaddleOCR**: Runs in main environment with GPU-only operation
- **OpenCLIP**: Runs in isolated virtual environment (`openclip_gpu_env`) to avoid dependency conflicts
- **Persistent Classifier**: Fast GPU classifier with batch processing (16.6x performance improvement)

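
One common way to achieve this kind of isolation is to run the classifier under a different Python interpreter and exchange JSON over stdin and stdout. The sketch below shows that pattern under stated assumptions: `classify_in_isolated_env` and `WORKER_SOURCE` are hypothetical names, the worker returns a placeholder label instead of loading OpenCLIP, and the default interpreter is the current one so the example runs anywhere.

```python
import json
import subprocess
import sys
from typing import List

# Hypothetical worker script: reads a JSON list of image paths on stdin and
# writes a JSON list of results on stdout. A real worker would load OpenCLIP.
WORKER_SOURCE = """
import json, sys
paths = json.load(sys.stdin)
json.dump([{"path": p, "label": "placeholder"} for p in paths], sys.stdout)
"""


def classify_in_isolated_env(
    image_paths: List[str],
    python_exe: str = sys.executable,
    timeout: float = 60.0,
) -> List[dict]:
    """Run the worker under `python_exe` and return its parsed JSON output."""
    result = subprocess.run(
        [python_exe, "-c", WORKER_SOURCE],
        input=json.dumps(image_paths),
        capture_output=True,
        text=True,
        timeout=timeout,
        check=True,  # raise CalledProcessError if the worker exits non-zero
    )
    return json.loads(result.stdout)


if __name__ == "__main__":
    print(classify_in_isolated_env(["bee.png"]))
```

To target the isolated environment, `python_exe` would point at the interpreter inside `openclip_gpu_env` (in a standard Windows venv, `Scripts\python.exe`). This sketch starts a new process per call. A persistent classifier would instead keep one worker alive and send it batches, which avoids reloading the model each time.
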
#### Supported File Types
- **PDF** (text-based and scanned): PyMuPDF + PaddleOCR
- **Images**: JPEG, PNG, BMP, TIFF, GIF, WebP
- **Office Documents**: DOC, DOCX, PPT, PPTX, XLS, XLSX
- **Text Files**: TXT, MD
- **HTML**: BeautifulSoup4

### Image Classification Features
- **Object Detection**: Identifies objects in images (e.g., "a photo of a bee")
- **Content Understanding**: Classifies document elements (screenshots, charts, diagrams)
- **Search Integration**: Classification metadata included in indexed content
- **High Accuracy**: The bee image in the test document is classified with 100% confidence

#### Test Document
Use `test.docx`, which contains a bee image, to verify the complete classification workflow:
```bash
python test.py
```

## API Endpoints

### Health Check
```bash
curl http://localhost:3015/api/health
```

### System Status
```bash
curl http://localhost:3015/health
```

### Document Upload
```bash
curl -X POST -F "file=@document.pdf" http://localhost:3015/api/upload -H "X-API-Key: jleu1212"
```

### Search
```bash
curl -X POST http://localhost:3015/api/search -H "Content-Type: application/json" -H "X-API-Key: jleu1212" -d '{"query": "your search query"}'
```

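
The search call can also be made from Python using only the standard library. This sketch mirrors the curl example above, with the same URL, headers, and JSON body. The function names `build_search_request` and `search` are illustrative, and the response is assumed to be a JSON object, which this README does not specify.

```python
import json
import urllib.request

BASE_URL = "http://localhost:3015"
API_KEY = "jleu1212"  # same key as in the curl examples


def build_search_request(
    query: str,
    base_url: str = BASE_URL,
    api_key: str = API_KEY,
) -> urllib.request.Request:
    """Build the POST request without sending it."""
    body = json.dumps({"query": query}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/api/search",
        data=body,
        headers={"Content-Type": "application/json", "X-API-Key": api_key},
        method="POST",
    )


def search(query: str, timeout: float = 30.0) -> dict:
    """Send the request; assumes the server answers with a JSON object."""
    request = build_search_request(query)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

With the server running, `search("bee")` returns the parsed response. Separating request construction from sending makes the headers and body easy to inspect without a live server.
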
## GPU Configuration

The system is optimized for NVIDIA RTX 4070 Super with:
- CUDA 11.8
- cuDNN 8.6.0
- PaddlePaddle GPU version
- PyTorch with CUDA support

## Troubleshooting

### Common Issues

1. **OCR not working**: Check CUDA installation and PaddleOCR GPU support
2. **Database connection errors**: Verify all databases are running
3. **Model loading errors**: Ensure Ollama models are downloaded
4. **GPU memory issues**: Reduce batch sizes in configuration

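
For the first issue, a quick way to see whether the installed frameworks can use the GPU is to ask them directly. The helper below is a sketch with a hypothetical name, `gpu_report`. It records an error entry instead of failing when PaddlePaddle or PyTorch is missing, so it can be run in any environment.

```python
def gpu_report() -> dict:
    """Collect basic GPU visibility information from PaddlePaddle and PyTorch."""
    report = {}
    try:
        import paddle

        # True only if the installed wheel is the CUDA (GPU) build.
        report["paddle_cuda_build"] = bool(paddle.device.is_compiled_with_cuda())
        report["paddle_gpu_count"] = paddle.device.cuda.device_count()
    except Exception as exc:  # missing package or broken CUDA setup
        report["paddle_error"] = f"{type(exc).__name__}: {exc}"
    try:
        import torch

        report["torch_cuda_available"] = bool(torch.cuda.is_available())
    except Exception as exc:
        report["torch_error"] = f"{type(exc).__name__}: {exc}"
    return report


if __name__ == "__main__":
    for key, value in gpu_report().items():
        print(f"{key}: {value}")
```

If `paddle_cuda_build` is `False`, the CPU-only PaddlePaddle package is installed and OCR cannot use the GPU regardless of the CUDA setup.
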
### Logs
Check `lightrag.log` for detailed processing logs and error information.

## Production Validation

The system has been validated with comprehensive document processing:

### Core Functionality
- ✅ OCR PDF upload and indexing
- ✅ Multi-database storage (Redis, Neo4j, Qdrant, PostgreSQL)
- ✅ GPU-accelerated PaddleOCR with RTX 4070 Super
- ✅ DeepSeek API integration with regional restriction fix
- ✅ Snowflake Arctic Embed embeddings (1024 dimensions)
- ✅ Web UI authentication and document management
- ✅ Search functionality with OCR content

### Advanced Features
- ✅ Complete dependency isolation between PaddleOCR and OpenCLIP
- ✅ Image classification with OpenCLIP in isolated environment
- ✅ Bee image detection at 100% confidence
- ✅ Classification metadata included in searchable content
- ✅ Persistent GPU classifier with batch processing
- ✅ Text-first extraction for all file types

### Test Documents
- Use `ocr.pdf` to test OCR functionality
- Use `test.docx`, which contains a bee image, to test the complete classification workflow:
```bash
python test.py
```

### Bee Search Verification
The system successfully indexes and searches for image classification content:
- Search for "bee" returns documents with bee images
- Search for "photo of a bee" returns relevant content
- Image classification metadata is searchable in LightRAG

## Performance Optimization

- Enable GPU acceleration for OCR and embeddings
- Use parallel processing for document ingestion
- Optimize chunk sizes for your document types
- Monitor database performance and connection pools

## Security

- API key authentication for upload and search
- Secure database connections
- Input validation and sanitization
- Rate limiting on API endpoints

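
As an illustration of the API key check, the sketch below compares a supplied `X-API-Key` header against the expected key in constant time. It is a generic pattern, not the LightRAG server's own implementation, and `is_authorized` is a hypothetical name.

```python
import hmac
from typing import Mapping


def is_authorized(
    headers: Mapping[str, str],
    expected_key: str,
    header_name: str = "X-API-Key",
) -> bool:
    """Return True if the request headers carry the expected API key."""
    supplied = None
    for name, value in headers.items():
        # HTTP header names are case-insensitive.
        if name.lower() == header_name.lower():
            supplied = value
            break
    if supplied is None:
        return False
    # compare_digest avoids leaking key contents through timing differences.
    return hmac.compare_digest(
        supplied.encode("utf-8"), expected_key.encode("utf-8")
    )


if __name__ == "__main__":
    print(is_authorized({"x-api-key": "secret"}, "secret"))  # prints True
    print(is_authorized({}, "secret"))                       # prints False
```

In a deployment, the expected key would come from configuration rather than source code.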