# Production-Grade RAG System with LightRAG

A high-performance Retrieval-Augmented Generation (RAG) system built on the LightRAG framework, with multi-format document processing, GPU-accelerated OCR, and production-grade database storage.

## System Architecture

### Storage Backends
- **KV Storage**: Redis (Memurai) - `redis://localhost:6379`
- **Graph Storage**: Neo4j - `bolt://localhost:7687` (username: `neo4j`, password: `jleu1212`)
- **Vector Storage**: Qdrant - `http://localhost:6333/`
- **Document Status Storage**: PostgreSQL - `postgresql://jleu3482:jleu1212@localhost:5432/rag_anything`

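As a quick pre-flight check, the four endpoints above can be probed with a plain TCP connect. The following is a minimal sketch using only the standard library; the helper names `endpoint_of` and `is_reachable` are illustrative and not part of LightRAG, and a successful connect only shows that the port is open, not that the credentials are accepted.

```python
import socket
from urllib.parse import urlsplit

# Connection URLs as listed above (credentials omitted; only host and port matter here).
BACKENDS = {
    "redis": "redis://localhost:6379",
    "neo4j": "bolt://localhost:7687",
    "qdrant": "http://localhost:6333/",
    "postgres": "postgresql://localhost:5432/rag_anything",
}


def endpoint_of(url: str) -> tuple:
    """Extract (host, port) from a connection URL that names an explicit port."""
    parts = urlsplit(url)
    if parts.hostname is None or parts.port is None:
        raise ValueError(f"URL needs an explicit host and port: {url}")
    return parts.hostname, parts.port


def is_reachable(url: str, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to the URL's host and port succeeds."""
    try:
        with socket.create_connection(endpoint_of(url), timeout=timeout):
            return True
    except OSError:
        return False


def report() -> None:
    """Print one status line per backend."""
    for name, url in BACKENDS.items():
        print(f"{name}: {'up' if is_reachable(url) else 'DOWN'}")
```

Calling `report()` before starting the server gives a one-line status per backend.
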
### AI Model Configuration
- **LLM**: DeepSeek API (key read from the `DEEPSEEK_API_KEY` environment variable)
- **Embeddings**: Jina AI
- **Entity Extraction**: spaCy models for fast indexing
- **Reranker**: Disabled for performance optimization
- **OCR**: PaddleOCR with GPU acceleration

### Performance Settings
- **Multi-core processing**: Parallel document processing
- **GPU acceleration**: NVIDIA RTX 4070 Super
- **Chunking**: 1200 tokens with 100 token overlap
- **Max graph nodes**: 1000
- **Max parallel insert**: 4 concurrent operations

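To make the chunking setting concrete, here is a minimal sketch of a sliding window with a size of 1200 tokens and an overlap of 100. It works on an already tokenized list; the function name `chunk_tokens` is hypothetical and this is not LightRAG's own chunker, which also handles tokenization.

```python
from typing import List, Sequence


def chunk_tokens(tokens: Sequence[str], size: int = 1200, overlap: int = 100) -> List[List[str]]:
    """Split tokens into windows of `size` that share `overlap` tokens with the previous window."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    step = size - overlap  # with the defaults, each window starts 1100 tokens after the last
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(list(tokens[start:start + size]))
        if start + size >= len(tokens):
            break  # this window already reaches the end of the input
    return chunks
```

With these defaults, a 2500-token document yields three chunks of 1200, 1200, and 300 tokens, and the last 100 tokens of each chunk reappear at the start of the next.
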
## Installation

### Prerequisites
- Python 3.9+
- Redis (Memurai) running on port 6379
- Neo4j running on `bolt://localhost:7687`
- Qdrant running on `http://localhost:6333`
- PostgreSQL running on port 5432 with database `rag_anything`

### Install Dependencies
```bash
cd LightRAG-main
pip install -r requirements.txt
```

### Download spaCy Model
```bash
python -m spacy download en_core_web_lg
```

### Environment Setup
Set the following environment variables (Windows Command Prompt syntax; in bash, use `export NAME=value` instead of `set`):
```bat
REM For Jina embeddings (optional)
set JINA_API_KEY=your_jina_api_key_here

REM For DeepSeek LLM
set DEEPSEEK_API_KEY=your_deepseek_api_key_here

REM For Windows terminal encoding
set PYTHONIOENCODING=utf-8
```

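A script that depends on these variables can check them before doing any work. The sketch below assumes only the two variable names shown above; the function name `load_api_keys` and the returned dictionary layout are illustrative.

```python
import os
from typing import Mapping, Optional


def load_api_keys(env: Optional[Mapping[str, str]] = None) -> dict:
    """Read the API keys named above, failing early if the required one is missing."""
    env = os.environ if env is None else env
    deepseek_key = env.get("DEEPSEEK_API_KEY", "").strip()
    if not deepseek_key:
        raise RuntimeError("DEEPSEEK_API_KEY is not set")
    return {
        "deepseek": deepseek_key,
        # The Jina key is marked optional above, so a missing value becomes None.
        "jina": env.get("JINA_API_KEY", "").strip() or None,
    }
```

Passing a plain dictionary instead of `os.environ` makes the check easy to exercise in tests.
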
## Running the System

### Start the Server
```bash
python -m lightrag.api.lightrag_server --port 3015 --working-dir rag_storage --input-dir inputs --key jleu1212 --auto-scan-at-startup --llm-binding openai --embedding-binding jina --rerank-binding null
```

### Server Parameters
- `--port 3015`: Web server port
- `--working-dir rag_storage`: Storage directory for processed data
- `--input-dir inputs`: Directory for document uploads
- `--key jleu1212`: Authentication key
- `--auto-scan-at-startup`: Automatically process documents in the input directory at startup
- `--llm-binding openai`: Use the OpenAI-compatible API binding (for DeepSeek)
- `--embedding-binding jina`: Use Jina embeddings
- `--rerank-binding null`: Disable the reranker for performance

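When the server is launched from a script, the same flags can be assembled as an argument list so the key is not typed into the command line by hand. This is a sketch: the function name `build_server_command` is hypothetical, and only the flags listed above are used.

```python
import sys
from typing import List


def build_server_command(key: str, port: int = 3015) -> List[str]:
    """Assemble the argument list for the server start command shown above."""
    return [
        sys.executable, "-m", "lightrag.api.lightrag_server",
        "--port", str(port),
        "--working-dir", "rag_storage",
        "--input-dir", "inputs",
        "--key", key,
        "--auto-scan-at-startup",
        "--llm-binding", "openai",
        "--embedding-binding", "jina",
        "--rerank-binding", "null",
    ]
```

The result can be passed directly to `subprocess.run(..., check=True)`; using a list avoids shell quoting problems.
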
## Document Processing Capabilities

### Supported Formats
- **DOCX/DOC**: python-docx + custom table parser
- **XLSX/XLS**: pandas + openpyxl
- **PDF (Text-based)**: pymupdf (fitz)
- **PDF (Image-based)**: paddleocr + layout detection
- **PPTX/PPT**: python-pptx
- **Images**: MobileOne classification → PaddleOCR (fast filter, OCR only when needed)
- **TXT/CSV**: Direct read
- **HTML**: beautifulsoup4

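One way to organize this is a lookup from file extension to parser family. The sketch below is an assumption about structure, not the project's actual dispatcher; the names `PARSER_BY_EXTENSION` and `pick_parser` are hypothetical, and the image extensions are the ones this README lists for test documents.

```python
from pathlib import Path

# Extension -> parser label, following the list of supported formats above.
# PDFs and images are resolved at runtime (text layer vs. scan), so they map to a family only.
PARSER_BY_EXTENSION = {
    ".docx": "python-docx", ".doc": "python-docx",
    ".xlsx": "pandas", ".xls": "pandas",
    ".pdf": "pdf",
    ".pptx": "python-pptx", ".ppt": "python-pptx",
    ".jpg": "image", ".png": "image",
    ".txt": "direct", ".csv": "direct",
    ".html": "beautifulsoup4",
}


def pick_parser(path: str) -> str:
    """Return the parser label for a file, based on its lowercased extension."""
    suffix = Path(path).suffix.lower()
    try:
        return PARSER_BY_EXTENSION[suffix]
    except KeyError:
        raise ValueError(f"unsupported file type: {suffix or path}") from None
```
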
### OCR Processing
- Uses PaddleOCR with GPU acceleration
- MobileOne-S1 model for image classification
- Automatic detection of scanned pages and images
- Fast filtering to only OCR when necessary

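The filter-then-OCR idea can be sketched independently of the models involved. In the example below the classifier and the OCR engine are passed in as plain callables, so no PaddleOCR or MobileOne API is assumed; the function and parameter names are hypothetical.

```python
from typing import Callable, Iterable, List, Tuple


def ocr_when_needed(
    images: Iterable[bytes],
    contains_text: Callable[[bytes], bool],
    run_ocr: Callable[[bytes], str],
) -> List[Tuple[int, str]]:
    """Run the cheap classifier on every image and the expensive OCR only on positives."""
    results = []
    for index, image in enumerate(images):
        if contains_text(image):  # fast filter (the MobileOne-S1 classifier in this system)
            results.append((index, run_ocr(image)))  # slow path (PaddleOCR in this system)
    return results
```

Images the classifier rejects never reach the OCR step, which is where the time saving comes from.
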
## API Endpoints

### Authentication
```http
POST /login
Content-Type: application/x-www-form-urlencoded

username=admin&password=jleu1212
```

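The login request must be form-encoded rather than JSON. A minimal sketch of building it with the standard library follows; the function name `build_login_request` is hypothetical, and the shape of the response (including where the token appears) is not specified in this README, so it is not parsed here.

```python
import urllib.parse
import urllib.request


def build_login_request(base_url: str, username: str, password: str) -> urllib.request.Request:
    """Build the form-encoded POST /login request described above."""
    body = urllib.parse.urlencode({"username": username, "password": password}).encode("ascii")
    return urllib.request.Request(
        f"{base_url.rstrip('/')}/login",
        data=body,
        headers={"Content-Type": "application/x-www-form-urlencoded"},
        method="POST",
    )
```

Send it with `urllib.request.urlopen(request)` once the server is running.
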
### Document Management
```http
POST /upload
Content-Type: multipart/form-data
Authorization: Bearer {token}

POST /documents/status
GET /documents
```

### Search
```http
POST /search
Content-Type: application/json
Authorization: Bearer {token}

{
  "query": "search query",
  "top_k": 10
}
```

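The search call can be built the same way, this time with a JSON body and a bearer token. This is a sketch using only the fields shown above; `build_search_request` is a hypothetical helper, and the response format is not documented here.

```python
import json
import urllib.request


def build_search_request(base_url: str, token: str, query: str, top_k: int = 10) -> urllib.request.Request:
    """Build the JSON POST /search request with a bearer token."""
    payload = json.dumps({"query": query, "top_k": top_k}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url.rstrip('/')}/search",
        data=payload,
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {token}",
        },
        method="POST",
    )
```
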
### System Health
```http
GET /health
```

## Web UI

The system includes a web interface running on port 3015. Access it at:
```
http://localhost:3015
```

### Web UI Features
- Document upload and management
- Real-time search interface
- Document processing status
- System health monitoring
- Authentication-protected access

## Testing

### Test Script
Run the comprehensive test script:
```bash
python test_lightrag_webui.py
```

### Test Documents
Place test documents in the `test_documents/` directory:
- Text files (.txt)
- PDF documents (.pdf)
- Word documents (.docx)
- Excel files (.xlsx)
- PowerPoint files (.pptx)
- Images with text (.jpg, .png)

### Health Check
```bash
curl http://localhost:3015/health
```

## Performance Optimization

### GPU Configuration
- Uses NVIDIA RTX 4070 Super for OCR and model inference
- CUDA device: `cuda:0`
- Batch processing for optimal throughput

### Memory Management
- Parallel processing with configurable worker count
- Chunked document processing
- Efficient entity extraction with spaCy

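A bounded worker count keeps memory use predictable because only a fixed number of documents are in flight at once. The sketch below shows one common way to do this with an `asyncio.Semaphore`; it is not taken from the project's code, `process_all` and `handler` are hypothetical names, and the default of 4 mirrors the "Max parallel insert" setting listed earlier.

```python
import asyncio
from typing import Awaitable, Callable, Iterable, List


async def process_all(
    documents: Iterable[str],
    handler: Callable[[str], Awaitable[str]],
    max_parallel: int = 4,
) -> List[str]:
    """Run handler over all documents with at most max_parallel running at once."""
    limiter = asyncio.Semaphore(max_parallel)

    async def guarded(doc: str) -> str:
        async with limiter:  # waits here while max_parallel handlers are already active
            return await handler(doc)

    # gather returns results in the same order as the inputs
    return list(await asyncio.gather(*(guarded(doc) for doc in documents)))
```

Raising `max_parallel` trades memory for throughput; the right value depends on document size and available RAM.
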
### Database Optimization
- Connection pooling for all databases
- Batch insert operations
- Indexed queries for fast retrieval

## Troubleshooting

### Common Issues

1. **Authentication 401 Errors**
   - Ensure login uses form data, not JSON
   - Check that the password matches the `--key` parameter

2. **OCR Initialization Failures**
   - Verify that PaddleOCR is version 3.3.0 or later
   - Check GPU availability and CUDA installation

3. **Database Connection Issues**
   - Verify all databases are running
   - Check connection strings in `config.ini`

4. **Unicode Encoding Problems**
   - Set the `PYTHONIOENCODING=utf-8` environment variable

### Logs and Monitoring
- Server logs are written to the console
- Document processing status is available via the API
- The `/health` endpoint can be polled for monitoring

## Configuration

Edit `config.ini` for custom settings:
- Database connection strings
- Performance parameters
- OCR and processing settings
- Server configuration

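The actual layout of `config.ini` is not shown in this README, so the following is only a sketch of how such a file could be read with `configparser`. The section names (`server`, `processing`) and option names are hypothetical; the fallback values are the ones quoted elsewhere in this document.

```python
import configparser

# Hypothetical layout: section and option names are illustrative,
# not the actual schema of this project's config.ini.
SAMPLE = """
[server]
port = 3015

[processing]
chunk_size = 1200
chunk_overlap = 100
max_parallel_insert = 4
"""


def load_settings(text: str) -> dict:
    """Parse INI text, falling back to the documented defaults for missing entries."""
    parser = configparser.ConfigParser()
    parser.read_string(text)
    return {
        "port": parser.getint("server", "port", fallback=3015),
        "chunk_size": parser.getint("processing", "chunk_size", fallback=1200),
        "chunk_overlap": parser.getint("processing", "chunk_overlap", fallback=100),
        "max_parallel_insert": parser.getint("processing", "max_parallel_insert", fallback=4),
    }
```

To read from disk instead, replace `read_string(text)` with `parser.read("config.ini")`.
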
## License

This project is built on the LightRAG framework. See individual component licenses for details.