# Production-Grade RAG System with LightRAG
A high-performance Retrieval-Augmented Generation (RAG) system built on the LightRAG framework, with multi-format document processing, GPU-accelerated OCR, and production-grade database storage.
## System Architecture
### Storage Backends
- **KV Storage**: Redis (Memurai) - `redis://localhost:6379`
- **Graph Storage**: Neo4j - `bolt://localhost:7687` (username: `neo4j`, password: `jleu1212`)
- **Vector Storage**: Qdrant - `http://localhost:6333/`
- **Document Status Storage**: PostgreSQL - `postgresql://jleu3482:jleu1212@localhost:5432/rag_anything`
### AI Model Configuration
- **LLM**: DeepSeek API (API key supplied through the `DEEPSEEK_API_KEY` environment variable)
- **Embeddings**: Jina AI
- **Entity Extraction**: spaCy models for fast indexing
- **Reranker**: Disabled for performance optimization
- **OCR**: PaddleOCR with GPU acceleration
### Performance Settings
- **Multi-core processing**: Parallel document processing
- **GPU acceleration**: NVIDIA RTX 4070 Super
- **Chunking**: 1200 tokens with 100 token overlap
- **Max graph nodes**: 1000
- **Max parallel insert**: 4 concurrent operations
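
The chunking setting above (1200 tokens with a 100-token overlap) can be illustrated with a small sliding-window sketch. This is not LightRAG's own chunker: the function name `chunk_tokens` is hypothetical, and whitespace splitting stands in for a real tokenizer.

```python
def chunk_tokens(tokens, chunk_size=1200, overlap=100):
    """Split a token list into windows of chunk_size that share `overlap` tokens."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - overlap  # with the defaults, each window starts 1100 tokens later
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # this window already reaches the end of the document
    return chunks


# Assumption: whitespace splitting approximates tokenization for the demo.
tokens = ("word " * 3000).split()
chunks = chunk_tokens(tokens)
print([len(c) for c in chunks])  # [1200, 1200, 800]
```

Each chunk repeats the last 100 tokens of the previous one, so a sentence that straddles a boundary still appears whole in at least one chunk.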
## Installation
### Prerequisites
- Python 3.9+
- Redis (Memurai) running on port 6379
- Neo4j running on bolt://localhost:7687
- Qdrant running on http://localhost:6333
- PostgreSQL running on port 5432 with database `rag_anything`
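
Before installing, it can help to confirm that the four services are actually listening. The sketch below uses only the standard library and checks TCP reachability, not authentication or database contents. The helper names are hypothetical, and the URLs are the ones listed in this README with credentials left out.

```python
import socket
from urllib.parse import urlsplit

# Endpoints as listed in this README; credentials are intentionally omitted.
SERVICES = {
    "Redis (Memurai)": "redis://localhost:6379",
    "Neo4j": "bolt://localhost:7687",
    "Qdrant": "http://localhost:6333",
    "PostgreSQL": "postgresql://localhost:5432/rag_anything",
}


def host_and_port(url):
    """Extract (host, port) from a connection URL."""
    parts = urlsplit(url)
    return parts.hostname, parts.port


def is_listening(host, port, timeout=1.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


if __name__ == "__main__":
    for name, url in SERVICES.items():
        host, port = host_and_port(url)
        state = "up" if is_listening(host, port) else "DOWN"
        print(f"{name:16} {host}:{port} {state}")
```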
### Install Dependencies
```bash
cd LightRAG-main
pip install -r requirements.txt
```
### Download spaCy Model
```bash
python -m spacy download en_core_web_lg
```
### Environment Setup
Set the following environment variables (Windows `cmd` syntax; on Linux or macOS use `export NAME=value` instead):
```bat
REM For Jina embeddings (optional)
set JINA_API_KEY=your_jina_api_key_here
REM For DeepSeek LLM
set DEEPSEEK_API_KEY=your_deepseek_api_key_here
REM For Windows terminal encoding
set PYTHONIOENCODING=utf-8
```
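
A quick way to confirm the variables are visible to Python before starting the server is sketched below. The helper `missing_vars` is hypothetical; the split into required and optional follows the comments in the block above.

```python
import os

REQUIRED = ["DEEPSEEK_API_KEY"]                  # needed for the DeepSeek LLM
OPTIONAL = ["JINA_API_KEY", "PYTHONIOENCODING"]  # optional per this README


def missing_vars(names, env=None):
    """Return the names that are unset or empty in the given environment."""
    env = os.environ if env is None else env
    return [name for name in names if not env.get(name)]


if __name__ == "__main__":
    for name in missing_vars(REQUIRED):
        print(f"missing required variable: {name}")
    for name in missing_vars(OPTIONAL):
        print(f"optional variable not set: {name}")
```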
## Running the System
### Start the Server
```bash
python -m lightrag.api.lightrag_server --port 3015 --working-dir rag_storage --input-dir inputs --key jleu1212 --auto-scan-at-startup --llm-binding openai --embedding-binding jina --rerank-binding null
```
### Server Parameters
- `--port 3015`: Web server port
- `--working-dir rag_storage`: Storage directory for processed data
- `--input-dir inputs`: Directory for document uploads
- `--key jleu1212`: Authentication key
- `--auto-scan-at-startup`: Automatically process documents in input directory
- `--llm-binding openai`: Use OpenAI-compatible API (DeepSeek)
- `--embedding-binding jina`: Use Jina embeddings
- `--rerank-binding null`: Disable reranker for performance
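
For scripted launches, the same flags can be assembled as an argument list, which avoids shell quoting problems. This is a sketch rather than part of the project: `build_server_command` is a hypothetical helper, and `LIGHTRAG_API_KEY` is a hypothetical variable name used here so the key does not have to be typed on the command line.

```python
import os
import subprocess
import sys


def build_server_command(api_key, port=3015, working_dir="rag_storage", input_dir="inputs"):
    """Assemble the server command documented above as an argument list."""
    return [
        sys.executable, "-m", "lightrag.api.lightrag_server",
        "--port", str(port),
        "--working-dir", working_dir,
        "--input-dir", input_dir,
        "--key", api_key,
        "--auto-scan-at-startup",
        "--llm-binding", "openai",
        "--embedding-binding", "jina",
        "--rerank-binding", "null",
    ]


if __name__ == "__main__":
    # LIGHTRAG_API_KEY is an assumed name for this sketch, not a documented setting.
    key = os.environ.get("LIGHTRAG_API_KEY")
    if key:
        subprocess.run(build_server_command(key), check=True)
    else:
        print("Set LIGHTRAG_API_KEY before launching the server.")
```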
## Document Processing Capabilities
### Supported Formats
- **DOCX/DOC**: python-docx + custom table parser
- **XLSX/XLS**: pandas + openpyxl
- **PDF (Text-based)**: pymupdf (fitz)
- **PDF (Image-based)**: paddleocr + layout detection
- **PPTX/PPT**: python-pptx
- **Images**: MobileOne classification → PaddleOCR (fast filter; OCR runs only when needed)
- **TXT/CSV**: Direct read
- **HTML**: beautifulsoup4
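
The format list above amounts to a lookup from file extension to parser. The sketch below shows that idea only; the table and `parser_for` are hypothetical and do not reflect how the project routes files internally. The image extensions are the two named under Test Documents.

```python
from pathlib import Path

# Extension -> parser label, copied from the list above (illustrative only).
PARSERS = {
    ".docx": "python-docx + custom table parser",
    ".doc": "python-docx + custom table parser",
    ".xlsx": "pandas + openpyxl",
    ".xls": "pandas + openpyxl",
    ".pdf": "pymupdf (text-based) or paddleocr (image-based)",
    ".pptx": "python-pptx",
    ".ppt": "python-pptx",
    ".jpg": "MobileOne classification, then PaddleOCR",
    ".png": "MobileOne classification, then PaddleOCR",
    ".txt": "direct read",
    ".csv": "direct read",
    ".html": "beautifulsoup4",
}


def parser_for(path):
    """Return the parser label for a file, or None if the format is not listed."""
    return PARSERS.get(Path(path).suffix.lower())


print(parser_for("inputs/Report.XLSX"))  # pandas + openpyxl
```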
### OCR Processing
- Uses PaddleOCR with GPU acceleration
- MobileOne-S1 model for image classification
- Automatic detection of scanned pages and images
- Fast filtering to only OCR when necessary
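
The README does not state how scanned pages are detected. One common heuristic, shown here purely as an assumption, is to treat a page as scanned when its extracted text layer is nearly empty. The function names and the 50-character threshold are hypothetical, and the MobileOne-S1 image classification step is not reproduced.

```python
def needs_ocr(extracted_text, min_chars=50):
    """Return True when a page's text layer has fewer than min_chars visible characters."""
    visible = "".join(extracted_text.split())  # drop all whitespace
    return len(visible) < min_chars


def pages_to_ocr(page_texts, min_chars=50):
    """Return the indices of pages that should be sent to OCR."""
    return [i for i, text in enumerate(page_texts) if needs_ocr(text, min_chars)]


# Page 0 has no text layer, page 1 is a normal text page, page 2 is whitespace only.
print(pages_to_ocr(["", "x" * 200, "  \n "]))  # [0, 2]
```

Skipping OCR for pages that already carry a text layer is what keeps the fast path fast: only pages 0 and 2 in the example would reach the GPU.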
## API Endpoints
### Authentication
```http
POST /login
Content-Type: application/x-www-form-urlencoded

username=admin&password=jleu1212
```
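
A minimal Python client for the login call, using only the standard library, might look like the sketch below. The function names are hypothetical, and the `access_token` response field is an assumption: the README does not show the response body, so adjust the key if the server returns something else.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

BASE_URL = "http://localhost:3015"


def build_login_request(username, password, base_url=BASE_URL):
    """Build the form-encoded POST /login request (form data, not JSON)."""
    body = urlencode({"username": username, "password": password}).encode("ascii")
    return Request(
        f"{base_url}/login",
        data=body,
        headers={"Content-Type": "application/x-www-form-urlencoded"},
        method="POST",
    )


def login(username, password):
    """Send the login request and return the bearer token."""
    with urlopen(build_login_request(username, password), timeout=10) as resp:
        payload = json.load(resp)
    # Assumption: the token is returned under "access_token".
    return payload["access_token"]
```

With the server running, `token = login("admin", "<your key>")` yields the value to place in the `Authorization: Bearer {token}` header of later requests.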
### Document Management
```http
POST /upload
Content-Type: multipart/form-data
Authorization: Bearer {token}

POST /documents/status

GET /documents
```
### Search
```http
POST /search
Content-Type: application/json
Authorization: Bearer {token}

{
  "query": "search query",
  "top_k": 10
}
```
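
The search request can be issued from Python with the standard library as sketched below. The function names are hypothetical, and the shape of the JSON response is not documented here, so the sketch returns it unparsed beyond JSON decoding.

```python
import json
from urllib.request import Request, urlopen


def build_search_request(token, query, top_k=10, base_url="http://localhost:3015"):
    """Build the JSON POST /search request with a bearer token."""
    body = json.dumps({"query": query, "top_k": top_k}).encode("utf-8")
    return Request(
        f"{base_url}/search",
        data=body,
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {token}",
        },
        method="POST",
    )


def search(token, query, top_k=10):
    """Send the search request and return the decoded JSON response."""
    with urlopen(build_search_request(token, query, top_k), timeout=30) as resp:
        return json.load(resp)
```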
### System Health
```http
GET /health
```
## Web UI
The system includes a web interface running on port 3015. Access it at:
```
http://localhost:3015
```
### Web UI Features
- Document upload and management
- Real-time search interface
- Document processing status
- System health monitoring
- Authentication-protected access
## Testing
### Test Script
Run the comprehensive test script:
```bash
python test_lightrag_webui.py
```
### Test Documents
Place test documents in the `test_documents/` directory:
- Text files (.txt)
- PDF documents (.pdf)
- Word documents (.docx)
- Excel files (.xlsx)
- PowerPoint files (.pptx)
- Images with text (.jpg, .png)
### Health Check
```bash
curl http://localhost:3015/health
```
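
In automated tests it is often useful to wait for the server to come up instead of checking once. The sketch below polls `/health` with the standard library. The function names, retry count, and delay are hypothetical, and it assumes a healthy server answers with HTTP 200.

```python
import time
from urllib.error import URLError
from urllib.request import urlopen


def health_ok(url="http://localhost:3015/health", timeout=5):
    """Return True if the health endpoint answers with HTTP 200."""
    try:
        with urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (URLError, OSError):
        return False  # connection refused, timeout, or an HTTP error status


def wait_until_healthy(check=health_ok, attempts=10, delay=2.0, sleep=time.sleep):
    """Poll `check` up to `attempts` times; return the successful attempt number or None."""
    for attempt in range(1, attempts + 1):
        if check():
            return attempt
        if attempt < attempts:
            sleep(delay)
    return None
```

Calling `wait_until_healthy()` before running `test_lightrag_webui.py` avoids spurious failures while the server is still loading models.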
## Performance Optimization
### GPU Configuration
- Uses NVIDIA RTX 4070 Super for OCR and model inference
- CUDA device: `cuda:0`
- Batch processing for optimal throughput
### Memory Management
- Parallel processing with configurable worker count
- Chunked document processing
- Efficient entity extraction with spaCy
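
A configurable worker count is commonly enforced with a semaphore. The sketch below caps concurrent work at 4, matching the "Max parallel insert" setting listed under Performance Settings. It is not the project's implementation: `process_all` and `fake_worker` are hypothetical, and the worker only simulates processing.

```python
import asyncio


async def process_all(paths, worker, max_parallel=4):
    """Run worker(path) for every path with at most max_parallel running at once."""
    semaphore = asyncio.Semaphore(max_parallel)

    async def guarded(path):
        async with semaphore:  # waits here while max_parallel workers are busy
            return await worker(path)

    # gather preserves input order in its result list
    return await asyncio.gather(*(guarded(p) for p in paths))


async def fake_worker(path):
    """Stand-in for real document processing (assumption for the demo)."""
    await asyncio.sleep(0.01)
    return f"processed {path}"


if __name__ == "__main__":
    docs = [f"doc{i}.pdf" for i in range(10)]
    results = asyncio.run(process_all(docs, fake_worker))
    print(len(results), "documents processed")
```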
### Database Optimization
- Connection pooling for all databases
- Batch insert operations
- Indexed queries for fast retrieval
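
Batch inserts reduce round trips by grouping records before they are written. The generic helper below shows the grouping step only. The name `batches` is hypothetical and no database client is involved; it is written with `islice` so it runs on Python 3.9, the minimum version listed in the prerequisites.

```python
from itertools import islice


def batches(items, size):
    """Yield consecutive lists of at most `size` items from any iterable."""
    if size < 1:
        raise ValueError("size must be at least 1")
    iterator = iter(items)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch


print(list(batches(range(7), 3)))  # [[0, 1, 2], [3, 4, 5], [6]]
```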
## Troubleshooting
### Common Issues
1. **Authentication 401 Errors**
   - Ensure the login request sends form data, not JSON
   - Check that the password matches the `--key` parameter
2. **OCR Initialization Failures**
   - Verify that PaddleOCR 3.3.0 or later is installed
   - Check GPU availability and the CUDA installation
3. **Database Connection Issues**
   - Verify that all databases are running
   - Check the connection strings in `config.ini`
4. **Unicode Encoding Problems**
   - Set the `PYTHONIOENCODING=utf-8` environment variable
### Logs and Monitoring
- Server logs output to console
- Document processing status available via API
- System health endpoint for monitoring
## Configuration
Edit `config.ini` for custom settings:
- Database connection strings
- Performance parameters
- OCR and processing settings
- Server configuration
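
The layout of `config.ini` is not shown in this README, so the sketch below uses hypothetical section and key names (`[server]`, `[processing]`, `chunk_size`, and so on) to show how such a file could be read with Python's `configparser`. The fallback values are the ones quoted earlier in this document.

```python
import configparser

# Hypothetical layout: the real config.ini may use different sections and keys.
SAMPLE = """
[server]
port = 3015

[processing]
chunk_size = 1200
chunk_overlap = 100
max_parallel_insert = 4
"""


def load_settings(text):
    """Parse INI text and return typed settings, falling back to the README defaults."""
    parser = configparser.ConfigParser()
    parser.read_string(text)
    return {
        "port": parser.getint("server", "port", fallback=3015),
        "chunk_size": parser.getint("processing", "chunk_size", fallback=1200),
        "chunk_overlap": parser.getint("processing", "chunk_overlap", fallback=100),
        "max_parallel_insert": parser.getint("processing", "max_parallel_insert", fallback=4),
    }


print(load_settings(SAMPLE)["chunk_size"])  # 1200
```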
## License
This project is built on the LightRAG framework. See the individual component licenses for details.