# Production-Grade RAG System with LightRAG

A high-performance Retrieval-Augmented Generation (RAG) system built on the LightRAG framework, with multi-format document processing, GPU-accelerated OCR, and production-grade database storage.

## System Architecture

### Storage Backends
- **KV Storage**: Redis (Memurai) - `redis://localhost:6379`
- **Graph Storage**: Neo4j - `bolt://localhost:7687` (username: `neo4j`, password: `jleu1212`)
- **Vector Storage**: Qdrant - `http://localhost:6333/`
- **Document Status Storage**: PostgreSQL - `postgresql://jleu3482:jleu1212@localhost:5432/rag_anything`

### AI Model Configuration
- **LLM**: DeepSeek API (API key read from the `DEEPSEEK_API_KEY` environment variable)
- **Embeddings**: Jina AI
- **Entity Extraction**: spaCy models for fast indexing
- **Reranker**: Disabled for performance optimization
- **OCR**: PaddleOCR with GPU acceleration

### Performance Settings
- **Multi-core processing**: Parallel document processing
- **GPU acceleration**: NVIDIA RTX 4070 Super
- **Chunking**: 1200 tokens with 100 token overlap
- **Max graph nodes**: 1000
- **Max parallel insert**: 4 concurrent operations
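The chunking setting above (1200 tokens with a 100-token overlap) can be pictured with a minimal sketch. It assumes a pre-tokenized list as input; the tokenizer the system actually uses is not specified here, and `chunk_tokens` is a hypothetical helper, not a LightRAG function.

```python
from typing import List


def chunk_tokens(tokens: List[str], chunk_size: int = 1200, overlap: int = 100) -> List[List[str]]:
    """Split a token list into windows of chunk_size that share `overlap` tokens."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in [0, chunk_size)")
    step = chunk_size - overlap  # each window starts this many tokens after the previous one
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the last window already reached the end of the input
    return chunks
```

With these defaults each chunk starts 1100 tokens after the previous one, so the final 100 tokens of one chunk reappear at the start of the next.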

## Installation

### Prerequisites
- Python 3.9+
- Redis (Memurai) running on port 6379
- Neo4j running on bolt://localhost:7687
- Qdrant running on http://localhost:6333
- PostgreSQL running on port 5432 with database `rag_anything`

### Install Dependencies
```bash
cd LightRAG-main
pip install -r requirements.txt
```

### Download spaCy Model
```bash
python -m spacy download en_core_web_lg
```

### Environment Setup
Set the following environment variables (Windows Command Prompt syntax):
```bat
REM For Jina embeddings (optional)
set JINA_API_KEY=your_jina_api_key_here

REM For DeepSeek LLM
set DEEPSEEK_API_KEY=your_deepseek_api_key_here

REM For Windows terminal encoding
set PYTHONIOENCODING=utf-8
```
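Before starting the server, it can help to confirm that the variables are actually visible to Python. The following is a small sketch; `missing_env_vars` is a hypothetical helper, and which variables count as required is an assumption based on the list above.

```python
import os
from typing import Iterable, List, Mapping


def missing_env_vars(required: Iterable[str], environ: Mapping[str, str] = os.environ) -> List[str]:
    """Return the names in `required` that are unset or blank in `environ`."""
    return [name for name in required if not environ.get(name, "").strip()]


if __name__ == "__main__":
    # Assumed requirement list: the DeepSeek key and the encoding setting.
    missing = missing_env_vars(["DEEPSEEK_API_KEY", "PYTHONIOENCODING"])
    if missing:
        print("Missing environment variables:", ", ".join(missing))
    else:
        print("Environment looks complete.")
```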

## Running the System

### Start the Server
```bash
python -m lightrag.api.lightrag_server --port 3015 --working-dir rag_storage --input-dir inputs --key jleu1212 --auto-scan-at-startup --llm-binding openai --embedding-binding jina --rerank-binding null
```

### Server Parameters
- `--port 3015`: Web server port
- `--working-dir rag_storage`: Storage directory for processed data
- `--input-dir inputs`: Directory for document uploads
- `--key jleu1212`: Authentication key
- `--auto-scan-at-startup`: Automatically process documents in input directory
- `--llm-binding openai`: Use an OpenAI-compatible API (DeepSeek)
- `--embedding-binding jina`: Use Jina embeddings
- `--rerank-binding null`: Disable reranker for performance

## Document Processing Capabilities

### Supported Formats
- **DOCX/DOC**: python-docx + custom table parser
- **XLSX/XLS**: pandas + openpyxl
- **PDF (Text-based)**: pymupdf (fitz)
- **PDF (Image-based)**: paddleocr + layout detection
- **PPTX/PPT**: python-pptx
- **Images**: MobileOne classification → PaddleOCR (fast filter; OCR runs only when needed)
- **TXT/CSV**: Direct read
- **HTML**: beautifulsoup4
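The format list above amounts to a lookup from file extension to parser. The sketch below shows that routing as a plain dictionary; `PARSER_BY_EXTENSION` and `pick_parser` are hypothetical names, the values are labels rather than imported modules, and the image extensions are taken from the test-document list later in this README.

```python
from pathlib import Path

# Labels only; the real system would call the corresponding library here.
PARSER_BY_EXTENSION = {
    ".docx": "python-docx",
    ".doc": "python-docx",
    ".xlsx": "pandas + openpyxl",
    ".xls": "pandas + openpyxl",
    ".pdf": "pymupdf or paddleocr",
    ".pptx": "python-pptx",
    ".ppt": "python-pptx",
    ".jpg": "MobileOne + PaddleOCR",
    ".png": "MobileOne + PaddleOCR",
    ".txt": "direct read",
    ".csv": "direct read",
    ".html": "beautifulsoup4",
}


def pick_parser(filename: str) -> str:
    """Return the parser label for a file, based on its lowercased extension."""
    ext = Path(filename).suffix.lower()
    if ext not in PARSER_BY_EXTENSION:
        raise ValueError(f"unsupported format: {ext or filename}")
    return PARSER_BY_EXTENSION[ext]
```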

### OCR Processing
- Uses PaddleOCR with GPU acceleration
- MobileOne-S1 model for image classification
- Automatic detection of scanned pages and images
- Fast filtering so that OCR runs only when necessary
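One common heuristic for detecting scanned pages is to check how much embedded text a page yields and send near-empty pages to OCR. The sketch below illustrates that idea only; it is not confirmed to be what this system does, and `needs_ocr`, `route_pages`, and the 20-character threshold are assumptions.

```python
from typing import Iterable, List


def needs_ocr(extracted_text: str, min_chars: int = 20) -> bool:
    """Treat a page as scanned when it has fewer than min_chars non-whitespace characters."""
    visible = "".join(extracted_text.split())
    return len(visible) < min_chars


def route_pages(page_texts: Iterable[str]) -> List[str]:
    """Label each page 'ocr' or 'text' according to its embedded text."""
    return ["ocr" if needs_ocr(text) else "text" for text in page_texts]
```

The page texts would come from the text-based PDF extractor, so OCR is only paid for on pages where that extractor finds little or nothing.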

## API Endpoints

### Authentication
```http
POST /login
Content-Type: application/x-www-form-urlencoded

username=admin&password=jleu1212
```
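The login call above must be form-encoded, which is easy to get wrong from client code. The following sketch builds such a request with the standard library; `build_login_request` is a hypothetical helper and the credentials are placeholders.

```python
import urllib.parse
import urllib.request


def build_login_request(base_url: str, username: str, password: str) -> urllib.request.Request:
    """Build a form-encoded POST /login request (the request is not sent here)."""
    body = urllib.parse.urlencode({"username": username, "password": password}).encode("utf-8")
    return urllib.request.Request(
        base_url.rstrip("/") + "/login",
        data=body,
        headers={"Content-Type": "application/x-www-form-urlencoded"},
        method="POST",
    )


if __name__ == "__main__":
    req = build_login_request("http://localhost:3015", "admin", "your_password")
    print(req.get_method(), req.full_url)
```

Sending it is a matter of `urllib.request.urlopen(req)`. The shape of the response body is not documented here, so inspect it before extracting the token.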

### Document Management
```http
POST /upload
Content-Type: multipart/form-data
Authorization: Bearer {token}

POST /documents/status
GET /documents
```

### Search
```http
POST /search
Content-Type: application/json
Authorization: Bearer {token}

{
  "query": "search query",
  "top_k": 10
}
```
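A search call differs from login in that it sends JSON and carries the bearer token. The sketch below builds that request with the standard library; `build_search_request` is a hypothetical helper, and the token value is whatever the login step returned.

```python
import json
import urllib.request


def build_search_request(base_url: str, token: str, query: str, top_k: int = 10) -> urllib.request.Request:
    """Build a JSON POST /search request with a Bearer token (the request is not sent here)."""
    payload = json.dumps({"query": query, "top_k": top_k}).encode("utf-8")
    return urllib.request.Request(
        base_url.rstrip("/") + "/search",
        data=payload,
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {token}",
        },
        method="POST",
    )
```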

### System Health
```http
GET /health
```

## Web UI

The system includes a web interface running on port 3015. Access it at:
```
http://localhost:3015
```

### Web UI Features
- Document upload and management
- Real-time search interface
- Document processing status
- System health monitoring
- Authentication-protected access

## Testing

### Test Script
Run the comprehensive test script:
```bash
python test_lightrag_webui.py
```

### Test Documents
Place test documents in the `test_documents/` directory:
- Text files (.txt)
- PDF documents (.pdf)
- Word documents (.docx)
- Excel files (.xlsx)
- PowerPoint files (.pptx)
- Images with text (.jpg, .png)

### Health Check
```bash
curl http://localhost:3015/health
```
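The same check can be scripted in Python, for example as a readiness probe before running the test script. This is a sketch that assumes the endpoint answers with HTTP 200 when healthy; `check_health` is a hypothetical helper.

```python
import urllib.request


def check_health(base_url: str = "http://localhost:3015", timeout: float = 3.0) -> bool:
    """Return True if GET {base_url}/health answers with HTTP 200."""
    url = base_url.rstrip("/") + "/health"
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except OSError:
        # URLError, HTTPError, and socket timeouts are all OSError subclasses.
        return False


if __name__ == "__main__":
    print("healthy" if check_health() else "unreachable or unhealthy")
```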

## Performance Optimization

### GPU Configuration
- Uses NVIDIA RTX 4070 Super for OCR and model inference
- CUDA device: `cuda:0`
- Batch processing for optimal throughput

### Memory Management
- Parallel processing with configurable worker count
- Chunked document processing
- Efficient entity extraction with spaCy
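Bounded parallelism, such as the limit of 4 concurrent inserts listed under Performance Settings, is commonly implemented with a semaphore. The sketch below shows the pattern with `asyncio`; `process_all` and the `worker` callable are hypothetical and stand in for whatever the system does per document.

```python
import asyncio
from typing import Awaitable, Callable, List, Sequence

MAX_PARALLEL_INSERT = 4  # value from the Performance Settings section


async def process_all(
    docs: Sequence[str],
    worker: Callable[[str], Awaitable[str]],
    limit: int = MAX_PARALLEL_INSERT,
) -> List[str]:
    """Run worker(doc) for every doc, with at most `limit` running at once."""
    semaphore = asyncio.Semaphore(limit)

    async def guarded(doc: str) -> str:
        async with semaphore:  # waits here while `limit` workers are already active
            return await worker(doc)

    # gather preserves input order in its result list
    return await asyncio.gather(*(guarded(doc) for doc in docs))
```

Raising the limit trades memory for throughput, which is why it is worth keeping as a single configurable value.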

### Database Optimization
- Connection pooling for all databases
- Batch insert operations
- Indexed queries for fast retrieval

## Troubleshooting

### Common Issues

1. **Authentication 401 Errors**
   - Ensure the login request sends form data (`application/x-www-form-urlencoded`), not JSON
   - Check that the password matches the `--key` parameter

2. **OCR Initialization Failures**
   - Verify that PaddleOCR 3.3.0 or later is installed
   - Check GPU availability and CUDA installation

3. **Database Connection Issues**
   - Verify all databases are running
   - Check connection strings in `config.ini`

4. **Unicode Encoding Problems**
   - Set the `PYTHONIOENCODING=utf-8` environment variable
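For database connection issues (item 3), a quick first step is to confirm that each service port accepts TCP connections. The sketch below uses the hosts and ports listed under Storage Backends; `is_port_open` and `check_services` are hypothetical helpers, and an open port only shows that something is listening, not that credentials are valid.

```python
import socket
from typing import Dict, Tuple

SERVICES: Dict[str, Tuple[str, int]] = {
    "Redis (Memurai)": ("localhost", 6379),
    "Neo4j": ("localhost", 7687),
    "Qdrant": ("localhost", 6333),
    "PostgreSQL": ("localhost", 5432),
}


def is_port_open(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def check_services(services: Dict[str, Tuple[str, int]] = SERVICES) -> Dict[str, bool]:
    """Map each service name to whether its port is reachable."""
    return {name: is_port_open(host, port) for name, (host, port) in services.items()}


if __name__ == "__main__":
    for name, reachable in check_services().items():
        print(f"{name}: {'reachable' if reachable else 'NOT reachable'}")
```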

### Logs and Monitoring
- Server logs are written to the console
- Document processing status available via API
- System health endpoint for monitoring

## Configuration

Edit `config.ini` for custom settings:
- Database connection strings
- Performance parameters
- OCR and processing settings
- Server configuration
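The layout of `config.ini` is not shown in this README, so the following is only a sketch of how INI settings can be read with Python's `configparser`. Every section and key name in `SAMPLE_CONFIG` is hypothetical; the values mirror the settings listed earlier in this document.

```python
import configparser
from typing import Any, Dict

# Hypothetical layout: section and key names are assumptions, not the real file.
SAMPLE_CONFIG = """
[storage]
redis_uri = redis://localhost:6379
qdrant_url = http://localhost:6333/

[performance]
chunk_size = 1200
chunk_overlap = 100
max_parallel_insert = 4
"""


def load_settings(text: str) -> Dict[str, Any]:
    """Parse INI text and return a flat dict with typed performance values."""
    parser = configparser.ConfigParser()
    parser.read_string(text)
    performance = parser["performance"]
    return {
        "redis_uri": parser["storage"]["redis_uri"],
        "qdrant_url": parser["storage"]["qdrant_url"],
        "chunk_size": performance.getint("chunk_size"),
        "chunk_overlap": performance.getint("chunk_overlap"),
        "max_parallel_insert": performance.getint("max_parallel_insert"),
    }
```

To read the real file instead of the sample string, `parser.read("config.ini")` takes the place of `read_string`.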

## License

This project is built on the LightRAG framework. See individual component licenses for details.