jleu3482 a09ab4641c Initial commit: LightRAG project with document download and auto-commit

2026-01-11 02:20:47 +08:00

Production-Grade RAG System with LightRAG

A high-performance Retrieval-Augmented Generation (RAG) system built on the LightRAG framework, with multi-format document processing, GPU-accelerated OCR, and production-grade database storage.

System Architecture

Storage Backends

KV Storage: Redis (Memurai) - redis://localhost:6379
Graph Storage: Neo4j - bolt://localhost:7687 (username: neo4j, password: jleu1212)
Vector Storage: Qdrant - http://localhost:6333/
Document Status Storage: PostgreSQL - postgresql://jleu3482:jleu1212@localhost:5432/rag_anything
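
Before starting the server it can help to confirm that each backend port accepts connections. The following is a minimal sketch using only the Python standard library. The `BACKENDS` dictionary, `host_port`, and `is_reachable` are illustrative names, not part of LightRAG. The check opens a TCP connection only and does not authenticate, so credentials are left out of the URLs.

```python
import socket
from urllib.parse import urlparse

# Endpoints copied from the list above; credentials are deliberately omitted
# because this check never authenticates.
BACKENDS = {
    "redis": "redis://localhost:6379",
    "neo4j": "bolt://localhost:7687",
    "qdrant": "http://localhost:6333/",
    "postgres": "postgresql://localhost:5432/rag_anything",
}


def host_port(url: str) -> tuple[str, int]:
    """Extract (host, port) from a connection URL."""
    parsed = urlparse(url)
    if parsed.hostname is None or parsed.port is None:
        raise ValueError(f"URL needs an explicit host and port: {url!r}")
    return parsed.hostname, parsed.port


def is_reachable(url: str, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to the URL's host and port succeeds."""
    host, port = host_port(url)
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

Calling `is_reachable(url)` for each entry in `BACKENDS` reports which services are down. A successful TCP connection does not prove that the service itself is healthy.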

AI Model Configuration

LLM: DeepSeek API (API key supplied via the DEEPSEEK_API_KEY environment variable)
Embeddings: Jina AI
Entity Extraction: spaCy models for fast indexing
Reranker: Disabled for performance optimization
OCR: PaddleOCR with GPU acceleration

Performance Settings

Multi-core processing: Parallel document processing
GPU acceleration: NVIDIA RTX 4070 Super
Chunking: 1200 tokens with 100 token overlap
Max graph nodes: 1000
Max parallel insert: 4 concurrent operations
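
To make the chunking setting concrete, here is a minimal sketch of a sliding window of 1200 tokens with a 100-token overlap. It treats a document as a plain list of tokens. LightRAG's own tokenizer and chunker may behave differently, and `chunk_tokens` is a hypothetical helper written for illustration.

```python
def chunk_tokens(tokens, chunk_size=1200, overlap=100):
    """Split a token list into windows of chunk_size that share `overlap` tokens."""
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    step = chunk_size - overlap  # each window starts this many tokens after the previous one
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the last window already reaches the end of the document
    return chunks
```

With these defaults each new chunk begins 1100 tokens after the previous one, so consecutive chunks share exactly 100 tokens.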

Installation

Prerequisites

Python 3.9+
Redis (Memurai) running on port 6379
Neo4j running on bolt://localhost:7687
Qdrant running on http://localhost:6333
PostgreSQL running on port 5432 with database rag_anything

Install Dependencies

cd LightRAG-main
pip install -r requirements.txt

Download spaCy Model

python -m spacy download en_core_web_lg

Environment Setup

Set the following environment variables:

# For Jina embeddings (optional)
set JINA_API_KEY=your_jina_api_key_here

# For DeepSeek LLM
set DEEPSEEK_API_KEY=your_deepseek_api_key_here

# For Windows terminal encoding
set PYTHONIOENCODING=utf-8
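
A small pre-flight check can catch a missing variable before the server starts. This is a sketch only. Treating `DEEPSEEK_API_KEY` as required and the other two as optional is an assumption based on the comments above, and `missing_variables` is a hypothetical helper.

```python
import os

# Assumption: only the LLM key is strictly required; the others are optional.
REQUIRED = ("DEEPSEEK_API_KEY",)
OPTIONAL = ("JINA_API_KEY", "PYTHONIOENCODING")


def missing_variables(env=None):
    """Return the required variable names that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED if not env.get(name)]
```

Calling `missing_variables()` with no argument inspects the current process environment. Passing a dictionary makes the function easy to test.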

Running the System

Start the Server

python -m lightrag.api.lightrag_server --port 3015 --working-dir rag_storage --input-dir inputs --key jleu1212 --auto-scan-at-startup --llm-binding openai --embedding-binding jina --rerank-binding null

Server Parameters

--port 3015: Web server port
--working-dir rag_storage: Storage directory for processed data
--input-dir inputs: Directory for document uploads
--key jleu1212: Authentication key
--auto-scan-at-startup: Automatically process documents in input directory
--llm-binding openai: Use OpenAI-compatible API (DeepSeek)
--embedding-binding jina: Use Jina embeddings
--rerank-binding null: Disable reranker for performance
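
When the server is launched from a script, the same flags can be assembled as an argument list. The sketch below only builds the command; it uses the flags listed above and nothing else. `build_server_command` and the `change-me` placeholder are illustrative.

```python
import sys


def build_server_command(port=3015, working_dir="rag_storage",
                         input_dir="inputs", api_key="change-me"):
    """Build the argument list for the start command shown above."""
    return [
        sys.executable, "-m", "lightrag.api.lightrag_server",
        "--port", str(port),
        "--working-dir", working_dir,
        "--input-dir", input_dir,
        "--key", api_key,
        "--auto-scan-at-startup",
        "--llm-binding", "openai",
        "--embedding-binding", "jina",
        "--rerank-binding", "null",
    ]
```

The resulting list can be passed to `subprocess.run`. Passing a list rather than a single string avoids shell quoting problems with paths that contain spaces.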

Document Processing Capabilities

Supported Formats

DOCX/DOC: python-docx + custom table parser
XLSX/XLS: pandas + openpyxl
PDF (Text-based): pymupdf (fitz)
PDF (Image-based): paddleocr + layout detection
PPTX/PPT: python-pptx
Images: MobileOne classification → PaddleOCR (Fast filter, OCR only when needed)
TXT/CSV: Direct read
HTML: beautifulsoup4
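
The format list above amounts to a dispatch on file extension. A minimal sketch of that lookup follows. The mapping mirrors the list, but the project's actual routing code is not shown in this README, so `PARSER_BY_EXTENSION` and `parser_for` are hypothetical, and the image and HTML extensions are limited to the ones this README mentions.

```python
from pathlib import Path

# Extension -> parser named in the list above (labels only, not imports).
PARSER_BY_EXTENSION = {
    ".docx": "python-docx", ".doc": "python-docx",
    ".xlsx": "pandas + openpyxl", ".xls": "pandas + openpyxl",
    ".pdf": "pymupdf (text) or paddleocr (scanned)",
    ".pptx": "python-pptx", ".ppt": "python-pptx",
    ".jpg": "MobileOne + PaddleOCR", ".png": "MobileOne + PaddleOCR",
    ".txt": "direct read", ".csv": "direct read",
    ".html": "beautifulsoup4",
}


def parser_for(filename: str):
    """Return the parser label for a file, or None if the format is unsupported."""
    return PARSER_BY_EXTENSION.get(Path(filename).suffix.lower())
```

Lower-casing the suffix means `Report.DOCX` and `report.docx` are routed the same way.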

OCR Processing

Uses PaddleOCR with GPU acceleration
MobileOne-S1 model for image classification
Automatic detection of scanned pages and images
Fast filtering to only OCR when necessary
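
One common way to decide whether a PDF page needs OCR is to check how much text its text layer yields. The sketch below illustrates that heuristic only. It is not the project's actual filter, it does not model the MobileOne image classification step, and the 20-character threshold is an arbitrary assumption.

```python
def needs_ocr(extracted_text: str, min_chars: int = 20) -> bool:
    """Treat a page as scanned when its text layer is (almost) empty."""
    return len(extracted_text.strip()) < min_chars


def pages_needing_ocr(page_texts, min_chars: int = 20):
    """Return the indexes of pages whose extracted text falls below the threshold."""
    return [index for index, text in enumerate(page_texts)
            if needs_ocr(text, min_chars)]
```

Only the pages returned by `pages_needing_ocr` would then be sent to PaddleOCR, which keeps the GPU work limited to pages that lack usable text.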

API Endpoints

Authentication

POST /login
Content-Type: application/x-www-form-urlencoded

username=admin&password=jleu1212
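
The login call can be scripted with the standard library. The sketch below sends the credentials as form data, as the request above requires. The `access_token` field read from the response is an assumption about the response shape and should be checked against the running server.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen


def build_login_request(base_url: str, username: str, password: str) -> Request:
    """Build the form-encoded POST /login request without sending it."""
    body = urlencode({"username": username, "password": password}).encode("ascii")
    return Request(
        f"{base_url}/login",
        data=body,
        method="POST",
        headers={"Content-Type": "application/x-www-form-urlencoded"},
    )


def login(base_url: str, username: str, password: str) -> str:
    """Send the login request and return the bearer token."""
    request = build_login_request(base_url, username, password)
    with urlopen(request, timeout=10) as response:
        payload = json.load(response)
    return payload["access_token"]  # assumed field name
```

Separating request construction from sending makes it possible to inspect the exact body and headers when debugging 401 responses.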

Document Management

POST /upload
Content-Type: multipart/form-data
Authorization: Bearer {token}

POST /documents/status
GET /documents
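
An upload is a multipart form POST with the bearer token attached. The sketch below builds such a request with the standard library. The form field name `file` is an assumption, since this README does not state it, and `build_upload_request` is a hypothetical helper.

```python
import uuid
from urllib.request import Request


def build_upload_request(base_url: str, token: str, filename: str,
                         content: bytes, field_name: str = "file") -> Request:
    """Build a multipart/form-data POST /upload request without sending it."""
    boundary = uuid.uuid4().hex  # random separator between form parts
    head = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="{field_name}"; '
        f'filename="{filename}"\r\n'
        "Content-Type: application/octet-stream\r\n\r\n"
    ).encode("utf-8")
    tail = f"\r\n--{boundary}--\r\n".encode("ascii")
    return Request(
        f"{base_url}/upload",
        data=head + content + tail,
        method="POST",
        headers={
            "Content-Type": f"multipart/form-data; boundary={boundary}",
            "Authorization": f"Bearer {token}",
        },
    )
```

The request can be sent with `urllib.request.urlopen`. The file content is passed as bytes so that binary formats such as PDF and DOCX are not altered.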

Search

POST /search
Content-Type: application/json
Authorization: Bearer {token}

{
  "query": "search query",
  "top_k": 10
}
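
The search request above can be built in Python as follows. This is a sketch that covers only the two fields shown, `query` and `top_k`. The response format is not documented in this README, so the example stops at constructing the request.

```python
import json
from urllib.request import Request


def build_search_request(base_url: str, token: str, query: str,
                         top_k: int = 10) -> Request:
    """Build the JSON POST /search request without sending it."""
    body = json.dumps({"query": query, "top_k": top_k}).encode("utf-8")
    return Request(
        f"{base_url}/search",
        data=body,
        method="POST",
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {token}",
        },
    )
```

Unlike the login call, this endpoint takes a JSON body, so the payload is serialized with `json.dumps` rather than form-encoded.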

System Health

GET /health

Web UI

The system includes a web interface running on port 3015. Access it at:

http://localhost:3015

Web UI Features

Document upload and management
Real-time search interface
Document processing status
System health monitoring
Authentication-protected access

Testing

Test Script

Run the comprehensive test script:

python test_lightrag_webui.py

Test Documents

Place test documents in the test_documents/ directory:

Text files (.txt)
PDF documents (.pdf)
Word documents (.docx)
Excel files (.xlsx)
PowerPoint files (.pptx)
Images with text (.jpg, .png)

Health Check

curl http://localhost:3015/health
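
In test scripts it is often useful to wait until the server responds before sending requests. The sketch below polls the health endpoint with the standard library. The retry count and delay are arbitrary assumptions, and `wait_until_healthy` is a hypothetical helper.

```python
import time
from urllib.request import urlopen


def fetch_status(url: str, timeout: float = 5.0) -> int:
    """Return the HTTP status code of a GET request."""
    with urlopen(url, timeout=timeout) as response:
        return response.status


def wait_until_healthy(url: str, attempts: int = 10, delay: float = 1.0,
                       fetch=fetch_status) -> bool:
    """Poll the URL until it answers 200 or the attempts run out."""
    for attempt in range(attempts):
        try:
            if fetch(url) == 200:
                return True
        except OSError:
            # Covers refused connections, timeouts, and urllib's HTTP errors.
            pass
        if attempt < attempts - 1:
            time.sleep(delay)
    return False
```

A call such as `wait_until_healthy("http://localhost:3015/health")` returns `False` instead of raising, so the caller can decide how to report a server that never came up.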

Performance Optimization

GPU Configuration

Uses NVIDIA RTX 4070 Super for OCR and model inference
CUDA device: cuda:0
Batch processing for optimal throughput

Memory Management

Parallel processing with configurable worker count
Chunked document processing
Efficient entity extraction with spaCy
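
A configurable worker count can be expressed with a bounded thread pool. The sketch below shows the general pattern rather than the project's actual scheduler. `process_all` is a hypothetical helper, and its default of 4 workers simply mirrors the "Max parallel insert" setting listed earlier.

```python
from concurrent.futures import ThreadPoolExecutor


def process_all(paths, process_one, max_workers: int = 4):
    """Apply process_one to every path with at most max_workers running at once.

    Results are returned in the same order as the input paths.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(process_one, paths))
```

Threads suit I/O-bound steps such as reading files or calling APIs. CPU-bound parsing would need a process pool to use multiple cores.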

Database Optimization

Connection pooling for all databases
Batch insert operations
Indexed queries for fast retrieval
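
Batch inserts rely on grouping records before each database round trip. A minimal sketch of that grouping step follows. `batched` is a hypothetical helper, the batch size is up to the caller, and the insert call itself is omitted because it depends on the backend.

```python
from itertools import islice


def batched(items, batch_size: int):
    """Yield consecutive lists of up to batch_size items from any iterable."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    iterator = iter(items)
    while batch := list(islice(iterator, batch_size)):
        yield batch
```

Each yielded list can be handed to a single bulk insert. The final batch may be shorter than `batch_size`.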

Troubleshooting

Common Issues

Authentication 401 Errors
- Ensure login uses form data, not JSON
- Check password matches the --key parameter

OCR Initialization Failures
- Verify PaddleOCR version 3.3.0+
- Check GPU availability and CUDA installation

Database Connection Issues
- Verify all databases are running
- Check connection strings in config.ini

Unicode Encoding Problems
- Set PYTHONIOENCODING=utf-8 environment variable

Logs and Monitoring

Server logs output to console
Document processing status available via API
System health endpoint for monitoring

Configuration

Edit config.ini for custom settings:

Database connection strings
Performance parameters
OCR and processing settings
Server configuration
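
Settings in an INI file can be read with Python's `configparser`. The sketch below is illustrative only. The section and key names in `SAMPLE` are hypothetical, since this README does not list the actual layout of config.ini, and the fallback values are the ones stated elsewhere in this document.

```python
import configparser

# Hypothetical layout; the real config.ini may use different sections and keys.
SAMPLE = """
[server]
port = 3015

[processing]
chunk_size = 1200
chunk_overlap = 100
max_parallel_insert = 4
"""


def load_settings(text: str) -> dict:
    """Parse INI text and fall back to this README's values when keys are missing."""
    parser = configparser.ConfigParser()
    parser.read_string(text)
    return {
        "port": parser.getint("server", "port", fallback=3015),
        "chunk_size": parser.getint("processing", "chunk_size", fallback=1200),
        "chunk_overlap": parser.getint("processing", "chunk_overlap", fallback=100),
        "max_parallel_insert": parser.getint(
            "processing", "max_parallel_insert", fallback=4),
    }
```

Using `fallback` means a partially filled file still produces a complete settings dictionary.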

License

This project is built on the LightRAG framework. See individual component licenses for details.
