2026-01-11 18:40:39 +08:00

LightRAG Production System

A production-grade RAG system using LightRAG with GPU-accelerated OCR, multi-database storage, and advanced document processing.

System Overview

This system provides:

  • GPU-accelerated OCR using PaddleOCR on an RTX 4070 Super
  • Multi-database storage (Redis, Neo4j, Qdrant, PostgreSQL)
  • Advanced document processing for PDFs, images, and Office documents
  • Snowflake Arctic Embed for embeddings
  • DeepSeek API for LLM functionality
  • Jina Reranker for search optimization

Quick Start

Prerequisites

  1. Install Ollama and pull required models:
ollama pull snowflake-arctic-embed2:latest
ollama pull jina-reranker-v2:latest
ollama pull mistral-nemo:latest
  2. Ensure databases are running:
  • Redis (Memurai) on localhost:6379
  • Neo4j on bolt://localhost:7687 (user: neo4j, password: jleu1212)
  • Qdrant on http://localhost:6333
  • PostgreSQL on localhost:5432 (database: rag_anything, user: jleu3482, password: jleu1212)
  3. Install Python dependencies:
pip install -r requirements.txt
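
Before starting the server, it can help to confirm that the four database ports listed above accept TCP connections. The following is a minimal sketch using only the Python standard library. The function names are illustrative and not part of this project, and a successful connection only shows that something is listening on the port, not that the credentials are valid.

```python
import socket

# Hosts and ports taken from the prerequisites list above.
SERVICES = {
    "Redis": ("localhost", 6379),
    "Neo4j": ("localhost", 7687),
    "Qdrant": ("localhost", 6333),
    "PostgreSQL": ("localhost", 5432),
}


def is_listening(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def check_services(services=SERVICES):
    """Map each service name to whether its port accepts connections."""
    return {name: is_listening(host, port) for name, (host, port) in services.items()}


if __name__ == "__main__":
    for name, ok in check_services().items():
        print(f"{name}: {'reachable' if ok else 'NOT reachable'}")
```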

Running the System

Method 1: Using zrun.bat (Windows)

zrun.bat

Method 2: Manual Start

cd LightRAG-main
python -m lightrag.api.lightrag_server --port 3015 --host 0.0.0.0 --working-dir rag_storage --input-dir ../inputs --key jleu1212 --auto-scan-at-startup --llm-binding openai --embedding-binding ollama --rerank-binding jina

Access the Web UI

Open your browser to: http://localhost:3015/webui/

Login Credentials:

  • Username: jleu3482
  • Password: jleu1212

Configuration

Environment Variables (.env)

Key configurations:

  • LLM: DeepSeek API (remote service; GPU acceleration applies to local OCR and embeddings)
  • Embeddings: Snowflake Arctic Embed 2 (1024 dimensions)
  • Reranker: Jina Reranker v2
  • Storage: Redis + Neo4j + Qdrant + PostgreSQL
  • OCR: PaddleOCR with GPU acceleration

Performance Settings

  • Chunk Size: 1200 tokens
  • Chunk Overlap: 100 tokens
  • Max Parallel Insert: 4 operations
  • Max Graph Nodes: 1000 nodes
  • GPU Acceleration: Enabled for OCR and embeddings
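
To illustrate how the chunk size and overlap settings interact, here is a minimal sketch of fixed-size chunking with overlap. It is not LightRAG's own chunker: the function name is hypothetical, and the input is simply a list of tokens, whereas the real system counts tokens with a model tokenizer.

```python
def chunk_tokens(tokens, chunk_size=1200, overlap=100):
    """Split a token list into chunks of chunk_size that share `overlap` tokens."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - overlap  # each new chunk starts this many tokens later
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the current chunk already reaches the end of the input
    return chunks


if __name__ == "__main__":
    demo = list(range(2500))
    for chunk in chunk_tokens(demo):
        print(chunk[0], chunk[-1], len(chunk))
```

With the defaults, consecutive chunks start 1100 tokens apart, so the last 100 tokens of one chunk are repeated at the start of the next.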

Document Processing

Enhanced Processing Pipeline with Dependency Isolation

The document processing pipeline keeps the dependencies of the OCR and image classification components fully isolated from each other:

Processing Flow:

  1. Text-First Extraction: All file types attempt text extraction first
  2. Image Extraction: Extract images from documents (PDF, DOCX, images)
  3. OCR Processing: PaddleOCR with GPU acceleration for text extraction from images
  4. Image Classification: OpenCLIP in isolated environment for image content classification
  5. Metadata Integration: Classification results included in searchable content
  6. Vector Embedding: Snowflake Arctic Embed 2
  7. Multi-Database Storage: Neo4j for knowledge graphs + Qdrant for vector search
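
The metadata integration step (step 5) can be pictured as merging natively extracted text, OCR text, and classification labels into a single string that is then embedded and indexed. The sketch below is an assumption about how that merge might look. The `ImageResult` type, the `build_content` function, and the bracketed tag format are all hypothetical and not taken from this project.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ImageResult:
    ocr_text: str      # text recovered from the image by OCR (may be empty)
    label: str         # classifier label, e.g. "a photo of a bee"
    confidence: float  # classifier confidence in the range 0.0 to 1.0


def build_content(native_text: str, images: List[ImageResult]) -> str:
    """Combine native text with per-image OCR text and classification labels."""
    parts = [native_text.strip()] if native_text.strip() else []
    for index, image in enumerate(images, start=1):
        if image.ocr_text.strip():
            parts.append(f"[Image {index} OCR] {image.ocr_text.strip()}")
        parts.append(
            f"[Image {index} classification] {image.label} ({image.confidence:.0%})"
        )
    return "\n".join(parts)


if __name__ == "__main__":
    print(build_content("Intro paragraph.", [ImageResult("", "a photo of a bee", 1.0)]))
```

Because the label text ends up inside the indexed content, a query such as "bee" can match a document whose only mention of a bee is an image.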

Dependency Isolation Architecture:

  • PaddleOCR: Runs in main environment with GPU-only operation
  • OpenCLIP: Runs in isolated virtual environment (openclip_gpu_env) to avoid dependency conflicts
  • Persistent Classifier: Fast GPU classifier with batch processing (16.6x performance improvement)
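
One common way to achieve this kind of isolation is to run the classifier under the other environment's Python interpreter as a subprocess and exchange JSON over stdin and stdout. The sketch below shows that one-shot pattern only. It does not reproduce the project's persistent classifier, and the interpreter path and worker script name in the usage comment are assumptions.

```python
import json
import subprocess


def run_in_isolated_env(python_exe, args, payload, timeout=120):
    """Run a worker under another interpreter; exchange JSON over stdin/stdout."""
    completed = subprocess.run(
        [python_exe, *args],
        input=json.dumps(payload),
        capture_output=True,
        text=True,
        timeout=timeout,
        check=True,  # raise CalledProcessError if the worker exits non-zero
    )
    return json.loads(completed.stdout)


# Hypothetical usage; the interpreter path and script name are assumptions:
# labels = run_in_isolated_env(
#     r"openclip_gpu_env\Scripts\python.exe",
#     ["classify_images.py"],
#     {"images": ["page1_img1.png"]},
# )
```

This pattern requires the worker to write nothing but JSON to stdout. A one-shot subprocess also pays the model loading cost on every call, which is the overhead a persistent, batched classifier avoids.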

Supported File Types

  • PDF (text-based and scanned): PyMuPDF + PaddleOCR
  • Images: JPEG, PNG, BMP, TIFF, GIF, WebP
  • Office Documents: DOC, DOCX, PPT, PPTX, XLS, XLSX
  • Text Files: TXT, MD
  • HTML: BeautifulSoup4

Image Classification Features

  • Object Detection: Identifies objects in images (e.g., "a photo of a bee")
  • Content Understanding: Classifies document elements (screenshots, charts, diagrams)
  • Search Integration: Classification metadata included in indexed content
  • High Accuracy: The bee image in the test document is detected with 100% confidence

Test Document

Use test.docx containing a bee image to verify the complete classification workflow:

python test.py

API Endpoints

Health Check

curl http://localhost:3015/api/health

System Status

curl http://localhost:3015/health

Document Upload

curl -X POST -F "file=@document.pdf" http://localhost:3015/api/upload -H "X-API-Key: jleu1212"

Search

curl -X POST http://localhost:3015/api/search -H "Content-Type: application/json" -H "X-API-Key: jleu1212" -d '{"query": "your search query"}'
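
The search request can also be issued from Python. This is a minimal sketch using only the standard library: the URL, header names, and body shape mirror the curl command above, while the function names and the assumption that the server replies with JSON are not confirmed by this README.

```python
import json
import urllib.request

BASE_URL = "http://localhost:3015"


def build_search_request(query, api_key, base_url=BASE_URL):
    """Build the POST request for /api/search without sending it."""
    body = json.dumps({"query": query}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/api/search",
        data=body,
        headers={"Content-Type": "application/json", "X-API-Key": api_key},
        method="POST",
    )


def search(query, api_key, base_url=BASE_URL, timeout=30):
    """Send the search request; assumes the response body is JSON."""
    request = build_search_request(query, api_key, base_url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

Keeping request construction separate from sending makes it possible to inspect the request without a running server.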

GPU Configuration

The system is optimized for NVIDIA RTX 4070 Super with:

  • CUDA 11.8
  • cuDNN 8.6.0
  • PaddlePaddle GPU version
  • PyTorch with CUDA support
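
A quick way to confirm that both frameworks can see the GPU is to query them directly. The sketch below is an illustrative helper, not part of this project. It reports `None` for a framework that cannot be imported in the current environment, so it should be run once in the main environment and once in the isolated OpenCLIP environment.

```python
def gpu_report():
    """Report CUDA support as seen by PaddlePaddle and PyTorch."""
    report = {}
    try:
        import paddle
        # True if this PaddlePaddle build was compiled with CUDA support.
        report["paddle_cuda"] = paddle.device.is_compiled_with_cuda()
    except Exception:
        report["paddle_cuda"] = None  # not installed or failed to load
    try:
        import torch
        # True if PyTorch can initialize a CUDA device at runtime.
        report["torch_cuda"] = torch.cuda.is_available()
    except Exception:
        report["torch_cuda"] = None  # not installed or failed to load
    return report


if __name__ == "__main__":
    for name, value in gpu_report().items():
        print(f"{name}: {value}")
```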

Troubleshooting

Common Issues

  1. OCR not working: Check CUDA installation and PaddleOCR GPU support
  2. Database connection errors: Verify all databases are running
  3. Model loading errors: Ensure Ollama models are downloaded
  4. GPU memory issues: Reduce batch sizes in configuration

Logs

Check lightrag.log for detailed processing logs and error information.
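
For a long log, a small filter can surface the most recent problems. This sketch assumes that error entries contain one of the keywords shown. The actual format of lightrag.log is not documented here, so the keywords may need adjusting.

```python
def error_lines(log_path, keywords=("ERROR", "CRITICAL", "Traceback"), limit=20):
    """Return the last `limit` lines of the log that contain any keyword."""
    hits = []
    with open(log_path, encoding="utf-8", errors="replace") as handle:
        for line in handle:
            if any(keyword in line for keyword in keywords):
                hits.append(line.rstrip("\n"))
    return hits[-limit:]


if __name__ == "__main__":
    import os
    if os.path.exists("lightrag.log"):
        for line in error_lines("lightrag.log"):
            print(line)
```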

Production Validation

The system has been validated on the following functionality:

Core Functionality

  • OCR PDF upload and indexing
  • Multi-database storage (Redis, Neo4j, Qdrant, PostgreSQL)
  • GPU-accelerated PaddleOCR on an RTX 4070 Super
  • DeepSeek API integration with regional restriction fix
  • Snowflake Arctic Embed embeddings (1024 dimensions)
  • Web UI authentication and document management
  • Search functionality with OCR content

Advanced Features

  • Complete dependency isolation between PaddleOCR and OpenCLIP
  • Image classification with OpenCLIP in isolated environment
  • Bee image detection at 100% confidence
  • Classification metadata included in searchable content
  • Persistent GPU classifier with batch processing
  • Text-first extraction for all file types

Test Documents

  • Use ocr.pdf to test OCR functionality
  • Use test.docx containing a bee image to test the complete classification workflow:
python test.py

Bee Search Verification

The system successfully indexes and searches for image classification content:

  • Search for "bee" returns documents with bee images
  • Search for "photo of a bee" returns relevant content
  • Image classification metadata is searchable in LightRAG

Performance Optimization

  • Enable GPU acceleration for OCR and embeddings
  • Use parallel processing for document ingestion
  • Optimize chunk sizes for your document types
  • Monitor database performance and connection pools

Security

  • API key authentication for upload and search
  • Secure database connections
  • Input validation and sanitization
  • Rate limiting on API endpoints