table detection enhanced

2026-01-14 15:15:01 +08:00
parent e7256a10ea
commit 1838c37302
14 changed files with 18065490 additions and 71 deletions

@@ -94,11 +94,50 @@ The system now features a sophisticated document processing pipeline with comple
- **Persistent Classifier**: Fast GPU classifier with batch processing (16.6x performance improvement)
#### Supported File Types
- **PDF** (text-based and scanned): PyMuPDF + PaddleOCR
- **Images**: JPEG, PNG, BMP, TIFF, GIF, WebP
- **Office Documents**: DOC, DOCX, PPT, PPTX, XLS, XLSX
- **Text Files**: TXT, MD
- **HTML**: BeautifulSoup4
- **PDF** (text-based and scanned): PyMuPDF + PaddleOCR + Tabula table extraction
- **Images**: JPEG, PNG, BMP, TIFF, GIF, WebP (OCR with table detection)
- **Office Documents**: DOC, DOCX, PPT, PPTX, XLS, XLSX (native table extraction)
- **Text Files**: TXT, MD (text pattern table detection)
- **HTML**: BeautifulSoup4 (table extraction from HTML tables)
### Table Extraction Capabilities
LightRAG now recognizes tables with a hybrid approach that selects the extraction method by document type, balancing speed and accuracy:
#### Table Extraction Methods
1. **Tabula Integration** (Digital PDFs):
- Extracts tables from PDFs with text layers using the Tabula library
- Supports both lattice (bordered) and stream (borderless) table detection
- Non-AI approach with excellent accuracy for digital PDFs
- Fast processing with direct PDF parsing
2. **Enhanced OCR Heuristics** (Scanned Documents):
- Advanced layout analysis of OCR bounding boxes
- Adaptive row grouping based on text height and vertical alignment
- Column clustering with dynamic threshold detection
- Header row detection and table boundary validation
- Non-AI approach optimized for scanned documents and images
3. **Text Pattern Detection** (Simple Tables):
- Detects pipe (`|`) and tab-separated tables in text content
- Identifies table-like structures in plain text documents
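The adaptive row grouping described under the OCR heuristics can be illustrated with a minimal sketch. It is not the project's implementation: the `Box` type, the `group_rows` and `looks_like_table` names, and the `tolerance` value are all hypothetical, and the only assumption about the OCR engine is that it yields a text string with an axis-aligned bounding box for each fragment.

```python
from dataclasses import dataclass
from statistics import median
from typing import List


@dataclass
class Box:
    """One OCR text fragment (hypothetical shape: top-left corner plus size)."""
    text: str
    x: float
    y: float
    w: float
    h: float


def group_rows(boxes: List[Box], tolerance: float = 0.6) -> List[List[Box]]:
    """Group boxes into rows by vertical centre.

    The threshold adapts to the page: it is a fraction of the median
    text height rather than a fixed pixel count.
    """
    if not boxes:
        return []
    limit = tolerance * median(b.h for b in boxes)
    rows = []  # each entry: {"centre": running mean, "boxes": [...]}
    for box in sorted(boxes, key=lambda b: b.y + b.h / 2):
        centre = box.y + box.h / 2
        if rows and abs(centre - rows[-1]["centre"]) <= limit:
            row = rows[-1]
            row["boxes"].append(box)
            # Update the running mean of the row's vertical centre.
            row["centre"] += (centre - row["centre"]) / len(row["boxes"])
        else:
            rows.append({"centre": centre, "boxes": [box]})
    # Within each row, order cells left to right.
    return [sorted(r["boxes"], key=lambda b: b.x) for r in rows]


def looks_like_table(rows: List[List[Box]], min_rows: int = 2, min_cols: int = 2) -> bool:
    """Crude boundary check: enough rows that each hold several cells."""
    return sum(1 for r in rows if len(r) >= min_cols) >= min_rows
```

Column clustering and header detection would build on the rows returned here, for example by clustering the `x` positions of the cells across rows.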
#### Hybrid Processing Strategy
- **Digital PDFs**: Tabula extraction first, fallback to text pattern detection
- **Scanned PDFs**: OCR with enhanced heuristic table detection
- **Images**: OCR-based table extraction only
- **Office Documents**: Native table extraction from DOCX/XLSX formats
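The "try one extractor, then fall back" strategy for digital PDFs might look like the sketch below. The function names are hypothetical, and the sketch simplifies by handing every extractor the same input, whereas a real pipeline would pass the PDF path to Tabula and the extracted text to the pattern detector. The pipe-table detector stands in for the text pattern fallback; tab-separated detection is omitted.

```python
from typing import Callable, List, Optional, Sequence, Tuple

Table = List[List[str]]
Extractor = Callable[[str], List[Table]]


def _split_row(line: str) -> Optional[List[str]]:
    """Return the cells of a pipe-delimited line, or None if it is not a row."""
    stripped = line.strip()
    if "|" not in stripped:
        return None
    cells = [c.strip() for c in stripped.strip("|").split("|")]
    return cells if len(cells) >= 2 else None


def _is_separator(cells: List[str]) -> bool:
    """True for Markdown header separators such as |---|:--:|."""
    return all(c and set(c) <= set("-:") for c in cells)


def detect_pipe_tables(text: str, min_rows: int = 2) -> List[Table]:
    """Collect runs of consecutive pipe-delimited lines as tables."""
    tables: List[Table] = []
    current: Table = []
    for line in text.splitlines() + [""]:  # trailing sentinel flushes the last run
        cells = _split_row(line)
        if cells is None:
            if len(current) >= min_rows:
                tables.append(current)
            current = []
        elif not _is_separator(cells):
            current.append(cells)
    return tables


def extract_with_fallback(
    source: str, extractors: Sequence[Tuple[str, Extractor]]
) -> Tuple[Optional[str], List[Table]]:
    """Run extractors in priority order; return the first non-empty result."""
    for name, extractor in extractors:
        try:
            tables = extractor(source)
        except Exception:
            continue  # a failing extractor should not abort the pipeline
        if tables:
            return name, tables
    return None, []
```

Returning the extractor name alongside the tables makes it easy to record in metadata which method produced each result.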
#### Performance Characteristics
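- **Non-AI Methods**: Table detection itself is rule-based in every path; no ML model runs beyond the OCR step already required for scanned input
- **Speed**: Tabula parses the PDF text layer directly, so digital PDFs need no rendering or OCR; the OCR heuristics operate on bounding boxes the OCR step has already produced, so they add little overhead
- **Accuracy**: High accuracy for digital PDFs with Tabula; good accuracy for scanned documents with enhanced heuristics, where results depend on OCR quality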
- **Integration**: Extracted tables are included in searchable content and metadata
#### Configuration
- Tabula is automatically used when available (requires `tabula-py>=2.8.0` and a Java runtime, since `tabula-py` wraps tabula-java)
- Enhanced OCR heuristics are enabled by default in the optimized OCR processor
- Table extraction can be disabled via configuration if needed
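A sketch of how "used when available" and "can be disabled" might be wired together is shown below. The `enabled` flag and both function names are hypothetical, since the actual configuration key is not given here. `tabula.read_pdf` with its `pages`, `lattice`, `stream`, and `multiple_tables` arguments is the real `tabula-py` entry point.

```python
import importlib.util
from typing import List


def tabula_available() -> bool:
    """Check for an importable `tabula` module without importing it."""
    return importlib.util.find_spec("tabula") is not None


def extract_pdf_tables(pdf_path: str, enabled: bool = True) -> List:
    """Return a list of pandas DataFrames, one per detected table.

    Tries lattice mode (bordered tables) first, then stream mode
    (borderless tables) if lattice finds nothing.
    """
    if not enabled or not tabula_available():
        return []
    import tabula  # deferred so the module loads even without tabula-py

    tables = tabula.read_pdf(pdf_path, pages="all", lattice=True, multiple_tables=True)
    if not tables:
        tables = tabula.read_pdf(pdf_path, pages="all", stream=True, multiple_tables=True)
    # Drop empty frames that Tabula sometimes returns for false positives.
    return [df for df in tables if not df.empty]
```

Deferring the import keeps the rest of the pipeline usable on machines without `tabula-py` or Java installed.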
### Image Classification Features
- **Object Detection**: Identifies objects in images (e.g., "a photo of a bee")