table detection enhanced

2026-01-14 15:15:01 +08:00
parent e7256a10ea
commit 1838c37302
14 changed files with 18065490 additions and 71 deletions
--- a/README.md
+++ b/README.md
@@ -94,11 +94,50 @@ The system now features a sophisticated document processing pipeline with comple
 - **Persistent Classifier**: Fast GPU classifier with batch processing (16.6x performance improvement)

 #### Supported File Types
-- **PDF** (text-based and scanned): PyMuPDF + PaddleOCR
-- **Images**: JPEG, PNG, BMP, TIFF, GIF, WebP
-- **Office Documents**: DOC, DOCX, PPT, PPTX, XLS, XLSX
-- **Text Files**: TXT, MD
-- **HTML**: BeautifulSoup4
+- **PDF** (text-based and scanned): PyMuPDF + PaddleOCR + Tabula table extraction
+- **Images**: JPEG, PNG, BMP, TIFF, GIF, WebP (OCR with table detection)
+- **Office Documents**: DOC, DOCX, PPT, PPTX, XLS, XLSX (native table extraction)
+- **Text Files**: TXT, MD (text pattern table detection)
+- **HTML**: BeautifulSoup4 (table extraction from HTML tables)
+
+### Table Extraction Capabilities
+
+LightRAG now features enhanced table recognition with a hybrid approach for optimal speed and accuracy:
+
+#### Table Extraction Methods
+1. **Tabula Integration** (Digital PDFs):
+   - Extracts tables from PDFs with text layers using the Tabula library
+   - Supports both lattice (bordered) and stream (borderless) table detection
+   - Non-AI approach with excellent accuracy for digital PDFs
+   - Fast processing with direct PDF parsing
+
+2. **Enhanced OCR Heuristics** (Scanned Documents):
+   - Advanced layout analysis of OCR bounding boxes
+   - Adaptive row grouping based on text height and vertical alignment
+   - Column clustering with dynamic threshold detection
+   - Header row detection and table boundary validation
+   - Non-AI approach optimized for scanned documents and images
+
+3. **Text Pattern Detection** (Simple Tables):
+   - Detects pipe (`|`) and tab-separated tables in text content
+   - Identifies table-like structures in plain text documents
+
+#### Hybrid Processing Strategy
+- **Digital PDFs**: Tabula extraction first, falling back to text pattern detection
+- **Scanned PDFs**: OCR with enhanced heuristic table detection
+- **Images**: OCR-based table extraction only
+- **Office Documents**: Native table extraction from DOCX/XLSX formats
+
+#### Performance Characteristics
+- **Non-AI Methods**: All table extraction methods are non-AI, ensuring fast processing
+- **Speed**: Tabula extraction is near-instant for digital PDFs; OCR heuristics add minimal overhead
+- **Accuracy**: High accuracy for digital PDFs with Tabula; good accuracy for scanned documents with enhanced heuristics
+- **Integration**: Extracted tables are included in searchable content and metadata
+
+#### Configuration
+- Tabula is automatically used when available (requires `tabula-py>=2.8.0`)
+- Enhanced OCR heuristics are enabled by default in the optimized OCR processor
+- Table extraction can be disabled via configuration if needed

 ### Image Classification Features
 - **Object Detection**: Identifies objects in images (e.g., "a photo of a bee")