table detection enhanced
49
README.md
@@ -94,11 +94,50 @@ The system now features a sophisticated document processing pipeline with comple
- **Persistent Classifier**: Fast GPU classifier with batch processing (16.6x performance improvement)

#### Supported File Types
- **PDF** (text-based and scanned): PyMuPDF + PaddleOCR
- **Images**: JPEG, PNG, BMP, TIFF, GIF, WebP
- **Office Documents**: DOC, DOCX, PPT, PPTX, XLS, XLSX
- **Text Files**: TXT, MD
- **HTML**: BeautifulSoup4
- **PDF** (text-based and scanned): PyMuPDF + PaddleOCR + Tabula table extraction
- **Images**: JPEG, PNG, BMP, TIFF, GIF, WebP (OCR with table detection)
- **Office Documents**: DOC, DOCX, PPT, PPTX, XLS, XLSX (native table extraction)
- **Text Files**: TXT, MD (text pattern table detection)
- **HTML**: BeautifulSoup4 (table extraction from HTML tables)

### Table Extraction Capabilities

LightRAG now recognizes tables with a hybrid approach that selects an extraction method per document type, balancing speed and accuracy:

#### Table Extraction Methods
1. **Tabula Integration** (Digital PDFs):
   - Extracts tables from PDFs with text layers using the Tabula library
   - Supports both lattice (bordered) and stream (borderless) table detection
   - Non-AI approach with excellent accuracy for digital PDFs
   - Fast processing with direct PDF parsing

2. **Enhanced OCR Heuristics** (Scanned Documents):
   - Advanced layout analysis of OCR bounding boxes
   - Adaptive row grouping based on text height and vertical alignment
   - Column clustering with dynamic threshold detection
   - Header row detection and table boundary validation
   - Non-AI approach optimized for scanned documents and images

3. **Text Pattern Detection** (Simple Tables):
   - Detects pipe-separated (`|`) and tab-separated tables in text content
   - Identifies table-like structures in plain text documents
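The lattice and stream modes in item 1 correspond to keyword arguments of `tabula.read_pdf` in tabula-py. The sketch below is only an illustration, not the project's actual code: the function name `extract_pdf_tables`, the injectable `read_pdf` parameter, and the lattice-then-stream order are all assumptions.

```python
def extract_pdf_tables(pdf_path, read_pdf=None):
    """Try lattice (bordered) detection first, then stream (borderless).

    `read_pdf` is injectable (an assumption of this sketch) so the logic can
    be exercised without Java or tabula-py; by default it is tabula.read_pdf.
    """
    if read_pdf is None:
        import tabula  # needs tabula-py and a Java runtime

        read_pdf = tabula.read_pdf

    for mode in ("lattice", "stream"):
        tables = read_pdf(pdf_path, pages="all", multiple_tables=True, **{mode: True})
        # Drop empty results; pandas DataFrames expose an `.empty` attribute.
        tables = [t for t in tables if not getattr(t, "empty", False)]
        if tables:
            return mode, tables
    return None, []
```

Returning the mode alongside the tables makes it easy to log which detection style matched a given PDF.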
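Row grouping and column clustering from item 2 can be sketched with plain Python. This is a simplified illustration under stated assumptions: the `Box` type (axis-aligned rectangles rather than the quadrilaterals an OCR engine may return), the function names, and the `tol_factor` and `gap_factor` values are hypothetical, and header detection and boundary validation are left out.

```python
from dataclasses import dataclass
from statistics import median


@dataclass
class Box:
    text: str
    x: float  # left edge
    y: float  # top edge
    w: float
    h: float


def group_rows(boxes, tol_factor=0.6):
    """Group boxes into rows; the tolerance adapts to the median text height."""
    if not boxes:
        return []
    tol = tol_factor * median(b.h for b in boxes)
    rows = []
    for box in sorted(boxes, key=lambda b: b.y + b.h / 2):
        cy = box.y + box.h / 2
        if rows and abs(cy - rows[-1]["cy"]) <= tol:
            row = rows[-1]
            row["boxes"].append(box)
            row["cy"] += (cy - row["cy"]) / len(row["boxes"])  # running mean
        else:
            rows.append({"cy": cy, "boxes": [box]})
    return [sorted(r["boxes"], key=lambda b: b.x) for r in rows]


def column_starts(rows, gap_factor=1.5):
    """Cluster left edges; a gap larger than the threshold starts a new column."""
    boxes = [b for row in rows for b in row]
    if not boxes:
        return []
    gap = gap_factor * median(b.h for b in boxes)
    xs = sorted(b.x for b in boxes)
    clusters = [[xs[0]]]
    for x in xs[1:]:
        if x - clusters[-1][-1] <= gap:
            clusters[-1].append(x)
        else:
            clusters.append([x])
    return [sum(c) / len(c) for c in clusters]


def to_grid(boxes):
    """Place each box in the cell of its row and nearest column start."""
    rows = group_rows(boxes)
    cols = column_starts(rows)
    grid = []
    for row in rows:
        cells = [""] * len(cols)
        for b in row:
            idx = min(range(len(cols)), key=lambda i: abs(cols[i] - b.x))
            cells[idx] = (cells[idx] + " " + b.text).strip()
        grid.append(cells)
    return grid
```

Deriving both thresholds from the median text height keeps the grouping independent of scan resolution, which is one plausible reading of "adaptive" and "dynamic" above.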
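For item 3, a pattern-based detector can be as small as the following sketch. The function names, the `min_rows` threshold, and the rule that consecutive rows must have the same number of cells are assumptions for illustration, not the project's implementation.

```python
import re


def split_row(line):
    """Split a line into cells on pipes or tabs; None if it has fewer than two."""
    if "|" in line:
        cells = [c.strip() for c in line.strip().strip("|").split("|")]
    elif "\t" in line:
        cells = [c.strip() for c in line.split("\t")]
    else:
        return None
    return cells if len(cells) >= 2 else None


def is_separator(cells):
    """True for a Markdown header rule such as |---|:---:|."""
    return all(re.fullmatch(r":?-+:?", c) for c in cells)


def find_tables(text, min_rows=2):
    """Return runs of consecutive rows that share the same column count."""
    tables, current = [], []
    for line in text.splitlines() + [""]:  # the sentinel flushes the last run
        cells = split_row(line)
        if cells is not None and current and is_separator(cells):
            continue  # keep the table open but drop the rule row
        if cells is not None and (not current or len(cells) == len(current[0])):
            current.append(cells)
            continue
        if len(current) >= min_rows:
            tables.append(current)
        current = [cells] if cells is not None else []
    return tables
```

Requiring at least two aligned rows is a cheap guard against prose lines that merely contain a stray pipe or tab.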
#### Hybrid Processing Strategy
- **Digital PDFs**: Tabula extraction first, falling back to text pattern detection
- **Scanned PDFs**: OCR with enhanced heuristic table detection
- **Images**: OCR-based table extraction only
- **Office Documents**: Native table extraction from DOCX/XLSX formats
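The strategy above amounts to an ordered list of extractors per document kind, tried until one returns tables. A minimal sketch follows; the kind labels, the extractor names, and the `extract_with_fallback` function are hypothetical and only mirror the bullets above.

```python
# Ordered extractor names per document kind (labels are illustrative).
STRATEGIES = {
    "digital_pdf": ["tabula", "text_pattern"],
    "scanned_pdf": ["ocr_heuristics"],
    "image": ["ocr_heuristics"],
    "office": ["native"],
}


def extract_with_fallback(kind, source, extractors):
    """Run the extractors for `kind` in order; return (name, tables) of the first hit.

    `extractors` maps a name to a callable that takes the source and returns
    a list of tables (possibly empty).
    """
    for name in STRATEGIES.get(kind, []):
        extractor = extractors.get(name)
        if extractor is None:
            continue  # extractor not installed or not configured
        try:
            tables = extractor(source)
        except Exception:
            continue  # treat a failing extractor like an empty result
        if tables:
            return name, tables
    return None, []
```

Keeping the order in a table rather than in branching code makes it straightforward to add or reorder fallbacks per file type.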

#### Performance Characteristics
- **Non-AI Methods**: All table extraction methods are non-AI, ensuring fast processing
- **Speed**: Tabula extraction is near-instant for digital PDFs; OCR heuristics add minimal overhead
- **Accuracy**: High accuracy for digital PDFs with Tabula; good accuracy for scanned documents with enhanced heuristics
- **Integration**: Extracted tables are included in searchable content and metadata
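One way the integration point could work is to render each table as pipe-delimited text appended to the document content and to record its shape in the metadata. The sketch below is an assumption about the general shape only; `table_to_markdown`, `attach_tables`, and the metadata keys are hypothetical names.

```python
def table_to_markdown(rows):
    """Render rows (first row is the header) as a pipe-delimited table."""
    header, *body = rows
    lines = [
        "| " + " | ".join(str(c) for c in header) + " |",
        "| " + " | ".join("---" for _ in header) + " |",
    ]
    lines += ["| " + " | ".join(str(c) for c in r) + " |" for r in body]
    return "\n".join(lines)


def attach_tables(text, tables):
    """Append rendered tables to the text and summarize them in metadata."""
    tables = [t for t in tables if t]  # skip empty tables
    rendered = [table_to_markdown(t) for t in tables]
    content = "\n\n".join([text, *rendered]) if rendered else text
    metadata = {
        "table_count": len(tables),
        "tables": [{"rows": len(t), "columns": len(t[0])} for t in tables],
    }
    return {"content": content, "metadata": metadata}
```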

#### Configuration
- Tabula is automatically used when available (requires `tabula-py>=2.8.0`)
- Enhanced OCR heuristics are enabled by default in the optimized OCR processor
- Table extraction can be disabled via configuration if needed
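"Automatically used when available" can be implemented with a standard-library availability check. The following is a sketch under assumptions: `TableConfig` and its `enable_table_extraction` flag are hypothetical names (the actual configuration key is not given here), and the version parsing handles only plain numeric versions.

```python
import importlib.util
import re
from dataclasses import dataclass
from importlib import metadata

MIN_TABULA = (2, 8, 0)  # from the stated requirement tabula-py>=2.8.0


@dataclass
class TableConfig:
    enable_table_extraction: bool = True  # hypothetical flag name


def parse_version(raw):
    """Parse 'X.Y' or 'X.Y.Z' into a tuple of ints; (0, 0, 0) if unparseable."""
    match = re.match(r"(\d+)\.(\d+)(?:\.(\d+))?", raw)
    if not match:
        return (0, 0, 0)
    return tuple(int(part or 0) for part in match.groups())


def tabula_usable():
    """True if the `tabula` module is importable and new enough."""
    if importlib.util.find_spec("tabula") is None:
        return False
    try:
        return parse_version(metadata.version("tabula-py")) >= MIN_TABULA
    except metadata.PackageNotFoundError:
        return False


def should_use_tabula(cfg):
    """Tabula runs only if extraction is enabled and the library is usable."""
    return cfg.enable_table_extraction and tabula_usable()
```

Checking with `find_spec` avoids importing the library (and starting its Java dependency) just to decide whether it can be used.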

### Image Classification Features
- **Object Detection**: Identifies objects in images (e.g., "a photo of a bee")