# 504 Gateway Time-out Fix Report
## Problem
After uploading a document, the frontend receives a "504 Gateway Time-out" error when it tries to load documents from `/documents/paginated`. The error comes from the nginx reverse proxy, which times out after 60 seconds of waiting for the backend response.
## Root Cause Analysis
1. **First paginated request after upload takes >60 seconds** (observed 80 seconds).
2. Subsequent paginated requests are fast (~2 seconds).
3. The slowness is due to lock contention in the document status storage (JSON implementation) where the paginated endpoint waits for a lock held by OCR processing.
4. OCR processing (PaddleOCR) holds the lock for an extended period while extracting text from PDFs/Word documents.
5. The nginx proxy timeout is set to 60 seconds (default), causing a 504 error when the backend takes longer.
## Implemented Fixes
### 1. Increased Server Timeout
- Modified `LightRAG-main/start_server.py` to pass the `--timeout 600` argument, raising the gunicorn worker timeout to 600 seconds.
- This keeps the backend from killing a worker while a slow request is still in progress. It does not change the nginx timeout, so a 504 can still occur at the proxy.
### 2. Recommendations for Nginx Configuration
If you have control over the nginx reverse proxy, increase the proxy timeout settings:
```nginx
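location / {
    proxy_pass http://localhost:3015;
    # Time allowed between successive reads from the backend (default 60s).
    proxy_read_timeout 300s;
    # Per the nginx documentation, this value usually cannot exceed 75 seconds.
    proxy_connect_timeout 300s;
    proxy_send_timeout 300s;
}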
```
### 3. Workarounds for Frontend
- **Polling**: After uploading a document, wait for processing to complete before calling `/documents/paginated`. Use the `/documents/pipeline_status` endpoint to check whether `busy` is false.
- **Exponential Backoff**: Implement client-side retry with increasing delays (e.g., 5s, 10s, 20s) when a 504 error occurs.
- **Optimize Upload**: Upload smaller documents or pre-extract text to reduce OCR processing time.
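
The polling and backoff workarounds above can be sketched as follows. This is a minimal illustration rather than the frontend's actual code: `get_status` and `fetch` are hypothetical callables standing in for requests to `/documents/pipeline_status` and `/documents/paginated`, and the status payload is assumed to be a JSON object with a `busy` field. The `sleep` parameter is injected so the logic can be exercised without real waiting.

```python
import time
from typing import Callable, Iterator, Tuple


def backoff_delays(initial: float = 5.0, factor: float = 2.0,
                   attempts: int = 3) -> Iterator[float]:
    """Yield increasing delays: 5s, 10s, 20s with the defaults."""
    delay = initial
    for _ in range(attempts):
        yield delay
        delay *= factor


def wait_until_idle(get_status: Callable[[], dict],
                    poll_interval: float = 2.0,
                    max_wait: float = 600.0,
                    sleep: Callable[[float], None] = time.sleep) -> bool:
    """Poll until the status reports busy == False. Return False on timeout."""
    waited = 0.0
    while waited <= max_wait:
        # Assumption: the status payload carries a boolean "busy" field.
        if not get_status().get("busy", False):
            return True
        sleep(poll_interval)
        waited += poll_interval
    return False


def fetch_with_retry(fetch: Callable[[], Tuple[int, object]],
                     sleep: Callable[[float], None] = time.sleep
                     ) -> Tuple[int, object]:
    """Call fetch(); on HTTP 504, retry after each backoff delay."""
    delays = list(backoff_delays())
    for attempt in range(len(delays) + 1):
        status, body = fetch()
        if status != 504:
            # Any non-504 response is handed back to the caller unchanged.
            return status, body
        if attempt < len(delays):
            sleep(delays[attempt])
    raise TimeoutError("still receiving 504 after all retries")
```

A caller would typically run `wait_until_idle(...)` once after the upload completes and then load the list through `fetch_with_retry(...)`, so that a 504 during a late OCR pass is retried instead of surfaced to the user.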
### 4. Potential Code Improvements (Future)
- Reduce lock contention in `JsonDocStatusStorage` by using read-write locks or by copying a snapshot of the data.
- Make the paginated endpoint non-blocking by iterating over a copy of the data instead of holding the lock for the entire iteration.
- Optimize OCR processing to release locks between pages.
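
The snapshot idea from the list above might look like the sketch below. `SnapshotDocStatusStore` is a hypothetical stand-in written for illustration; it is not the real `JsonDocStatusStorage`, whose interface and locking primitives are not shown in this report. The reader holds the lock only long enough to copy the entries, then sorts and slices the copy without blocking writers.

```python
import threading
from typing import Dict, List, Tuple


class SnapshotDocStatusStore:
    """Hypothetical in-memory stand-in for a JSON-backed status store."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._data: Dict[str, dict] = {}

    def upsert(self, doc_id: str, status: dict) -> None:
        # Writers replace the whole entry instead of mutating it in place,
        # so a shallow snapshot taken by a reader stays consistent.
        with self._lock:
            self._data[doc_id] = dict(status)

    def get_paginated(self, page: int, page_size: int) -> Tuple[List[dict], int]:
        # Hold the lock just long enough to copy the (id, status) pairs.
        with self._lock:
            snapshot = list(self._data.items())
        # Sorting and slicing run on the copy, outside the lock.
        snapshot.sort(key=lambda item: item[0])
        start = (page - 1) * page_size
        rows = [{"id": doc_id, **status}
                for doc_id, status in snapshot[start:start + page_size]]
        return rows, len(snapshot)
```

This only helps if writers also keep their critical sections short. If OCR processing holds the lock for the whole extraction, the reader still waits to take its snapshot, which is why releasing the lock between pages is listed alongside it.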
## Verification
- After restarting the server with increased timeout, the paginated endpoint responds within 2 seconds when no background processing is active.
- The first request after upload still times out if OCR is still processing; the workarounds above mitigate this.
## Next Steps
1. Monitor server logs for any further timeout issues.
2. Consider upgrading to a more performant storage backend (PostgreSQL, MongoDB) if document count grows.
3. If the issue persists, consider implementing asynchronous paginated responses (return job ID, poll for results).
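
The asynchronous response pattern in step 3 could be prototyped roughly as below. `PaginationJobs` and its method names are assumptions made for this sketch, and the jobs live in memory only. A real endpoint would wrap `submit` and `poll` in two HTTP routes and would need expiry for finished jobs.

```python
import threading
import uuid
from concurrent.futures import Future, ThreadPoolExecutor
from typing import Callable, Dict


class PaginationJobs:
    """Hypothetical in-memory registry: submit a query, poll by job ID."""

    def __init__(self, max_workers: int = 2) -> None:
        self._executor = ThreadPoolExecutor(max_workers=max_workers)
        self._jobs: Dict[str, Future] = {}
        self._lock = threading.Lock()

    def submit(self, query: Callable[[], object]) -> str:
        # Return immediately with an ID; the query runs in the background.
        job_id = uuid.uuid4().hex
        future = self._executor.submit(query)
        with self._lock:
            self._jobs[job_id] = future
        return job_id

    def poll(self, job_id: str) -> dict:
        with self._lock:
            future = self._jobs.get(job_id)
        if future is None:
            return {"state": "unknown"}
        if not future.done():
            return {"state": "pending"}
        error = future.exception()
        if error is not None:
            return {"state": "failed", "error": str(error)}
        return {"state": "done", "result": future.result()}
```

With this shape, each HTTP request returns quickly regardless of how long the storage lock is held, so the proxy timeout no longer bounds the total time a paginated query may take.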
## Files Modified
- `LightRAG-main/start_server.py`: added `--timeout 600`
## Created Test Scripts
- `test_paginated_performance.py`: measures paginated endpoint response times
- `test_paginated_now.py`: quick verification
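
The scripts themselves are not reproduced in this report. As a rough idea of the kind of measurement involved, the sketch below times repeated calls and flags any that would exceed the 60-second proxy limit. `time_calls` and `exceeds_proxy_timeout` are hypothetical helpers, and `request` is any callable that performs one request against `/documents/paginated`.

```python
import time
from typing import Callable, List


def time_calls(request: Callable[[], object], repeats: int = 3) -> List[float]:
    """Return the wall-clock duration in seconds of each call to request()."""
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        request()
        durations.append(time.perf_counter() - start)
    return durations


def exceeds_proxy_timeout(durations: List[float],
                          limit_s: float = 60.0) -> List[bool]:
    """Mark each duration that is longer than the proxy timeout."""
    return [d > limit_s for d in durations]
```

Applied to the timings observed in the root cause analysis, `exceeds_proxy_timeout([80.0, 2.0])` flags the first request and not the second.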
## Conclusion
The immediate fix (increased server timeout) prevents the server from terminating long-running requests, but the nginx timeout remains a bottleneck. The recommended solution is to increase the nginx proxy timeout and implement client-side retry logic. For production deployments, consider reducing lock contention in the storage layer.