504 Gateway Time-out Fix Report
Problem
After uploading a document, the frontend receives a "504 Gateway Time-out" error when it tries to load documents from /documents/paginated. The error comes from the nginx reverse proxy, which times out after waiting 60 seconds for the backend response.
Root Cause Analysis
- First paginated request after upload takes >60 seconds (observed 80 seconds).
- Subsequent paginated requests are fast (~2 seconds).
- The slowness is due to lock contention in the document status storage (JSON implementation) where the paginated endpoint waits for a lock held by OCR processing.
- OCR processing (PaddleOCR) holds the lock for an extended period while extracting text from PDFs/Word documents.
- The nginx proxy timeout is set to 60 seconds (default), causing a 504 error when the backend takes longer.
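The contention pattern described above can be reproduced in isolation. The following is a minimal sketch, assuming the storage is guarded by an asyncio.Lock (the actual lock type used by the document status storage is not shown in this report). The function names long_writer, paginated_read, and measure_wait are hypothetical stand-ins, not project code.

```python
import asyncio
import time


async def long_writer(lock: asyncio.Lock, hold_seconds: float) -> None:
    # Stand-in for OCR processing that keeps the storage lock while it works.
    async with lock:
        await asyncio.sleep(hold_seconds)


async def paginated_read(lock: asyncio.Lock) -> float:
    # Stand-in for the paginated endpoint: returns how long it waited for the lock.
    start = time.perf_counter()
    async with lock:
        pass  # the real endpoint would read and paginate the data here
    return time.perf_counter() - start


async def measure_wait(hold_seconds: float) -> float:
    lock = asyncio.Lock()
    writer = asyncio.create_task(long_writer(lock, hold_seconds))
    await asyncio.sleep(0)  # yield once so the writer acquires the lock first
    waited = await paginated_read(lock)
    await writer
    return waited


if __name__ == "__main__":
    print(f"reader waited {asyncio.run(measure_wait(0.5)):.2f}s for the lock")
```

The reader's latency tracks the writer's hold time, which matches the observation that only the request issued during processing is slow.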
Implemented Fixes
1. Increased Server Timeout
- Modified LightRAG-main/start_server.py to include the --timeout 600 argument, increasing the gunicorn worker timeout to 600 seconds.
- This ensures the backend does not kill the worker before nginx times out.
2. Recommendations for Nginx Configuration
If you have control over the nginx reverse proxy, increase the proxy timeout settings:
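location / {
    proxy_pass http://localhost:3015;
    # Time nginx waits between two successive reads from the backend (default 60s).
    proxy_read_timeout 300s;
    # Time allowed to establish the connection; nginx documents that this usually cannot exceed 75s.
    proxy_connect_timeout 300s;
    # Time nginx waits between two successive writes to the backend.
    proxy_send_timeout 300s;
}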
3. Workarounds for Frontend
- Polling: After uploading a document, wait for processing to complete before calling /documents/paginated. Use the /documents/pipeline_status endpoint to check whether busy is false.
- Exponential Backoff: Implement client-side retry with increasing delays (e.g., 5s, 10s, 20s) when a 504 error occurs.
- Optimize Upload: Upload smaller documents or pre-extract text to reduce OCR processing time.
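The polling and backoff workarounds above can be sketched as follows. This is an illustration under stated assumptions, not the frontend's actual code: get_status is a hypothetical callable that returns the parsed JSON from /documents/pipeline_status (assumed to contain a busy field, as described above), and fetch is a hypothetical callable that requests /documents/paginated and returns a (status_code, body) pair. The HTTP client is deliberately left out.

```python
import time
from typing import Callable, Dict, Iterable, Iterator, Optional, Tuple


def backoff_delays(base: float = 5.0, factor: float = 2.0, attempts: int = 3) -> Iterator[float]:
    """Yield increasing delays; the defaults give 5s, 10s, 20s."""
    delay = base
    for _ in range(attempts):
        yield delay
        delay *= factor


def wait_until_idle(
    get_status: Callable[[], Dict],
    poll_interval: float = 2.0,
    max_wait: float = 600.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll until the status reports busy == False; return False if max_wait elapses."""
    waited = 0.0
    while waited <= max_wait:
        if not get_status().get("busy", False):
            return True
        sleep(poll_interval)
        waited += poll_interval
    return False


def fetch_with_retry(
    fetch: Callable[[], Tuple[int, object]],
    delays: Optional[Iterable[float]] = None,
    sleep: Callable[[float], None] = time.sleep,
) -> Tuple[int, object]:
    """Call fetch(); while it returns HTTP 504, wait for the next delay and retry."""
    status, body = fetch()
    for delay in (delays if delays is not None else backoff_delays()):
        if status != 504:
            break
        sleep(delay)
        status, body = fetch()
    return status, body
```

A caller would typically run wait_until_idle first and fall back to fetch_with_retry only if a 504 still occurs. Passing sleep as a parameter keeps both helpers testable without real delays.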
4. Potential Code Improvements (Future)
- Reduce lock contention in JsonDocStatusStorage by using read‑write locks or copying a data snapshot.
- Make the paginated endpoint non‑blocking by using a copy of the data without holding the lock for the entire iteration.
- Optimize OCR processing to release locks between pages.
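The snapshot idea from the list above might look roughly like this. It is a sketch only: SnapshotDocStatusStore is a hypothetical stand-in and does not reproduce the real JsonDocStatusStorage interface. It also assumes an asyncio.Lock and an in-memory dict keyed by document ID.

```python
import asyncio
from typing import Any, Dict, List, Tuple


class SnapshotDocStatusStore:
    """Hypothetical in-memory status store; not the real JsonDocStatusStorage."""

    def __init__(self) -> None:
        self._lock = asyncio.Lock()
        self._data: Dict[str, Dict[str, Any]] = {}

    async def upsert(self, doc_id: str, status: Dict[str, Any]) -> None:
        # Writers replace the whole entry, so snapshots never see a half-updated record.
        async with self._lock:
            self._data[doc_id] = dict(status)

    async def get_page(
        self, page: int, page_size: int
    ) -> Tuple[List[Tuple[str, Dict[str, Any]]], int]:
        # Hold the lock only long enough to take a shallow copy of the mapping.
        async with self._lock:
            snapshot = dict(self._data)
        # Sorting and slicing run on the copy, outside the lock.
        items = sorted(snapshot.items(), key=lambda kv: kv[0])
        start = (page - 1) * page_size
        return items[start:start + page_size], len(items)
```

This shortens the time the reader holds the lock, but the reader still has to acquire it once. If a writer keeps the lock for a whole OCR run, the copy is delayed just the same, so this change only pays off together with shorter write-side critical sections.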
Verification
- After restarting the server with increased timeout, the paginated endpoint responds within 2 seconds when no background processing is active.
- The first request after upload still times out if OCR is still processing; the workarounds above mitigate this.
Next Steps
- Monitor server logs for any further timeout issues.
- Consider upgrading to a more performant storage backend (PostgreSQL, MongoDB) if document count grows.
- If the issue persists, consider implementing asynchronous paginated responses (return job ID, poll for results).
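One possible shape for the asynchronous paginated response mentioned in the last step is sketched below. PageJobRegistry and its methods are hypothetical names introduced for illustration. The sketch keeps jobs in memory within a single process, so it would not survive a restart or work across multiple workers without shared storage.

```python
import asyncio
import uuid
from typing import Any, Awaitable, Dict


class PageJobRegistry:
    """Hypothetical in-memory registry: submit a coroutine, get a job ID, poll later."""

    def __init__(self) -> None:
        self._tasks: Dict[str, "asyncio.Future[Any]"] = {}

    def submit(self, work: Awaitable[Any]) -> str:
        # Must be called from within a running event loop.
        job_id = uuid.uuid4().hex
        self._tasks[job_id] = asyncio.ensure_future(work)
        return job_id

    def poll(self, job_id: str) -> Dict[str, Any]:
        task = self._tasks.get(job_id)
        if task is None:
            return {"status": "unknown"}
        if not task.done():
            return {"status": "pending"}
        # A finished job is handed out once and then forgotten.
        del self._tasks[job_id]
        if task.cancelled():
            return {"status": "failed", "error": "cancelled"}
        error = task.exception()
        if error is not None:
            return {"status": "failed", "error": str(error)}
        return {"status": "done", "result": task.result()}
```

The request that starts the job returns immediately with the job ID, so no single HTTP request has to outlive the proxy timeout. The client then polls until the status is no longer pending.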
Files Modified
- LightRAG-main/start_server.py: added --timeout 600
Created Test Scripts
- test_paginated_performance.py: measures paginated endpoint response times
- test_paginated_now.py: quick verification
Conclusion
The immediate fix (increased server timeout) prevents the server from terminating long‑running requests, but the 60-second nginx timeout remains the bottleneck. The recommended solution is to increase the nginx proxy timeout and implement client‑side retry logic. For production deployments, consider reducing lock contention in the storage layer.