agc-chatbot/cf-plan.md
2025-06-04 15:04:53 +08:00

# Cross-Reference Tab Implementation Plan
## Overview
Create a standalone tab that allows users to upload a document, process it, and find related documents in the existing database.
## Current System Analysis
### Backend (FastAPI)
- ✅ `/search` endpoint exists - can find related documents
- ✅ `/documents` endpoint exists - can retrieve documents
- ❌ No document upload endpoint
- ❌ No document processing for uploaded files
### Frontend
- ✅ Tab system exists
- ✅ Basic cross-reference function exists (hardcoded)
- ❌ No file upload functionality
- ❌ No dedicated cross-reference tab
## Implementation Plan
### Phase 1: Backend API Extensions
#### New Endpoints Needed
1. **`POST /upload-document`**
- Accept file upload (PDF, DOC, DOCX, TXT)
- Extract text content from uploaded file
- Return processed text and document metadata
- **No database storage** - temporary processing only
2. **`POST /find-cross-references`**
- Accept processed document text
- Use existing search functionality internally
- Return related documents with similarity scores
- Include cross-reference analysis
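
The response bodies for the two endpoints above could be modelled roughly as follows. This is a minimal sketch using only the standard library; every class and field name here (`UploadResult`, `CrossReferenceMatch`, `CrossReferenceResponse`, `similarity`, `matched_terms`) is an assumption made for illustration, not part of the existing API.

```python
from dataclasses import dataclass, field, asdict
from typing import List


@dataclass
class UploadResult:
    """Hypothetical response body for POST /upload-document."""
    filename: str
    content_type: str
    char_count: int
    text: str  # extracted text, returned to the client rather than stored


@dataclass
class CrossReferenceMatch:
    """One related document; field names are illustrative."""
    document_id: str
    title: str
    similarity: float  # assumed to be normalized to the range 0..1
    matched_terms: List[str] = field(default_factory=list)


@dataclass
class CrossReferenceResponse:
    """Hypothetical response body for POST /find-cross-references."""
    query_char_count: int
    matches: List[CrossReferenceMatch] = field(default_factory=list)

    def to_dict(self) -> dict:
        # Sort so the most similar document comes first in the payload.
        ordered = sorted(self.matches, key=lambda m: m.similarity, reverse=True)
        return {
            "query_char_count": self.query_char_count,
            "matches": [asdict(m) for m in ordered],
        }
```

Because `/upload-document` does not store anything, the extracted text travels back to the client in `UploadResult.text` and is sent again in the body of the `/find-cross-references` request.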
#### Leverage Existing APIs
- Use existing `/search` endpoint logic for finding related documents
- Use existing `/documents` endpoint to fetch full related documents
- Use existing database connection and document retrieval functions
### Phase 2: Frontend Implementation
#### New Tab Structure
1. **Upload Section**
- File drop zone
- File type validation (PDF, DOC, DOCX, TXT)
- Upload progress indicator
- File preview/summary
2. **Processing Section**
- Processing status indicator
- Document analysis summary
- Key terms extraction display
3. **Results Section**
- Related documents list
- Similarity scores
- Cross-reference details
- Document preview capability
#### UI Components Needed
- File upload widget
- Progress bars
- Results grid/list
- Document preview modal
- Cross-reference visualization
### Phase 3: Processing Logic
#### Document Processing Pipeline
1. **File Upload & Validation**
- Validate file type and size
- Extract text content using appropriate libraries
- Clean and normalize text
2. **Content Analysis**
- Extract key terms and phrases
- Identify legal concepts
- Generate search queries from content
3. **Cross-Reference Matching**
- Use existing search service (`enhanced_rag_service` or `simple_search_service`)
- Multiple search strategies:
- Full text similarity
- Key terms matching
- Legal concept matching
- Rank results by relevance
4. **Results Processing**
- Format cross-reference results
- Include similarity metrics
- Group by document type or relevance
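
The first two pipeline stages, plus the query generation that feeds the third, might look something like the sketch below. It uses only the standard library. The size limit, the stop-word list, and all function names (`validate_upload`, `normalize_text`, `extract_key_terms`, `build_queries`) are assumptions for illustration. Legal concept identification is left out because the plan does not yet say how concepts would be detected.

```python
import re
from collections import Counter
from pathlib import Path
from typing import List

# Assumed limits; the plan does not specify them.
ALLOWED_EXTENSIONS = {".pdf", ".doc", ".docx", ".txt"}
MAX_UPLOAD_BYTES = 10 * 1024 * 1024  # 10 MiB, an arbitrary example value

# A tiny illustrative stop-word list; a real one would be much longer.
STOP_WORDS = {
    "the", "a", "an", "and", "or", "of", "to", "in", "is", "for",
    "on", "that", "this", "with", "by", "be", "as", "at", "it", "shall",
}


def validate_upload(filename: str, size_bytes: int) -> None:
    """Raise ValueError if the file type or size is not acceptable."""
    ext = Path(filename).suffix.lower()
    if ext not in ALLOWED_EXTENSIONS:
        raise ValueError(f"Unsupported file type: {ext or '(none)'}")
    if size_bytes <= 0:
        raise ValueError("File is empty")
    if size_bytes > MAX_UPLOAD_BYTES:
        raise ValueError("File exceeds the size limit")


def normalize_text(raw: str) -> str:
    """Replace non-printable characters and collapse runs of whitespace."""
    cleaned = "".join(ch if ch.isprintable() or ch.isspace() else " " for ch in raw)
    return re.sub(r"\s+", " ", cleaned).strip()


def extract_key_terms(text: str, top_n: int = 10) -> List[str]:
    """Return the most frequent non-stop-word tokens, most frequent first."""
    tokens = re.findall(r"[a-z][a-z'-]+", text.lower())
    counts = Counter(t for t in tokens if t not in STOP_WORDS)
    return [term for term, _ in counts.most_common(top_n)]


def build_queries(text: str, key_terms: List[str], max_chars: int = 500) -> List[str]:
    """Turn one document into several short text queries for the search service."""
    queries = [text[:max_chars]]  # full-text strategy, truncated to a query-sized prefix
    if key_terms:
        queries.append(" ".join(key_terms))  # key-terms strategy
    return queries
```

Plain term frequency is the simplest possible stand-in for key-term extraction; it could later be swapped for whatever the existing search services already use without changing the surrounding pipeline.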
## Technical Approach
### Backend Dependencies
```
# New libraries needed
python-multipart                                      # For file uploads
PyPDF2 (now maintained as pypdf) or pdfplumber        # PDF text extraction
python-docx                                           # Word (.docx) document processing
```
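
A possible shape for the text extraction step, dispatching on file extension. This is a sketch under assumptions: the function names and the `EXTRACTORS` table are hypothetical, the PDF path uses `pypdf` (the maintained successor to PyPDF2), and the third-party imports are deferred into the functions so that plain-text uploads work even when those packages are missing.

```python
import io
from pathlib import Path


def extract_pdf_text(data: bytes) -> str:
    # Lazy import so plain-text uploads work even if pypdf is not installed.
    from pypdf import PdfReader
    reader = PdfReader(io.BytesIO(data))
    # extract_text() can return an empty string for image-only pages.
    return "\n".join(page.extract_text() or "" for page in reader.pages)


def extract_docx_text(data: bytes) -> str:
    from docx import Document  # provided by the python-docx package
    document = Document(io.BytesIO(data))
    return "\n".join(p.text for p in document.paragraphs)


def extract_txt_text(data: bytes) -> str:
    # Assumes UTF-8; undecodable bytes are replaced rather than raising.
    return data.decode("utf-8", errors="replace")


EXTRACTORS = {
    ".pdf": extract_pdf_text,
    ".docx": extract_docx_text,
    ".txt": extract_txt_text,
}


def extract_text(filename: str, data: bytes) -> str:
    """Pick an extractor from the file extension and return the file's text."""
    ext = Path(filename).suffix.lower()
    extractor = EXTRACTORS.get(ext)
    if extractor is None:
        raise ValueError(f"No text extractor for {ext or 'files without an extension'}")
    return extractor(data)
```

Note that `python-docx` reads only `.docx` files, so legacy `.doc` uploads fall through to the `ValueError` here; supporting them would need an additional converter or library that this plan does not yet name.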
### API Strategy
**Recommendation: Create new endpoints** because:
- Current `/search` expects a text query, not document content
- Need specialized document processing logic
- Need different response format for cross-references
- Upload functionality is entirely new
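
To illustrate the first two points, the new endpoint could act as a thin adapter that turns one document into several text queries and merges the hits. The sketch below is an assumption-heavy outline: `search_fn` is a hypothetical stand-in for the existing `/search` logic (assumed to take a text query and yield `(document_id, score)` pairs), and the output keys are illustrative only.

```python
from typing import Callable, Dict, Iterable, List, Tuple

# Hypothetical stand-in for the existing search logic: takes a text query and
# returns (document_id, score) pairs. The real service's signature may differ.
SearchFn = Callable[[str], Iterable[Tuple[str, float]]]


def find_cross_references(
    queries: List[str], search_fn: SearchFn, limit: int = 10
) -> List[Dict]:
    """Run each query through the search function and merge the hits."""
    best: Dict[str, float] = {}
    hit_counts: Dict[str, int] = {}
    for query in queries:
        seen = set()  # count each document at most once per query
        for doc_id, score in search_fn(query):
            best[doc_id] = max(score, best.get(doc_id, float("-inf")))
            if doc_id not in seen:
                seen.add(doc_id)
                hit_counts[doc_id] = hit_counts.get(doc_id, 0) + 1
    # Rank by best score, breaking ties by how many queries matched the document.
    ranked = sorted(best, key=lambda d: (best[d], hit_counts[d]), reverse=True)
    return [
        {"document_id": d, "similarity": best[d], "matched_queries": hit_counts[d]}
        for d in ranked[:limit]
    ]
```

Taking the maximum score per document is one of several reasonable merge rules; averaging or summing across queries would favour documents that match many strategies at once.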
### Frontend Strategy
- Add new tab to existing tab system
- Use existing styling and components where possible
- Implement file upload using HTML5 File API
- Use existing API calling patterns
## File Structure
### New Backend Files
```
embedding/
├── document_processor.py       # Handle file uploads and text extraction
└── cross_reference_service.py  # Cross-reference logic
```
### New Frontend Components
```
frontend/
├── js/
│   ├── cross-reference.js   # Cross-reference tab logic
│   └── file-upload.js       # File upload utilities
└── css/
    └── cross-reference.css  # Specific styling
```
### API Endpoints Summary
1. **`POST /upload-document`** - New endpoint needed
2. **`POST /find-cross-references`** - New endpoint needed
3. **`GET /documents`** - Use existing
4. **`GET /documents/{id}`** - Use existing
## Development Priority
1. Backend document upload and processing
2. Cross-reference matching logic
3. Frontend tab and upload interface
4. Results display and formatting
5. Error handling and validation
## Benefits of This Approach
- Leverages existing search infrastructure
- Maintains separation of concerns
- Scalable and maintainable
- Consistent with current API patterns
- No database changes needed