agc-chatbot/cf-plan.md
2025-06-04 15:04:53 +08:00

Cross-Reference Tab Implementation Plan

Overview

Create a standalone tab that allows users to upload a document, process it, and find related documents in the existing database.

Current System Analysis

Backend (FastAPI)

  • /search endpoint exists - can find related documents
  • /documents endpoint exists - can retrieve documents
  • No document upload endpoint
  • No document processing for uploaded files

Frontend

  • Tab system exists
  • Basic cross-reference function exists (hardcoded)
  • No file upload functionality
  • No dedicated cross-reference tab

Implementation Plan

Phase 1: Backend API Extensions

New Endpoints Needed

  1. POST /upload-document

    • Accept file upload (PDF, DOC, DOCX, TXT)
    • Extract text content from uploaded file
    • Return processed text and document metadata
    • No database storage - temporary processing only
  2. POST /find-cross-references

    • Accept processed document text
    • Use existing search functionality internally
    • Return related documents with similarity scores
    • Include cross-reference analysis
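As a rough sketch of what the two new endpoints could return, the payloads might be modelled like this. All class and field names here (UploadResult, RelatedDocument, CrossReferenceResponse, similarity, and so on) are hypothetical, chosen only for illustration; the plan does not fix a response schema.

```python
from dataclasses import dataclass, field, asdict


@dataclass
class UploadResult:
    """Hypothetical response body for POST /upload-document."""
    filename: str
    content_type: str
    text: str          # extracted text, returned to the caller and not stored
    char_count: int


@dataclass
class RelatedDocument:
    """One hit in a hypothetical POST /find-cross-references response."""
    document_id: str
    title: str
    similarity: float  # assumed to be normalised to the range 0..1


@dataclass
class CrossReferenceResponse:
    query_char_count: int
    related: list[RelatedDocument] = field(default_factory=list)


def build_response(text: str, hits: list[RelatedDocument]) -> dict:
    """Sort hits by descending similarity and return a JSON-ready dict."""
    ranked = sorted(hits, key=lambda hit: hit.similarity, reverse=True)
    return asdict(CrossReferenceResponse(len(text), ranked))
```

In a FastAPI app these shapes would more likely be Pydantic models; plain dataclasses are used here only to keep the sketch free of dependencies.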

Leverage Existing APIs

  • Use existing /search endpoint logic for finding related documents
  • Use existing /documents endpoint to fetch full related documents
  • Use existing database connection and document retrieval functions

Phase 2: Frontend Implementation

New Tab Structure

  1. Upload Section

    • File drop zone
    • File type validation (PDF, DOC, DOCX, TXT)
    • Upload progress indicator
    • File preview/summary
  2. Processing Section

    • Processing status indicator
    • Document analysis summary
    • Key terms extraction display
  3. Results Section

    • Related documents list
    • Similarity scores
    • Cross-reference details
    • Document preview capability

UI Components Needed

  • File upload widget
  • Progress bars
  • Results grid/list
  • Document preview modal
  • Cross-reference visualization

Phase 3: Processing Logic

Document Processing Pipeline

  1. File Upload & Validation

    • Validate file type and size
    • Extract text content using appropriate libraries
    • Clean and normalize text
  2. Content Analysis

    • Extract key terms and phrases
    • Identify legal concepts
    • Generate search queries from content
  3. Cross-Reference Matching

    • Use existing search service (enhanced_rag_service or simple_search_service)
    • Multiple search strategies:
      • Full text similarity
      • Key terms matching
      • Legal concept matching
    • Rank results by relevance
  4. Results Processing

    • Format cross-reference results
    • Include similarity metrics
    • Group by document type or relevance
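The pipeline steps above can be sketched in a few small functions. This is a minimal illustration under stated assumptions, not the planned implementation: the stop-word list is a tiny placeholder, key terms are picked by raw frequency, and the multiple search strategies are combined with reciprocal rank fusion, which is one possible choice the plan does not prescribe. All function names are hypothetical.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list (assumption); a real one would be larger.
STOP_WORDS = {
    "the", "a", "an", "and", "or", "of", "to", "in", "is",
    "for", "on", "by", "with", "that", "this", "be", "as",
}


def normalize_text(raw: str) -> str:
    """Collapse runs of whitespace into single spaces and trim the ends."""
    return re.sub(r"\s+", " ", raw).strip()


def extract_key_terms(text: str, top_n: int = 10) -> list[str]:
    """Return the most frequent non-stop-word tokens, most frequent first."""
    tokens = re.findall(r"[a-z][a-z'-]+", text.lower())
    counts = Counter(t for t in tokens if t not in STOP_WORDS)
    return [term for term, _ in counts.most_common(top_n)]


def merge_rankings(rankings: list[list[str]], k: int = 60) -> list[tuple[str, float]]:
    """Fuse ranked lists of document ids using reciprocal rank fusion.

    Each input list is one strategy's result, best match first. A document
    scores 1 / (k + rank) per list it appears in; scores are summed.
    """
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties broken by document id for stability.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))
```

Rank fusion only needs each strategy's ordering, so full-text, key-term, and legal-concept searches can be combined even if their raw scores are not on a comparable scale.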

Technical Approach

Backend Dependencies

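# New libraries needed
- python-multipart  # Required by FastAPI to parse multipart file uploads
- pypdf (successor to the deprecated PyPDF2) or pdfplumber  # PDF text extraction
- python-docx  # Word (.docx) processing; legacy .doc files need a separate converter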
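One way these libraries might be wired together is a small dispatcher keyed on file extension. The sketch below makes several assumptions: the 10 MiB size limit and all function names are hypothetical, pypdf (the maintained successor to PyPDF2) is used for PDFs, and legacy .doc is rejected because python-docx reads only .docx. Third-party imports are deferred so the TXT path works without them installed.

```python
import io
from pathlib import Path

MAX_UPLOAD_BYTES = 10 * 1024 * 1024  # assumed limit; the plan does not set one


def _extract_txt(data: bytes) -> str:
    # Undecodable bytes are replaced rather than raising.
    return data.decode("utf-8", errors="replace")


def _extract_pdf(data: bytes) -> str:
    from pypdf import PdfReader  # third-party, imported only when needed
    reader = PdfReader(io.BytesIO(data))
    return "\n".join(page.extract_text() or "" for page in reader.pages)


def _extract_docx(data: bytes) -> str:
    from docx import Document  # provided by the python-docx package
    document = Document(io.BytesIO(data))
    return "\n".join(paragraph.text for paragraph in document.paragraphs)


EXTRACTORS = {".txt": _extract_txt, ".pdf": _extract_pdf, ".docx": _extract_docx}


def extract_text(filename: str, data: bytes) -> str:
    """Validate size and type, then extract text with the matching library."""
    if len(data) > MAX_UPLOAD_BYTES:
        raise ValueError("file too large")
    suffix = Path(filename).suffix.lower()
    extractor = EXTRACTORS.get(suffix)
    if extractor is None:
        raise ValueError(f"unsupported file type: {suffix or '(none)'}")
    return extractor(data)
```

Because everything operates on in-memory bytes, nothing is written to disk, which matches the "temporary processing only" requirement for /upload-document.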

API Strategy

Recommendation: Create new endpoints because:

  • Current /search expects a text query, not document content
  • Need specialized document processing logic
  • Need different response format for cross-references
  • Upload functionality is entirely new
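To illustrate the first point, a new endpoint could bridge the gap by cutting the document into query-sized windows and feeding each one to the existing search logic. The sketch below is an assumption about how that might look: search_fn is a hypothetical stand-in for whatever the current /search handler calls internally, and the window sizes are arbitrary defaults.

```python
from typing import Callable

# Hypothetical signature for the existing search logic: query -> [(doc_id, score)]
SearchFn = Callable[[str], list[tuple[str, float]]]


def split_into_queries(text: str, max_words: int = 40, overlap: int = 10) -> list[str]:
    """Split text into overlapping word windows short enough to use as queries."""
    if max_words <= overlap:
        raise ValueError("max_words must be greater than overlap")
    words = text.split()
    step = max_words - overlap
    queries = []
    for start in range(0, len(words), step):
        queries.append(" ".join(words[start:start + max_words]))
        if start + max_words >= len(words):
            break  # the last window already reaches the end of the text
    return queries


def cross_reference(text: str, search_fn: SearchFn) -> list[tuple[str, float]]:
    """Run every window through search_fn and keep each document's best score."""
    best: dict[str, float] = {}
    for query in split_into_queries(text):
        for doc_id, score in search_fn(query):
            best[doc_id] = max(score, best.get(doc_id, float("-inf")))
    # Best score first; ties broken by document id for a stable order.
    return sorted(best.items(), key=lambda item: (-item[1], item[0]))
```

Keeping the search function as a parameter means the same code could sit in front of either enhanced_rag_service or simple_search_service without changes.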

Frontend Strategy

  • Add new tab to existing tab system
  • Use existing styling and components where possible
  • Implement file upload using HTML5 File API
  • Use existing API calling patterns

File Structure

New Backend Files

embedding/
├── document_processor.py       # Handle file uploads and text extraction
└── cross_reference_service.py  # Cross-reference logic

New Frontend Components

frontend/
├── js/
│   ├── cross-reference.js    # Cross-reference tab logic
│   └── file-upload.js        # File upload utilities
└── css/
    └── cross-reference.css   # Specific styling

API Endpoints Summary

  1. POST /upload-document - New endpoint needed
  2. POST /find-cross-references - New endpoint needed
  3. GET /documents - Use existing
  4. GET /documents/{id} - Use existing

Development Priority

  1. Backend document upload and processing
  2. Cross-reference matching logic
  3. Frontend tab and upload interface
  4. Results display and formatting
  5. Error handling and validation

Benefits of This Approach

  • Leverages existing search infrastructure
  • Maintains separation of concerns
  • Uploads are processed statelessly (nothing is persisted), which keeps the new endpoints simple to scale and maintain
  • Consistent with current API patterns
  • No database changes needed