Architecture
System Diagram
Service Descriptions
cl-frontend
| Property | Value |
|---|---|
| Image | node:20-alpine (custom build) |
| Port | 3000 (public) |
| Tech | Next.js 16, React 19, Tailwind CSS v4 |
| Role | Serves the web UI, proxies API calls to cl-backend |
| Volumes | ./frontend/src and ./frontend/public mounted for hot-reload |
| Env vars | BACKEND_INTERNAL_URL, NEXT_PUBLIC_FRONTEND_URL |
The frontend never communicates directly with the database or Redis. All data access goes through the backend API.
cl-backend
| Property | Value |
|---|---|
| Image | python:3.12-slim (custom build) |
| Port | 8000 (exposed only within cl-network, not published to host) |
| Tech | FastAPI, Uvicorn, SQLAlchemy (async), Alembic, Tesseract OCR |
| Role | REST API, authentication, file uploads, database migrations, seed data |
| Volumes | ./backend (hot-reload), cl-storage, cl-config |
On startup, the backend:
- Creates the `pgvector` PostgreSQL extension
- Runs `alembic upgrade head` to apply all migrations
- Executes `run_seed()` to create the default admin, project, system config, action pathways, and playbook data
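The startup order above can be sketched as follows. This is a minimal illustration, not the backend's actual code: `run_startup` and its three callable parameters are hypothetical names, introduced so the ordering can be shown without a live database. The SQL statement uses `vector`, the extension name that pgvector registers.

```python
from typing import Callable, List

# pgvector registers itself under the extension name "vector".
CREATE_PGVECTOR_SQL = "CREATE EXTENSION IF NOT EXISTS vector"


def run_startup(
    execute_sql: Callable[[str], None],
    upgrade_to_head: Callable[[], None],
    run_seed: Callable[[], None],
) -> List[str]:
    """Run the three startup steps in order and return the names of the steps completed.

    All three callables are hypothetical stand-ins. In a real service they
    would wrap a database connection, `alembic upgrade head`, and the
    project's own run_seed().
    """
    completed: List[str] = []

    # 1. Create the extension first, so vector column types are available.
    execute_sql(CREATE_PGVECTOR_SQL)
    completed.append("extension")

    # 2. Apply all pending migrations.
    upgrade_to_head()
    completed.append("migrations")

    # 3. Seed defaults only once the schema is up to date.
    run_seed()
    completed.append("seed")

    return completed
```

Passing the steps in as callables keeps the ordering testable in isolation. If any step raises, the later steps do not run.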
cl-worker
| Property | Value |
|---|---|
| Image | Same base image as cl-backend (python:3.12-slim) |
| Port | None (no public or internal port) |
| Tech | Celery, same Python dependencies as backend |
| Role | Processes documents through the AI pipeline |
| Command | celery -A app.celery_app worker --loglevel=info --concurrency=${CELERY_CONCURRENCY} |
| Volumes | ./backend (shared code), cl-storage, cl-config |
The worker is built from a separate Dockerfile (Dockerfile.worker), which defaults CELERY_CONCURRENCY to 2. This value controls how many documents can be processed in parallel.
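As a rough illustration of how the concurrency setting feeds the worker command, the sketch below rebuilds the command line from the table in Python. The helper name `worker_command` is hypothetical. In the actual deployment the `${CELERY_CONCURRENCY}` substitution is presumably done by the container tooling rather than by Python code.

```python
import os
from typing import List, Mapping, Optional

DEFAULT_CONCURRENCY = 2  # default stated for Dockerfile.worker


def worker_command(env: Optional[Mapping[str, str]] = None) -> List[str]:
    """Build the Celery worker command, reading CELERY_CONCURRENCY from env.

    Hypothetical helper for illustration; falls back to the documented
    default of 2 when the variable is unset.
    """
    env = os.environ if env is None else env
    concurrency = int(env.get("CELERY_CONCURRENCY", str(DEFAULT_CONCURRENCY)))
    if concurrency < 1:
        raise ValueError("CELERY_CONCURRENCY must be at least 1")
    return [
        "celery",
        "-A",
        "app.celery_app",
        "worker",
        "--loglevel=info",
        f"--concurrency={concurrency}",
    ]
```

For example, `worker_command({"CELERY_CONCURRENCY": "4"})` yields a command ending in `--concurrency=4`.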
cl-db (cl-postgres)
| Property | Value |
|---|---|
| Image | pgvector/pgvector:pg16 |
| Port | 5432 (published to host for development; restrict in production) |
| Tech | PostgreSQL 16 with pgvector extension |
| Role | Primary data store for all application data and vector embeddings |
| Volume | cl-pgdata for persistent storage |
| Health check | pg_isready every 5 seconds |
cl-redis
| Property | Value |
|---|---|
| Image | redis:7-alpine |
| Port | 6379 (published to host for development; restrict in production) |
| Tech | Redis 7 |
| Role | Celery task broker (db 0) and result backend (db 1) |
| Volume | cl-redisdata for persistence |
| Health check | redis-cli ping every 5 seconds |
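The broker/result split across Redis databases 0 and 1 can be expressed as two connection URLs. The sketch below is an assumption-laden illustration: the helper `redis_url` is hypothetical, and it assumes the service name `cl-redis` resolves as a hostname on cl-network. Celery accepts Redis URLs of the form `redis://host:port/db` for both its `broker` and `backend` settings.

```python
def redis_url(db: int, host: str = "cl-redis", port: int = 6379) -> str:
    """Return a Redis URL for the given logical database number.

    Hypothetical helper; the default host assumes the service name is
    reachable as a hostname on the internal network.
    """
    if db < 0:
        raise ValueError("Redis database number must be non-negative")
    return f"redis://{host}:{port}/{db}"


# db 0 carries queued tasks, db 1 stores task results.
BROKER_URL = redis_url(0)
RESULT_BACKEND_URL = redis_url(1)
```

Keeping the two roles in separate logical databases means that flushing stored results does not discard queued tasks.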
Data Flow: Document Processing Pipeline
Pipeline Stages
The pipeline progresses through these statuses:
| Status | Stage | Description |
|---|---|---|
| queued | Pre-flight | AI provider connectivity verified before starting |
| extracting | 1 | Text extraction from PDF (PyPDF2), DOCX (python-docx), or images (Tesseract OCR) |
| classifying | 2 | AI classifies document type (NDA, MSA, SOW, etc.) and extracts parties, dates, governing law |
| playbook | 2b | Playbook embeddings generated; structured fields extracted with playbook context |
| analyzing | 3 | All clauses extracted, categorized, and risk-scored against playbook standards |
| embedding | 4 | Document chunked and embedded into 1536-dimension vectors for semantic search |
| storing | 5 | Detailed contract review report generated with per-clause risk ratings |
| complete | Done | Obligations extracted; document ready for review |
| failed | Error | Pipeline stopped; failed_at_stage and error_message recorded |
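The status progression in the table can be modelled as an enumeration with an explicit ordering. This is a sketch only: the status strings come from the table, while `PipelineStatus`, `HAPPY_PATH`, and `next_status` are hypothetical names and not necessarily how the backend represents them.

```python
from enum import Enum
from typing import List, Optional


class PipelineStatus(str, Enum):
    QUEUED = "queued"
    EXTRACTING = "extracting"
    CLASSIFYING = "classifying"
    PLAYBOOK = "playbook"
    ANALYZING = "analyzing"
    EMBEDDING = "embedding"
    STORING = "storing"
    COMPLETE = "complete"
    FAILED = "failed"


# Order of statuses when no stage fails (hypothetical name).
HAPPY_PATH: List[PipelineStatus] = [
    PipelineStatus.QUEUED,
    PipelineStatus.EXTRACTING,
    PipelineStatus.CLASSIFYING,
    PipelineStatus.PLAYBOOK,
    PipelineStatus.ANALYZING,
    PipelineStatus.EMBEDDING,
    PipelineStatus.STORING,
    PipelineStatus.COMPLETE,
]


def next_status(current: PipelineStatus) -> Optional[PipelineStatus]:
    """Return the status that follows `current`, or None for terminal statuses."""
    if current in (PipelineStatus.COMPLETE, PipelineStatus.FAILED):
        return None
    return HAPPY_PATH[HAPPY_PATH.index(current) + 1]
```

Any non-terminal status may also move to `failed`. That transition is left out of `next_status` because it depends on an error rather than on stage order.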
Data Flow: Version-Aware Pipeline (Subsequent Versions)
When a revised version is uploaded (parent_document_id set), the pipeline runs the same stages but with version-aware logic. Stages marked ★ behave differently.
Version-Aware Pipeline Stages
The pipeline progresses through the same statuses, with differences noted:
| Status | Stage | v1 (Initial) | v2+ (Subsequent) ★ |
|---|---|---|---|
| queued | Pre-flight | AI provider check | Same |
| extracting | 1 | Extract text from document | Same (extract text from new document) |
| classifying | 2 | AI classifies from scratch | ★ Version context injected; carries forward type |
| playbook | 2b | Extract structured fields | ★ Carry forward fields, re-extract only those touched by diff |
| analyzing | 3 | AI analyzes all clauses | ★ Version context injected into clause analysis |
| embedding | 4 | Generate vector embeddings | Same |
| storing | 5 | AI generates full report | ★ Diff-gated: unchanged clauses copied forward, impacted clauses re-validated |
| complete | Done | Obligations extracted fresh | ★ Obligations carried forward from v1, genuinely new ones added |
| failed | Error | Stage + error recorded | Same |
The version-aware pipeline reduces AI token usage by 70% or more for minor revisions: only impacted clauses consume AI tokens, while unchanged clauses are copied forward programmatically.
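The diff-gating idea can be illustrated with a small sketch that splits the clauses of a new version into those copied forward and those needing re-validation. Everything here is an assumption made for illustration: the function names are hypothetical, clauses are assumed to be keyed by a stable identifier, and "unchanged" is taken to mean identical text after whitespace normalization. The real pipeline's diff logic may differ.

```python
import hashlib
from typing import Dict, List, Tuple


def _fingerprint(text: str) -> str:
    """Hash clause text after collapsing runs of whitespace."""
    normalized = " ".join(text.split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def partition_clauses(
    previous: Dict[str, str], current: Dict[str, str]
) -> Tuple[List[str], List[str]]:
    """Return (carried_forward, impacted) clause keys for the new version.

    A clause is carried forward when the same key exists in the previous
    version with identical normalized text. New or modified clauses are
    impacted and would be sent for AI re-validation.
    """
    previous_fingerprints = {key: _fingerprint(text) for key, text in previous.items()}
    carried_forward: List[str] = []
    impacted: List[str] = []
    for key, text in current.items():
        if previous_fingerprints.get(key) == _fingerprint(text):
            carried_forward.append(key)
        else:
            impacted.append(key)
    return carried_forward, impacted
```

Only the keys in `impacted` would incur AI calls, so the saving grows with the share of clauses a revision leaves untouched.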
If the raw text sizes of v1 and v2 differ by more than 2x, the difference is treated as a text extraction artifact rather than an actual content change: per-clause validation is skipped entirely and all clauses are carried forward unchanged.
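A minimal sketch of this size guard follows. It assumes that "size" means character count and that the check is symmetric (either version may be the larger one). The function name and the handling of empty text are illustrative choices, not documented behavior.

```python
def is_extraction_artifact(len_v1: int, len_v2: int, max_ratio: float = 2.0) -> bool:
    """Return True when the two raw text sizes differ by more than max_ratio.

    Hypothetical helper: a True result means per-clause validation would be
    skipped and all clauses carried forward unchanged.
    """
    if len_v1 < 0 or len_v2 < 0:
        raise ValueError("text sizes must be non-negative")
    smaller, larger = sorted((len_v1, len_v2))
    if smaller == 0:
        # Assumption: one empty and one non-empty text counts as an artifact.
        return larger > 0
    return larger / smaller > max_ratio
```

With the default threshold, sizes of 1000 and 2500 characters trip the guard, while 1000 and 2000 do not, because the ratio must exceed 2x.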
Database Schema Overview
Key Tables
| Table | Purpose |
|---|---|
| users | User accounts with roles (admin, user) and bcrypt password hashes |
| projects | Organizational containers (matter, deal, case, project) for documents |
| project_members | Role-based project membership (owner, member, viewer) |
| documents | Uploaded files with pipeline status, classification, version tracking |
| document_raw_texts | Extracted text per page (text, table, image_description) |
| document_metadata | Extracted clauses with type, text, confidence score, and tags |
| document_embeddings | 1536-dimension vector embeddings per chunk (pgvector) |
| document_reports | AI-generated reports (contract_review, nda_triage, etc.) as JSONB |
| document_extracted_data | Structured field extraction (parties, dates, amounts) as JSONB |
| document_actions | Available action pathways per document type |
| document_obligations | Extracted obligations with due dates, recurrence, and responsible parties |
| clause_decisions | Attorney decisions on individual clauses (accept, revise, skip) |
| action_pathway_configs | Configurable action options per document classification |
| cross_document_reports | Cross-document intelligence reports for a project |
| cross_document_decisions | Decisions on cross-document findings |
| playbook_entries | Organization's contract review standards (5 categories) |
| playbook_tags | Tags for filtering playbook entries (practice area, contract type, jurisdiction) |
| playbook_embeddings | Vector embeddings for semantic playbook retrieval |
| conversations | AI chat sessions linked to documents |
| conversation_messages | Individual messages in a conversation with source references |
| system_config | Key-value system configuration (wizard status, file types, etc.) |
| ai_providers | Configured AI providers with encrypted API keys |
| ai_capability_mapping | Maps AI capabilities to specific providers and models |
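To show what semantic retrieval over the embedding tables amounts to, the sketch below ranks chunks by cosine similarity in plain Python. It is illustrative only: in the running system the comparison would be performed by pgvector inside PostgreSQL, and whether the application uses cosine distance or another metric is an assumption. The function names are hypothetical, and the example uses 3-dimension vectors for readability instead of 1536.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_chunks(
    query: Sequence[float],
    chunks: Sequence[Tuple[str, Sequence[float]]],
    k: int = 3,
) -> List[str]:
    """Return the ids of the k chunks most similar to the query embedding."""
    ranked = sorted(
        chunks,
        key=lambda chunk: cosine_similarity(query, chunk[1]),
        reverse=True,
    )
    return [chunk_id for chunk_id, _ in ranked[:k]]


# Toy data: chunk ids paired with 3-dimension embeddings.
example_chunks = [
    ("a", [1.0, 0.0, 0.0]),
    ("b", [0.0, 1.0, 0.0]),
    ("c", [1.0, 1.0, 0.0]),
]
closest = top_chunks([1.0, 0.0, 0.0], example_chunks, k=2)
```

Here `closest` is `["a", "c"]`: chunk `a` points in the same direction as the query, and `c` is the next nearest.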