Architecture

System Diagram

Service Descriptions

cl-frontend

Property	Value
Image	`node:20-alpine` (custom build)
Port	`3000` (public)
Tech	Next.js 16, React 19, Tailwind CSS v4
Role	Serves the web UI, proxies API calls to cl-backend
Volumes	`./frontend/src` and `./frontend/public` mounted for hot-reload
Env vars	`BACKEND_INTERNAL_URL`, `NEXT_PUBLIC_FRONTEND_URL`

The frontend never communicates directly with the database or Redis. All data access goes through the backend API.

cl-backend

Property	Value
Image	`python:3.12-slim` (custom build)
Port	`8000` (exposed only within `cl-network`, not published to host)
Tech	FastAPI, Uvicorn, SQLAlchemy (async), Alembic, Tesseract OCR
Role	REST API, authentication, file uploads, database migrations, seed data
Volumes	`./backend` (hot-reload), `cl-storage`, `cl-config`

On startup, the backend:

1. Creates the pgvector PostgreSQL extension
2. Runs `alembic upgrade head` to apply all migrations
3. Executes `run_seed()` to create the default admin, project, system config, action pathways, and playbook data
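
The ordering matters: the extension has to exist before migrations can create vector columns, and seeding needs the migrated tables. A minimal sketch of that sequence follows. The function name `run_startup` and the injected callables are hypothetical, and `IF NOT EXISTS` is an assumption about how the extension is created, not the backend's actual code.

```python
from typing import Callable, List


def run_startup(
    execute_sql: Callable[[str], None],
    run_migrations: Callable[[], None],
    run_seed: Callable[[], None],
) -> List[str]:
    """Run the three startup steps in order; return the names of completed steps."""
    completed: List[str] = []

    # Step 1: create the pgvector extension (assumed idempotent via IF NOT EXISTS).
    execute_sql("CREATE EXTENSION IF NOT EXISTS vector")
    completed.append("extension")

    # Step 2: apply migrations, e.g. by invoking `alembic upgrade head`.
    run_migrations()
    completed.append("migrations")

    # Step 3: seed default data once the schema is in place.
    run_seed()
    completed.append("seed")

    return completed
```

Passing the steps in as callables keeps the ordering testable without a live database; if any step raises, the later steps do not run.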

cl-worker

Property	Value
Image	Same as cl-backend (`python:3.12-slim`)
Port	None (no public or internal port)
Tech	Celery, same Python dependencies as backend
Role	Processes documents through the AI pipeline
Command	`celery -A app.celery_app worker --loglevel=info --concurrency=${CELERY_CONCURRENCY}`
Volumes	`./backend` (shared code), `cl-storage`, `cl-config`

The worker uses a separate Dockerfile (`Dockerfile.worker`) that sets a default of `CELERY_CONCURRENCY=2`. This value controls how many documents the worker processes in parallel.
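
As an illustration of how the concurrency setting feeds the worker command, the sketch below resolves `CELERY_CONCURRENCY` with the documented default of 2 and builds the argument list. The helper `worker_command` and its validation of positive values are assumptions for illustration; in the real deployment the shell expands the variable.

```python
import os
from typing import List, Mapping, Optional

DEFAULT_CONCURRENCY = 2  # default documented for Dockerfile.worker


def worker_command(env: Optional[Mapping[str, str]] = None) -> List[str]:
    """Build the Celery worker argv from CELERY_CONCURRENCY (hypothetical helper)."""
    env = os.environ if env is None else env
    concurrency = int(env.get("CELERY_CONCURRENCY", str(DEFAULT_CONCURRENCY)))
    if concurrency < 1:
        # Assumption: reject zero or negative values rather than pass them on.
        raise ValueError("CELERY_CONCURRENCY must be a positive integer")
    return [
        "celery", "-A", "app.celery_app", "worker",
        "--loglevel=info", f"--concurrency={concurrency}",
    ]
```

For example, `worker_command({"CELERY_CONCURRENCY": "4"})` yields a command ending in `--concurrency=4`, so up to four documents can be in the pipeline at once on that worker.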

cl-db (cl-postgres)

Property	Value
Image	`pgvector/pgvector:pg16`
Port	`5432` (published to host for development; restrict in production)
Tech	PostgreSQL 16 with pgvector extension
Role	Primary data store for all application data and vector embeddings
Volume	`cl-pgdata` for persistent storage
Health check	`pg_isready` every 5 seconds

cl-redis

Property	Value
Image	`redis:7-alpine`
Port	`6379` (published to host for development; restrict in production)
Tech	Redis 7
Role	Celery task broker (db 0) and result backend (db 1)
Volume	`cl-redisdata` for persistence
Health check	`redis-cli ping` every 5 seconds
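
The broker and result backend share one Redis instance but use different logical databases. A small sketch of the two connection URLs follows. It assumes the service name `cl-redis` resolves as a hostname inside `cl-network`; the helper `redis_url` is hypothetical.

```python
def redis_url(db: int, host: str = "cl-redis", port: int = 6379) -> str:
    """Return a Redis URL for the given logical database number."""
    return f"redis://{host}:{port}/{db}"


# db 0 carries queued Celery tasks; db 1 stores task results.
BROKER_URL = redis_url(0)
RESULT_BACKEND_URL = redis_url(1)
```

These strings are the kind of values a Celery app takes as its `broker` and `backend` settings. Keeping the two roles in separate databases means flushing results does not touch queued tasks.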

Data Flow: Document Processing Pipeline

Pipeline Stages

The pipeline progresses through these statuses:

Status	Stage	Description
`queued`	Pre-flight	AI provider connectivity verified before starting
`extracting`	1	Text extraction from PDF (PyPDF2), DOCX (python-docx), or images (Tesseract OCR)
`classifying`	2	AI classifies document type (NDA, MSA, SOW, etc.) and extracts parties, dates, governing law
`playbook`	2b	Playbook embeddings generated; structured fields extracted with playbook context
`analyzing`	3	All clauses extracted, categorized, and risk-scored against playbook standards
`embedding`	4	Document chunked and embedded into 1536-dimension vectors for semantic search
`storing`	5	Detailed contract review report generated with per-clause risk ratings
`complete`	Done	Obligations extracted; document ready for review
`failed`	Error	Pipeline stopped; `failed_at_stage` and `error_message` recorded
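
The status progression can be sketched as a small state machine. The status names and the `failed_at_stage` and `error_message` fields come from the table above; the `DocumentState` class and the `advance` and `fail` helpers are hypothetical and only illustrate the transitions.

```python
from dataclasses import dataclass
from typing import Optional

# Happy-path order of statuses, taken from the stage table.
PIPELINE_ORDER = [
    "queued", "extracting", "classifying", "playbook",
    "analyzing", "embedding", "storing", "complete",
]


@dataclass
class DocumentState:
    status: str = "queued"
    failed_at_stage: Optional[str] = None
    error_message: Optional[str] = None


def advance(doc: DocumentState) -> DocumentState:
    """Move the document to the next status in the pipeline."""
    if doc.status in ("complete", "failed"):
        raise ValueError(f"cannot advance from terminal status {doc.status!r}")
    doc.status = PIPELINE_ORDER[PIPELINE_ORDER.index(doc.status) + 1]
    return doc


def fail(doc: DocumentState, message: str) -> DocumentState:
    """Stop the pipeline, recording where and why it failed."""
    doc.failed_at_stage = doc.status
    doc.error_message = message
    doc.status = "failed"
    return doc
```

Recording `failed_at_stage` before overwriting `status` preserves the stage that was running when the error occurred.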

Data Flow: Version-Aware Pipeline (Subsequent Versions)

When a revised version is uploaded (`parent_document_id` is set), the pipeline runs the same stages but with version-aware logic. Stages marked ★ behave differently.

Version-Aware Pipeline Stages

The pipeline progresses through the same statuses, with differences noted:

Status	Stage	v1 (Initial)	v2+ (Subsequent) ★
`queued`	Pre-flight	AI provider check	Same
`extracting`	1	Extract text from document	Same — extract text from new document
`classifying`	2	AI classifies from scratch	★ Version context injected — carries forward type
`playbook`	2b	Extract structured fields	★ Carry forward fields, re-extract only those touched by diff
`analyzing`	3	AI analyzes all clauses	★ Version context injected into clause analysis
`embedding`	4	Generate vector embeddings	Same
`storing`	5	AI generates full report	★ Diff-gated: unchanged clauses copied forward, impacted clauses re-validated
`complete`	Done	Obligations extracted fresh	★ Obligations carried forward from v1, genuinely new ones added
`failed`	Error	Stage + error recorded	Same

Cost Reduction

The version-aware pipeline reduces AI token usage by 70% or more for minor revisions. Only impacted clauses consume AI tokens; unchanged clauses are copied forward programmatically.
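
To make the diff-gating idea concrete, here is a minimal sketch that copies results forward for unchanged clauses and calls the AI only for new or changed ones. Matching clauses by ID and comparing a whitespace-normalized SHA-256 fingerprint are assumptions made for illustration; the actual diff logic is not described here, and all names below are hypothetical.

```python
import hashlib
from typing import Callable, Dict, Tuple


def clause_fingerprint(text: str) -> str:
    """Hash a clause after collapsing whitespace and lowercasing (assumed normalization)."""
    normalized = " ".join(text.split()).lower()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def review_revision(
    prev_clauses: Dict[str, str],
    prev_results: Dict[str, str],
    new_clauses: Dict[str, str],
    analyze: Callable[[str], str],
) -> Tuple[Dict[str, str], int]:
    """Return per-clause results for the new version and the number of AI calls made."""
    results: Dict[str, str] = {}
    ai_calls = 0
    for clause_id, text in new_clauses.items():
        old_text = prev_clauses.get(clause_id)
        unchanged = (
            old_text is not None
            and clause_id in prev_results
            and clause_fingerprint(old_text) == clause_fingerprint(text)
        )
        if unchanged:
            # Copied forward programmatically: no tokens spent.
            results[clause_id] = prev_results[clause_id]
        else:
            # New or modified clause: re-validate with the AI.
            results[clause_id] = analyze(text)
            ai_calls += 1
    return results, ai_calls
```

In a revision where one clause out of ten changes, this sketch would make a single AI call, which is where the token savings for minor revisions come from.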

Size Guard

If the raw text sizes of v1 and v2 differ by more than a factor of 2, the difference is treated as a text extraction artifact rather than an actual content change. In that case, per-clause validation is skipped entirely and all clauses are carried forward unchanged.
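
The guard amounts to a ratio check on the two text lengths. The sketch below assumes the comparison is symmetric (v2 may be more than twice as large or less than half as large) and that an empty text on either side also trips the guard; both are assumptions, and `size_guard_triggered` is a hypothetical name.

```python
def size_guard_triggered(prev_len: int, new_len: int, max_ratio: float = 2.0) -> bool:
    """Return True when per-clause validation should be skipped."""
    if prev_len <= 0 or new_len <= 0:
        # Assumption: an empty extraction is treated as an artifact too.
        return True
    ratio = max(prev_len, new_len) / min(prev_len, new_len)
    return ratio > max_ratio
```

With these assumptions, lengths of 1,000 and 1,900 characters pass the guard, while 1,000 and 2,100 trip it in either direction.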

Database Schema Overview

Key Tables

Table	Purpose
`users`	User accounts with roles (admin, user) and bcrypt password hashes
`projects`	Organizational containers (matter, deal, case, project) for documents
`project_members`	Role-based project membership (owner, member, viewer)
`documents`	Uploaded files with pipeline status, classification, version tracking
`document_raw_texts`	Extracted text per page (text, table, image_description)
`document_metadata`	Extracted clauses with type, text, confidence score, and tags
`document_embeddings`	1536-dimension vector embeddings per chunk (pgvector)
`document_reports`	AI-generated reports (contract_review, nda_triage, etc.) as JSONB
`document_extracted_data`	Structured field extraction (parties, dates, amounts) as JSONB
`document_actions`	Available action pathways per document type
`document_obligations`	Extracted obligations with due dates, recurrence, and responsible parties
`clause_decisions`	Attorney decisions on individual clauses (accept, revise, skip)
`action_pathway_configs`	Configurable action options per document classification
`cross_document_reports`	Cross-document intelligence reports for a project
`cross_document_decisions`	Decisions on cross-document findings
`playbook_entries`	Organization's contract review standards (5 categories)
`playbook_tags`	Tags for filtering playbook entries (practice area, contract type, jurisdiction)
`playbook_embeddings`	Vector embeddings for semantic playbook retrieval
`conversations`	AI chat sessions linked to documents
`conversation_messages`	Individual messages in a conversation with source references
`system_config`	Key-value system configuration (wizard status, file types, etc.)
`ai_providers`	Configured AI providers with encrypted API keys
`ai_capability_mapping`	Maps AI capabilities to specific providers and models
`groups`	Named collections of users for bulk access management
`user_groups`	Many-to-many join between users and groups (with `added_by` audit trail)
`project_access`	Per-project access grants for individual users or groups, with access levels (viewer, editor, admin)
`sso_config`	OIDC Single Sign-On configuration (provider, client credentials encrypted, default role/group)
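
As an illustration of how `user_groups` and `project_access` could combine, the sketch below resolves a user's effective access level on a project from direct and group grants. The ordering viewer < editor < admin and the rule that the highest grant wins are assumptions, not documented behavior; the `Grant` type and `effective_access` function are hypothetical.

```python
from dataclasses import dataclass
from typing import Iterable, Optional, Set

# Assumed ordering of the documented access levels, lowest to highest.
LEVEL_RANK = {"viewer": 0, "editor": 1, "admin": 2}


@dataclass(frozen=True)
class Grant:
    project_id: int
    level: str
    user_id: Optional[int] = None   # set for a direct user grant
    group_id: Optional[int] = None  # set for a group grant


def effective_access(
    user_id: int,
    group_ids: Set[int],
    project_id: int,
    grants: Iterable[Grant],
) -> Optional[str]:
    """Return the highest level granted to the user directly or via a group."""
    best: Optional[str] = None
    for grant in grants:
        if grant.project_id != project_id:
            continue
        applies = grant.user_id == user_id or (
            grant.group_id is not None and grant.group_id in group_ids
        )
        if applies and (best is None or LEVEL_RANK[grant.level] > LEVEL_RANK[best]):
            best = grant.level
    return best
```

A return value of `None` means no grant applies, so the caller would deny access.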
