
Architecture

System Diagram

Service Descriptions

cl-frontend

| Property | Value |
| --- | --- |
| Image | `node:20-alpine` (custom build) |
| Port | 3000 (public) |
| Tech | Next.js 16, React 19, Tailwind CSS v4 |
| Role | Serves the web UI, proxies API calls to `cl-backend` |
| Volumes | `./frontend/src` and `./frontend/public` mounted for hot-reload |
| Env vars | `BACKEND_INTERNAL_URL`, `NEXT_PUBLIC_FRONTEND_URL` |

The frontend never communicates directly with the database or Redis. All data access goes through the backend API.

cl-backend

| Property | Value |
| --- | --- |
| Image | `python:3.12-slim` (custom build) |
| Port | 8000 (exposed only within `cl-network`, not published to host) |
| Tech | FastAPI, Uvicorn, SQLAlchemy (async), Alembic, Tesseract OCR |
| Role | REST API, authentication, file uploads, database migrations, seed data |
| Volumes | `./backend` (hot-reload), `cl-storage`, `cl-config` |

On startup, the backend:

  1. Creates the pgvector PostgreSQL extension
  2. Runs alembic upgrade head to apply all migrations
  3. Executes run_seed() to create the default admin, project, system config, action pathways, and playbook data
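
The startup sequence above can be sketched roughly as follows. This is an illustration, not the real implementation: the connection object and the `run_seed` coroutine are assumed and injected; only the extension SQL and the `alembic upgrade head` command come from the description.

```python
# Hypothetical sketch of the backend startup sequence described above.
import asyncio
import subprocess

CREATE_VECTOR_EXT = "CREATE EXTENSION IF NOT EXISTS vector"

async def startup(conn, run_seed):
    """conn: an async DB connection (assumed); run_seed: the seed coroutine."""
    # 1. Ensure pgvector exists before any migration that depends on it.
    await conn.execute(CREATE_VECTOR_EXT)
    # 2. Apply all migrations, exactly as `alembic upgrade head` does.
    subprocess.run(["alembic", "upgrade", "head"], check=True)
    # 3. Seed default admin, project, system config, pathways, playbook data.
    await run_seed()
```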

cl-worker

| Property | Value |
| --- | --- |
| Image | Same as `cl-backend` (`python:3.12-slim`) |
| Port | None (no public or internal port) |
| Tech | Celery, same Python dependencies as the backend |
| Role | Processes documents through the AI pipeline |
| Command | `celery -A app.celery_app worker --loglevel=info --concurrency=${CELERY_CONCURRENCY}` |
| Volumes | `./backend` (shared code), `cl-storage`, `cl-config` |

The worker uses a separate Dockerfile (Dockerfile.worker) with a default CELERY_CONCURRENCY=2. This controls how many documents can be processed in parallel.
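
The worker's moving parts can be summarized in a small configuration sketch. The `REDIS_URL` variable name and the dict shape are assumptions; the concurrency default of 2 and the db 0 / db 1 split come from the descriptions above.

```python
# Sketch of the worker settings described above (names are assumptions).
import os

REDIS_URL = os.environ.get("REDIS_URL", "redis://cl-redis:6379")

celery_settings = {
    "broker_url": f"{REDIS_URL}/0",      # Celery task broker (Redis db 0)
    "result_backend": f"{REDIS_URL}/1",  # task result backend (Redis db 1)
    # Default of 2 matches Dockerfile.worker; this bounds how many
    # documents are processed in parallel.
    "worker_concurrency": int(os.environ.get("CELERY_CONCURRENCY", "2")),
}
```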

cl-db (cl-postgres)

| Property | Value |
| --- | --- |
| Image | `pgvector/pgvector:pg16` |
| Port | 5432 (published to host for development; restrict in production) |
| Tech | PostgreSQL 16 with the pgvector extension |
| Role | Primary data store for all application data and vector embeddings |
| Volume | `cl-pgdata` for persistent storage |
| Health check | `pg_isready` every 5 seconds |

cl-redis

| Property | Value |
| --- | --- |
| Image | `redis:7-alpine` |
| Port | 6379 (published to host for development; restrict in production) |
| Tech | Redis 7 |
| Role | Celery task broker (db 0) and result backend (db 1) |
| Volume | `cl-redisdata` for persistence |
| Health check | `redis-cli ping` every 5 seconds |

Data Flow: Document Processing Pipeline

Pipeline Stages

The pipeline progresses through these statuses:

| Status | Stage | Description |
| --- | --- | --- |
| `queued` | Pre-flight | AI provider connectivity verified before starting |
| `extracting` | 1 | Text extraction from PDF (PyPDF2), DOCX (python-docx), or images (Tesseract OCR) |
| `classifying` | 2 | AI classifies document type (NDA, MSA, SOW, etc.) and extracts parties, dates, governing law |
| `playbook` | 2b | Playbook embeddings generated; structured fields extracted with playbook context |
| `analyzing` | 3 | All clauses extracted, categorized, and risk-scored against playbook standards |
| `embedding` | 4 | Document chunked and embedded into 1536-dimension vectors for semantic search |
| `storing` | 5 | Detailed contract review report generated with per-clause risk ratings |
| `complete` | Done | Obligations extracted; document ready for review |
| `failed` | Error | Pipeline stopped; `failed_at_stage` and `error_message` recorded |
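
The status progression above can be sketched as a simple driver loop. This is illustrative only: the stage implementations are assumed and passed in as callables; the status names and the `failed_at_stage` / `error_message` fields mirror the table.

```python
# Illustrative driver for the pipeline statuses above.
PIPELINE_STAGES = [
    ("extracting", "extract_text"),
    ("classifying", "classify"),
    ("playbook", "playbook_fields"),
    ("analyzing", "analyze_clauses"),
    ("embedding", "embed_chunks"),
    ("storing", "generate_report"),
]

def run_pipeline(document: dict, stages: dict) -> dict:
    document["status"] = "queued"      # pre-flight: AI provider check
    stages["preflight"](document)
    for status, stage_name in PIPELINE_STAGES:
        document["status"] = status
        try:
            stages[stage_name](document)
        except Exception as exc:
            # Record where and why the pipeline stopped, as the table notes.
            document["status"] = "failed"
            document["failed_at_stage"] = status
            document["error_message"] = str(exc)
            return document
    document["status"] = "complete"    # obligations extracted, ready for review
    return document
```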

Data Flow: Version-Aware Pipeline (Subsequent Versions)

When a revised version is uploaded (parent_document_id set), the pipeline runs the same stages but with version-aware logic. Stages marked ★ behave differently.

Version-Aware Pipeline Stages

The pipeline progresses through the same statuses, with differences noted:

| Status | Stage | v1 (Initial) | v2+ (Subsequent) ★ |
| --- | --- | --- | --- |
| `queued` | Pre-flight | AI provider check | Same |
| `extracting` | 1 | Extract text from document | Same — extract text from the new document |
| `classifying` | 2 | AI classifies from scratch | ★ Version context injected — carries forward type |
| `playbook` | 2b | Extract structured fields | ★ Carry forward fields; re-extract only those touched by the diff |
| `analyzing` | 3 | AI analyzes all clauses | ★ Version context injected into clause analysis |
| `embedding` | 4 | Generate vector embeddings | Same |
| `storing` | 5 | AI generates full report | ★ Diff-gated: unchanged clauses copied forward, impacted clauses re-validated |
| `complete` | Done | Obligations extracted fresh | ★ Obligations carried forward from v1; genuinely new ones added |
| `failed` | Error | Stage + error recorded | Same |

Cost Reduction

The version-aware pipeline reduces AI token usage by 70%+ for minor revisions. Only impacted clauses consume AI tokens — unchanged clauses are copied forward programmatically.

Size Guard

If the raw text sizes of v1 and v2 differ by more than 2x, the difference is treated as a text-extraction artifact rather than a real content change: per-clause validation is skipped entirely and all clauses are carried forward unchanged.
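
The diff gate and size guard can be sketched as follows. The clause dict shape and the `impacted_ids` diff output are assumptions; the carry-forward behavior and the 2x threshold come from the description above.

```python
# Hypothetical sketch of the diff-gated carry-forward with the size guard.
def carry_forward(v1_clauses, impacted_ids, v1_len, v2_len):
    """Return (clauses_to_copy_forward, clauses_to_revalidate)."""
    # Size guard: a >2x raw-text size difference is treated as an
    # extraction artifact, so every clause is carried forward unchanged.
    if max(v1_len, v2_len) > 2 * min(v1_len, v2_len):
        return list(v1_clauses), []

    # Diff gate: only clauses touched by the diff consume AI tokens;
    # everything else is copied forward programmatically.
    copy, revalidate = [], []
    for clause in v1_clauses:
        (revalidate if clause["id"] in impacted_ids else copy).append(clause)
    return copy, revalidate
```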

Database Schema Overview

Key Tables

| Table | Purpose |
| --- | --- |
| `users` | User accounts with roles (admin, user) and bcrypt password hashes |
| `projects` | Organizational containers (matter, deal, case, project) for documents |
| `project_members` | Role-based project membership (owner, member, viewer) |
| `documents` | Uploaded files with pipeline status, classification, and version tracking |
| `document_raw_texts` | Extracted text per page (text, table, image_description) |
| `document_metadata` | Extracted clauses with type, text, confidence score, and tags |
| `document_embeddings` | 1536-dimension vector embeddings per chunk (pgvector) |
| `document_reports` | AI-generated reports (contract_review, nda_triage, etc.) as JSONB |
| `document_extracted_data` | Structured field extraction (parties, dates, amounts) as JSONB |
| `document_actions` | Available action pathways per document type |
| `document_obligations` | Extracted obligations with due dates, recurrence, and responsible parties |
| `clause_decisions` | Attorney decisions on individual clauses (accept, revise, skip) |
| `action_pathway_configs` | Configurable action options per document classification |
| `cross_document_reports` | Cross-document intelligence reports for a project |
| `cross_document_decisions` | Decisions on cross-document findings |
| `playbook_entries` | Organization's contract review standards (5 categories) |
| `playbook_tags` | Tags for filtering playbook entries (practice area, contract type, jurisdiction) |
| `playbook_embeddings` | Vector embeddings for semantic playbook retrieval |
| `conversations` | AI chat sessions linked to documents |
| `conversation_messages` | Individual messages in a conversation, with source references |
| `system_config` | Key-value system configuration (wizard status, file types, etc.) |
| `ai_providers` | Configured AI providers with encrypted API keys |
| `ai_capability_mapping` | Maps AI capabilities to specific providers and models |
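
As an illustration of the semantic search that `document_embeddings` supports, here is a pure-Python sketch of cosine-similarity ranking over chunk vectors. In the real system this runs inside PostgreSQL via pgvector against 1536-dimension embeddings; the function names and the `(chunk_id, embedding)` shape here are assumptions.

```python
# Pure-Python illustration of semantic search over chunk embeddings.
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def top_k_chunks(query_vec, chunks, k=5):
    """chunks: list of (chunk_id, embedding) pairs, e.g. 1536-dim vectors."""
    ranked = sorted(chunks, key=lambda c: cosine_similarity(query_vec, c[1]),
                    reverse=True)
    return ranked[:k]
```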