AI Knowledge Platform

Ongoing · Live

Modular AI knowledge platform, live at platform.bridging-data.com. The platform includes Technology Radar, Competitor Intelligence, Regulatory Radar, and AI Chat, all running serverless on AWS. Phase 6 features a Corporate LLM prototype with Cognito authentication.

Technologies

Python · FastAPI · PostgreSQL · pgvector · AWS RDS · OpenAI API · Anthropic Claude · AWS S3 · AWS Lambda (Container) · AWS EventBridge · AWS SNS · AWS CloudFront · SQLAlchemy · Alembic · Docker · Next.js · Mangum · boto3 · pdfplumber · BeautifulSoup4 · marked.js · GitHub Actions

Problem

Organisations need a cost-efficient AI system that processes diverse document formats, retrieves contextual answers with transparent source attribution, and scales from a public demo to private knowledge management — without vendor lock-in.

Approach

Layered architecture with separate ingestion, processing, storage, retrieval, and LLM layers. PostgreSQL + pgvector (HNSW index) for vector storage, SHA-256 deduplication to avoid unnecessary re-embeddings, abstracted LLM providers (OpenAI/Anthropic) for vendor independence. Cost controls built in as a first-class requirement.
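A minimal sketch of the SHA-256 deduplication idea, assuming an in-memory store and an injected embedding function. The class name `DedupEmbeddingStore`, the `fake_embed` stand-in, and the report keys are hypothetical. The real platform stores vectors in PostgreSQL + pgvector and calls a provider API, neither of which is shown here.

```python
import hashlib
from typing import Callable, Dict, List

Vector = List[float]


class DedupEmbeddingStore:
    """Embeds a document only if its SHA-256 content hash has not been seen."""

    def __init__(self, embed: Callable[[str], Vector]) -> None:
        self._embed = embed                     # provider-agnostic embedding call
        self._by_hash: Dict[str, Vector] = {}   # content hash -> stored vector
        self._key_to_hash: Dict[str, str] = {}  # source key -> content hash

    @staticmethod
    def content_hash(text: str) -> str:
        return hashlib.sha256(text.encode("utf-8")).hexdigest()

    def ingest(self, key: str, text: str) -> bool:
        """Return True if a new embedding was computed, False if one was reused."""
        digest = self.content_hash(text)
        self._key_to_hash[key] = digest
        if digest in self._by_hash:
            return False  # identical content: skip the paid embedding call
        self._by_hash[digest] = self._embed(text)
        return True


calls: List[str] = []


def fake_embed(text: str) -> Vector:
    # Hypothetical stand-in for a real embedding provider; records each call.
    calls.append(text)
    return [float(len(text)), 0.0]


store = DedupEmbeddingStore(fake_embed)
first = store.ingest("reports/2024-01-01.md", "weekly radar report")
second = store.ingest("reports/2024-01-08.md", "weekly radar report")
```

Because the lookup is keyed on the content hash rather than the storage key, the second call reuses the existing vector even though the document arrives under a different key.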

Result

Phase 1 ✓: Public RAG demo live at bridging-data.com.

Phase 2 ✓: Technology Radar pipeline (weekly Lambda, Mon 06:00 UTC).

Phase 3 ✓: Private Knowledge Hub (65+ documents indexed, local).

Phase 4 ✓: Regulatory Radar pipeline (monthly Lambda, 6 sources, including NIST, OWASP, FINMA, EU AI Act, GDPR).

Phase 5 ✓: Competitor Radar pipeline (weekly Lambda, 6 companies, 21 sources).

Phase 5+ ✓: Knowledge Platform UI (Agent Center, AI Chat, Skills Hub, Reports Dashboard; Next.js SPA).

Phase 5+ ✓: ReportIndexerAgent, which automatically indexes new pipeline reports into the vector store after each run.

Metrics: 5 Lambda functions deployed · 3 automated pipelines · 72+ documents in vector store · 114 competitor signals · 21 reports generated.
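The weekly schedule of the Technology Radar pipeline (Mon 06:00 UTC) could be written as an EventBridge cron expression along the following lines. The source does not show the actual rule, so the constant and the helper `next_weekly_run` are illustrative assumptions. The helper only mirrors the schedule locally, for example to display the next run time.

```python
from datetime import datetime, timedelta, timezone

# EventBridge rule cron fields: minutes hours day-of-month month day-of-week year.
# Rule schedules are evaluated in UTC; "?" marks the unused day-of-month field.
WEEKLY_RADAR_SCHEDULE = "cron(0 6 ? * MON *)"  # assumed expression, not from the source


def next_weekly_run(now: datetime, weekday: int = 0, hour: int = 6) -> datetime:
    """Return the first occurrence of weekday/hour (UTC) strictly after `now`.

    `now` must be timezone-aware; weekday follows datetime.weekday() (Monday = 0).
    """
    now = now.astimezone(timezone.utc)
    candidate = now.replace(hour=hour, minute=0, second=0, microsecond=0)
    candidate += timedelta(days=(weekday - now.weekday()) % 7)
    if candidate <= now:
        candidate += timedelta(days=7)  # this week's slot has already passed
    return candidate


# 2024-01-03 is a Wednesday, so the next run falls on Monday 2024-01-08.
upcoming = next_weekly_run(datetime(2024, 1, 3, 12, 0, tzinfo=timezone.utc))
```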

Learnings

Python package names must not shadow stdlib modules (platform → aiplatform).

Alembic requires synchronous drivers (psycopg2) while FastAPI uses async (asyncpg); two separate connection strings solve this.

The HNSW index in pgvector requires no training phase.

Cost controls must be built into core infrastructure early, not added later.

asyncpg warm-container reuse: await engine.dispose() must run inside the same asyncio.run() event loop.

Docker ECR layer caching on Windows requires a timestamp-tag workaround, as identical layer hashes are not cached deterministically.

Session isolation in multi-phase pipelines: each phase opens its own DB session so that failures in later phases do not roll back commits from earlier ones.

Content-hash dedup prevents unnecessary re-embeddings when pipeline output is identical under a new S3 key.
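The session-isolation learning can be illustrated with a small sketch. It uses the stdlib `sqlite3` module as a stand-in for the platform's PostgreSQL/SQLAlchemy sessions. The helper `phase_session`, the `signals` table, and the phase names are hypothetical. Each phase gets its own connection and transaction, so a failure in the second phase leaves the first phase's commit intact.

```python
import os
import sqlite3
import tempfile
from contextlib import contextmanager


@contextmanager
def phase_session(db_path: str):
    """Open a dedicated connection for one phase; commit on success, roll back on error."""
    conn = sqlite3.connect(db_path)
    try:
        yield conn
        conn.commit()
    except Exception:
        conn.rollback()
        raise
    finally:
        conn.close()


def run_pipeline(db_path: str) -> list:
    completed = []

    # Phase 1: collect. Committed when its own session closes.
    with phase_session(db_path) as conn:
        conn.execute("CREATE TABLE IF NOT EXISTS signals (phase TEXT, payload TEXT)")
        conn.execute("INSERT INTO signals VALUES ('collect', 'raw item')")
    completed.append("collect")

    # Phase 2: report. Fails, but only its own transaction is rolled back.
    try:
        with phase_session(db_path) as conn:
            conn.execute("INSERT INTO signals VALUES ('report', 'draft')")
            raise RuntimeError("simulated failure in report phase")
    except RuntimeError:
        pass
    else:
        completed.append("report")

    return completed


db_path = os.path.join(tempfile.mkdtemp(), "pipeline.db")
completed = run_pipeline(db_path)

check = sqlite3.connect(db_path)
rows = check.execute("SELECT phase FROM signals").fetchall()
check.close()
```

With a single shared session, the rollback triggered by the second phase could also discard uncommitted work from the first. Separate sessions keep each phase's commit boundary independent.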

Relevance

Demonstrates complete data pipeline thinking (ingestion → preprocessing → vector storage → retrieval → LLM response generation), serverless AWS automation with EventBridge and Lambda, practical cost management, and the ability to design and operate a growing, layered AI platform from scratch in production.

Architecture
