Private RAG on-prem — reference architecture.

The stack EMARQUE specialists deploy most often when an enterprise wants to chat with its own documents without sending a sentence to a cloud API. Permission-aware, citation-grounded, and evaluated before it ships.

The stack

Six layers, top to bottom.

1. Hardware substrate

AI PRO 500 for departmental deployments (≤ 100 concurrent users); EMARQUE AI Server with 4–8 × H200 NVL or B200 for org-wide deployments. ECC memory, plus a U.2 NVMe pool for vector index residency.

2. Inference runtime

vLLM (preferred for production), Ollama, or NVIDIA Triton — pinned versions, NUMA-aware tensor parallelism, paged-attention enabled. Open-weight model (Llama 3.x, GPT-OSS, DeepSeek) sized to fit GPU memory at production concurrency.
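
To make "sized to fit GPU memory at production concurrency" concrete, here is a rough back-of-envelope sketch. It assumes memory is dominated by model weights plus KV cache, with a flat overhead fraction for activations and runtime buffers. The function names, the 10% overhead, and the example model shape (80 layers, 8 KV heads, head dimension 128) are illustrative assumptions, not measured figures for any particular deployment.

```python
def kv_bytes_per_token(layers: int, kv_heads: int, head_dim: int,
                       bytes_per_value: int = 2) -> int:
    """Bytes of KV cache one token occupies: one key and one value per layer."""
    return 2 * layers * kv_heads * head_dim * bytes_per_value


def estimate_gpu_memory_gb(params_billion: float, bytes_per_param: float,
                           layers: int, kv_heads: int, head_dim: int,
                           concurrent_seqs: int, context_tokens: int,
                           kv_bytes_per_value: int = 2,
                           overhead_fraction: float = 0.10) -> float:
    """Rough total in GB (10^9 bytes): weights + KV cache, padded by overhead.

    overhead_fraction is an assumed allowance for activations and runtime
    buffers; measure on the real runtime before committing hardware.
    """
    weights = params_billion * 1e9 * bytes_per_param
    kv_cache = concurrent_seqs * context_tokens * kv_bytes_per_token(
        layers, kv_heads, head_dim, kv_bytes_per_value)
    return (weights + kv_cache) * (1.0 + overhead_fraction) / 1e9


# Illustrative only: a 70B-parameter model in FP8 (1 byte per parameter),
# 100 concurrent sequences of 4,096 tokens each, FP16 KV cache.
needed_gb = estimate_gpu_memory_gb(70, 1.0, 80, 8, 128, 100, 4096)
print(f"Estimated GPU memory: {needed_gb:.0f} GB")
```

The estimate treats every concurrent sequence as holding a full context, which is pessimistic; paged attention allocates KV blocks on demand, so real usage is usually lower.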

3. Embedding & retrieval

BGE / E5 family embeddings (multilingual where the corpus needs it). Hybrid search: dense + BM25 + reranker. Vector store on disk (pgvector, Qdrant, Weaviate) — kept inside the network, no SaaS retrieval.
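
The hybrid step needs some way to merge the dense and BM25 result lists before reranking. The text above does not say which fusion method is used; reciprocal rank fusion (RRF) is one common choice and is shown here purely as an assumption. The function name and the constant k = 60 are illustrative.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Merge several ranked lists of document ids into one ranked list.

    Each document scores sum(1 / (k + rank)) over the lists it appears in,
    with rank starting at 1. Ties are broken by document id so the output
    is deterministic.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


dense_hits = ["a", "b", "c"]   # ids from the vector index, best first
bm25_hits = ["b", "d", "a"]    # ids from the keyword index, best first
fused = reciprocal_rank_fusion([dense_hits, bm25_hits])
print(fused)
```

In this arrangement the fused list would then be truncated and passed to the reranker, which rescores the top candidates against the query.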

4. Document pipeline

Tika / unstructured for parsing, chunking at semantic boundaries, dedup, source-of-record links. Permission tags propagate from your SSO / IAM — retrieval filters by ACL at query time so users never see what they shouldn't.
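
A minimal sketch of query-time ACL filtering, assuming each chunk carries the set of groups allowed to read its source document and that the user's groups arrive from SSO. The Chunk type and filter_by_acl function are hypothetical names for illustration. A production deployment would more likely push this filter into the vector store query than apply it in application code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Chunk:
    doc_id: str
    text: str
    allowed_groups: frozenset  # groups copied from the source document's ACL


def filter_by_acl(candidates, user_groups):
    """Keep only chunks that share at least one group with the user.

    A user with no groups sees nothing: the filter fails closed.
    """
    groups = frozenset(user_groups)
    return [c for c in candidates if c.allowed_groups & groups]


candidates = [
    Chunk("payroll.pdf", "Salary bands for 2024", frozenset({"finance"})),
    Chunk("handbook.pdf", "Leave policy", frozenset({"finance", "all-staff"})),
]
visible = filter_by_acl(candidates, user_groups={"all-staff"})
print([c.doc_id for c in visible])
```

Filtering before the chunks reach the prompt matters: once a restricted passage is in the model's context, the answer can leak it even if the citation is hidden.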

5. Application layer

Streaming chat with token-level SSE, citations in the response object, retrieval-traces logged for audit. Lightweight orchestration (LangGraph, LlamaIndex Workflows, or a thin in-house layer) — agentic only where the value justifies the complexity.
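
One way the streaming response could be framed, as a sketch: each token goes out as its own server-sent event, and a final event carries the citations. The event names ("token", "done") and the payload fields are assumptions for illustration; only the "event:" / "data:" / blank-line framing comes from the SSE format itself.

```python
import json
from typing import Iterable, Iterator


def sse_frame(event: str, payload: dict) -> str:
    """Format one server-sent event: event name, JSON data line, blank line."""
    return f"event: {event}\ndata: {json.dumps(payload)}\n\n"


def stream_answer(tokens: Iterable[str], citations: list) -> Iterator[str]:
    """Yield one SSE frame per token, then a closing frame with citations."""
    for token in tokens:
        yield sse_frame("token", {"text": token})
    yield sse_frame("done", {"citations": citations})


# Hypothetical citation shape: source document plus the chunk that was used.
frames = stream_answer(
    ["Annual ", "leave ", "is 14 days."],
    [{"doc_id": "handbook.pdf", "chunk": 3}],
)
for frame in frames:
    print(frame, end="")
```

Sending citations in a separate closing event keeps the token stream small while still letting the client render sources once generation finishes.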

6. Evaluation & observability

Golden set of 100–500 representative questions with curated answers. Nightly evaluation against retrieval recall + generation faithfulness. Prometheus + Grafana on the inference path; alert on first-token latency drift.
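
The retrieval half of the nightly evaluation can be sketched as recall@k over the golden set. This assumes each golden question is stored with the ids of the chunks a correct answer depends on; the field names ("question", "relevant_ids") and the retrieve callable are hypothetical. Generation faithfulness needs a separate judge and is not shown.

```python
def recall_at_k(retrieved, relevant, k):
    """Fraction of the relevant ids that appear in the top-k retrieved ids."""
    relevant_set = set(relevant)
    if not relevant_set:
        raise ValueError("each golden question needs at least one relevant id")
    return len(set(retrieved[:k]) & relevant_set) / len(relevant_set)


def mean_recall_at_k(golden_set, retrieve, k=5):
    """Average recall@k over the golden set.

    retrieve is any callable that maps a question string to a ranked list
    of chunk ids, for example a thin wrapper around the retrieval layer.
    """
    scores = [
        recall_at_k(retrieve(item["question"]), item["relevant_ids"], k)
        for item in golden_set
    ]
    return sum(scores) / len(scores)


# Stub retriever standing in for the real pipeline.
golden = [
    {"question": "How many leave days?", "relevant_ids": ["hb-3"]},
    {"question": "Who approves travel?", "relevant_ids": ["tr-1", "tr-2"]},
]
fake_index = {
    "How many leave days?": ["hb-3", "hb-9"],
    "Who approves travel?": ["tr-1", "hb-3"],
}
print(mean_recall_at_k(golden, fake_index.get, k=2))
```

Tracking this number night over night makes regressions visible when the corpus, chunking, or embedding model changes.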

Sizing

Match the box to the concurrency.

Concurrency is the dominant sizing axis, with model size second. Use this table as a starting point — refine with the full sizing guide.

Users (concurrent)	≤ 5	5–100	100–1,000	1,000+
Recommended system	DGX Spark	AI PRO 500	EMARQUE AI Server	Multi-node AI Server / DGX B200
GPU memory	128 GB unified	192–384 GB	1.1 TB+ HBM3e	Several TB across nodes
Network	10 GbE	10/25 GbE	25/100 GbE	InfiniBand HDR
Model size sweet spot	≤ 30B quant	30–70B	70B FP8	70B+ multi-tenant
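
The first two rows of the table can be read as a simple lookup. The sketch below encodes them as is, with one assumption: where two bands share a boundary (5, 100, or 1,000 users), the boundary value is assigned to the smaller system. The function and constant names are illustrative.

```python
# (upper bound on concurrent users, recommended system), smallest tier first.
SIZING_TIERS = [
    (5, "DGX Spark"),
    (100, "AI PRO 500"),
    (1000, "EMARQUE AI Server"),
]
TOP_TIER = "Multi-node AI Server / DGX B200"


def recommend_system(concurrent_users: int) -> str:
    """Return the starting-point system for a given concurrent user count."""
    if concurrent_users < 1:
        raise ValueError("concurrent_users must be at least 1")
    for upper_bound, system in SIZING_TIERS:
        if concurrent_users <= upper_bound:
            return system
    return TOP_TIER


for users in (3, 40, 600, 5000):
    print(users, "->", recommend_system(users))
```

This covers concurrency only; model size, the second sizing axis, can push a deployment up a tier even when the user count alone would not.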

Before you commit hardware

Pre-deployment checklist.

Source-of-record corpus identified — file paths, owners, refresh cadence
Permission model documented — ACL fields, SSO group structure
Latency budget agreed (first-token, full-response)
Evaluation golden set drafted with stakeholders
Air-gap / data residency requirements signed off
Network / firewall posture confirmed with IT

Deploying this?

We've set this stack up on the hardware we sell.

Architecture consult to map your corpus, concurrency, and compliance posture onto a sized system.

Talk to EMARQUE

Tell us about your workload.

Model size, concurrency, latency budget, deployment site. EMARQUE returns a quote in MYR within one Malaysian business day, sized to the workload — not the salesperson’s quota.

Key Account Manager: +6012 627 2280
Request for Quotation: business@emarque.co