EMARQUE.AI
Private RAG on-prem — reference architecture.

The stack EMARQUE specialists deploy most often when an enterprise wants to chat with its own documents, without sending a sentence to a cloud API. Permission-aware, citation-grounded, and evaluated before it ships.

The stack

Six layers, top to bottom.

1. Hardware substrate

AI PRO 500 for departmental deployments (≤ 100 concurrent users); EMARQUE AI Server with 4–8 × H200 NVL or B200 for org-wide deployments. ECC memory and a U.2 NVMe pool for vector index residency.

2. Inference runtime

vLLM (preferred for production), Ollama, or NVIDIA Triton, with pinned versions, NUMA-aware tensor parallelism, and paged attention enabled. An open-weight model (Llama 3.x, GPT-OSS, DeepSeek) is sized so that weights plus KV cache fit in GPU memory at production concurrency.
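A back-of-envelope check of whether a model fits GPU memory at a given concurrency can be sketched as follows. This is a rough estimate under stated assumptions, not a sizing tool: it counts only weights and KV cache and ignores activations, runtime overhead, and fragmentation. The shape values for Llama 3 70B (80 layers, 8 KV heads, head dimension 128) come from the published model config and the parameter count is rounded to 70B. The 100 concurrent sequences and 8,192 tokens per sequence are illustrative assumptions, and all function and class names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelShape:
    """Architecture numbers needed for a rough memory estimate."""
    params_billion: float  # total parameters, in billions (rounded)
    layers: int            # number of transformer layers
    kv_heads: int          # key/value heads (fewer than query heads under GQA)
    head_dim: int          # dimension of each attention head


def weights_gb(shape: ModelShape, bytes_per_param: float) -> float:
    """Memory for model weights in GB (1 GB = 1e9 bytes)."""
    return shape.params_billion * 1e9 * bytes_per_param / 1e9


def kv_cache_gb(shape: ModelShape, concurrent_seqs: int,
                tokens_per_seq: int, bytes_per_value: int = 2) -> float:
    """Memory for the KV cache in GB at full context for every sequence."""
    # Factor 2: one key vector and one value vector per layer and KV head.
    per_token_bytes = 2 * shape.layers * shape.kv_heads * shape.head_dim * bytes_per_value
    return per_token_bytes * concurrent_seqs * tokens_per_seq / 1e9


def total_gb(shape: ModelShape, bytes_per_param: float,
             concurrent_seqs: int, tokens_per_seq: int) -> float:
    """Weights plus KV cache; a lower bound on what the GPUs must hold."""
    return (weights_gb(shape, bytes_per_param)
            + kv_cache_gb(shape, concurrent_seqs, tokens_per_seq))


# Llama 3 70B shape; FP8 weights (1 byte/param), FP16 KV cache (2 bytes/value).
LLAMA3_70B = ModelShape(params_billion=70.0, layers=80, kv_heads=8, head_dim=128)

if __name__ == "__main__":
    need = total_gb(LLAMA3_70B, bytes_per_param=1.0,
                    concurrent_seqs=100, tokens_per_seq=8192)
    print(f"weights + KV cache: {need:.1f} GB")
```

In this sketch the KV cache term grows linearly with the number of concurrent sequences while the weights term stays fixed, so doubling the illustrative concurrency doubles the cache estimate.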

3. Embedding & retrieval

BGE / E5 family embeddings (multilingual where the corpus needs it). Hybrid search: dense + BM25 + reranker. Vector store on disk (pgvector, Qdrant, Weaviate) — kept inside the network, no SaaS retrieval.
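The page does not say how the dense and BM25 result lists are merged before reranking. One common option is reciprocal rank fusion (RRF), sketched below as an assumption rather than as EMARQUE's method. The chunk IDs and the constant `k = 60` (a conventional default for RRF) are illustrative.

```python
from collections import defaultdict
from typing import Dict, List, Sequence


def reciprocal_rank_fusion(rankings: Sequence[Sequence[str]], k: int = 60) -> List[str]:
    """Merge several ranked lists of chunk IDs into one list.

    Each list contributes 1 / (k + rank) to a chunk's score, with rank
    starting at 1. Chunks that rank well in several lists rise to the top.
    """
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    # Highest score first; ties broken by chunk ID for deterministic output.
    return sorted(scores, key=lambda cid: (-scores[cid], cid))


# Hypothetical top-3 results from each retriever.
dense_hits = ["c3", "c1", "c7"]
bm25_hits = ["c1", "c9", "c3"]

fused = reciprocal_rank_fusion([dense_hits, bm25_hits])
```

The fused list would then be truncated and passed to the reranker, which scores each query and chunk pair directly. RRF uses only ranks, so it avoids having to normalise BM25 scores against cosine similarities.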

4. Document pipeline

Apache Tika or the unstructured library for parsing, chunking at semantic boundaries, deduplication, and links back to the source of record. Permission tags propagate from your SSO / IAM, and retrieval filters by ACL at query time so users never see what they shouldn't.
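Query-time ACL filtering can be sketched as below, assuming each chunk carries the set of groups allowed to read its source document and the user's groups arrive from SSO. The `Chunk` fields and the `filter_by_acl` helper are hypothetical names for illustration. In a real deployment the same predicate would more likely be pushed into the vector store query as a metadata filter, so that the top-k results are not depleted after retrieval.

```python
from dataclasses import dataclass
from typing import FrozenSet, Iterable, List


@dataclass(frozen=True)
class Chunk:
    chunk_id: str
    text: str
    source_uri: str                 # link back to the source of record
    allowed_groups: FrozenSet[str]  # permission tags copied from the source document


def filter_by_acl(candidates: Iterable[Chunk], user_groups: Iterable[str]) -> List[Chunk]:
    """Keep only chunks that share at least one group with the user.

    Default-deny: a chunk with no allowed groups is visible to nobody.
    """
    groups = frozenset(user_groups)
    return [c for c in candidates if c.allowed_groups & groups]


# Hypothetical retrieval candidates.
candidates = [
    Chunk("c1", "Leave policy ...", "file:///hr/leave.pdf", frozenset({"hr", "all-staff"})),
    Chunk("c2", "Salary bands ...", "file:///hr/bands.xlsx", frozenset({"hr"})),
    Chunk("c3", "Untagged draft ...", "file:///tmp/draft.docx", frozenset()),
]

visible = filter_by_acl(candidates, user_groups=["all-staff", "engineering"])
```

Storing the tags on the chunk rather than looking them up per query keeps the filter cheap, at the cost of re-tagging chunks when a document's permissions change.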

5. Application layer

Streaming chat over token-level server-sent events (SSE), citations in the response object, and retrieval traces logged for audit. Lightweight orchestration (LangGraph, LlamaIndex Workflows, or a thin in-house layer), agentic only where the value justifies the complexity.
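A minimal sketch of the streaming format, assuming one SSE event per token and a final event that carries the citations. The event names `token` and `done` and the payload fields are hypothetical; only the `event:` / `data:` framing with a blank line after each event comes from the SSE format itself.

```python
import json
from typing import Dict, Iterable, Iterator, List


def sse_event(event: str, payload: Dict) -> str:
    """Format one server-sent event.

    json.dumps emits a single line (newlines inside strings are escaped),
    so the payload fits in one `data:` field. The blank line ends the event.
    """
    return f"event: {event}\ndata: {json.dumps(payload)}\n\n"


def stream_answer(tokens: Iterable[str], citations: List[Dict]) -> Iterator[str]:
    """Yield one event per generated token, then a final event with citations."""
    for tok in tokens:
        yield sse_event("token", {"text": tok})
    yield sse_event("done", {"citations": citations})


# Hypothetical model output and the chunks it was grounded on.
frames = list(stream_answer(
    ["Annual ", "leave ", "is 14 days."],
    [{"chunk_id": "c1", "source_uri": "file:///hr/leave.pdf"}],
))
```

A web framework would write each yielded string to a response with content type `text/event-stream`. Logging the same citation list alongside the query gives the retrieval trace for audit.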

6. Evaluation & observability

Golden set of 100–500 representative questions with curated answers. Nightly evaluation of retrieval recall and generation faithfulness. Prometheus + Grafana on the inference path; alert on first-token latency drift.
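The retrieval half of the nightly evaluation can be sketched as mean recall@k over the golden set. The golden-set record layout (`question`, `relevant_chunk_ids`) and the `retrieve` callable are assumptions for illustration. Generation faithfulness needs a separate judge and is not covered here.

```python
from typing import Callable, Dict, List, Sequence


def recall_at_k(retrieved: Sequence[str], relevant: Sequence[str], k: int) -> float:
    """Fraction of the relevant chunk IDs that appear in the top-k retrieved IDs."""
    relevant_set = set(relevant)
    if not relevant_set:
        raise ValueError("golden-set entry has no relevant chunks")
    return len(set(retrieved[:k]) & relevant_set) / len(relevant_set)


def mean_recall_at_k(golden_set: List[Dict],
                     retrieve: Callable[[str], Sequence[str]],
                     k: int = 5) -> float:
    """Average recall@k over every question in the golden set."""
    if not golden_set:
        raise ValueError("golden set is empty")
    scores = [
        recall_at_k(retrieve(item["question"]), item["relevant_chunk_ids"], k)
        for item in golden_set
    ]
    return sum(scores) / len(scores)


# Two hypothetical golden-set entries and a stand-in retriever.
golden = [
    {"question": "q1", "relevant_chunk_ids": ["a", "b"]},
    {"question": "q2", "relevant_chunk_ids": ["c"]},
]
fake_index = {"q1": ["a", "x", "y"], "q2": ["z", "c"]}

score = mean_recall_at_k(golden, lambda q: fake_index[q], k=2)
```

Running this nightly against the live retriever and comparing the score with the previous run flags regressions introduced by re-chunking, re-embedding, or index changes.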

Sizing

Match the box to the concurrency.

Concurrency is the dominant sizing axis, with model size second. Use this table as a starting point — refine with the full sizing guide.

Users (concurrent)    | ≤ 5            | 5–100      | 100–1,000         | 1,000+
Recommended system    | DGX Spark      | AI PRO 500 | EMARQUE AI Server | Multi-node AI Server / DGX B200
GPU memory            | 128 GB unified | 192–384 GB | 1.1 TB+ HBM3e     | Several TB across nodes
Network               | 10 GbE         | 10/25 GbE  | 25/100 GbE        | InfiniBand HDR
Model size sweet spot | ≤ 30B quant    | 30–70B     | 70B FP8           | 70B+ multi-tenant

Before you commit hardware

Pre-deployment checklist.

  • Source-of-record corpus identified — file paths, owners, refresh cadence
  • Permission model documented — ACL fields, SSO group structure
  • Latency budget agreed (first-token, full-response)
  • Evaluation golden set drafted with stakeholders
  • Air-gap / data residency requirements signed off
  • Network / firewall posture confirmed with IT

Deploying this?

We've set this stack up on the hardware we sell.

Architecture consult to map your corpus, concurrency, and compliance posture onto a sized system — usually within one business day of an initial brief.

Talk to EMARQUE

Tell us about your workload.

Model size, concurrency, latency budget, deployment site. EMARQUE returns a quote in MYR within one Malaysian business day, sized to the workload — not the salesperson’s quota.

  • Key Account Manager: +6012 627 2280
  • Request for Quotation: business@emarque.co