1. Hardware substrate
AI PRO 500 for departmental ( ≤ 100 concurrent users), EMARQUE AI Server with 4–8 × H200 NVL or B200 for org-wide. ECC memory, U.2 NVMe pool for vector index residency.
The stack EMARQUE specialists deploy most often when an enterprise wants to chat with its own documents — without sending a sentence to a cloud API. Permission-aware, citation- grounded, and evaluated before it ships.
AI PRO 500 for departmental ( ≤ 100 concurrent users), EMARQUE AI Server with 4–8 × H200 NVL or B200 for org-wide. ECC memory, U.2 NVMe pool for vector index residency.
vLLM (preferred for production), Ollama, or NVIDIA Triton — pinned versions, NUMA-aware tensor parallelism, paged-attention enabled. Open-weight model (Llama 3.x, GPT-OSS, DeepSeek) sized to fit GPU memory at production concurrency.
BGE / E5 family embeddings (multilingual where the corpus needs it). Hybrid search: dense + BM25 + reranker. Vector store on disk (pgvector, Qdrant, Weaviate) — kept inside the network, no SaaS retrieval.
Tika / unstructured for parsing, chunking at semantic boundaries, dedup, source-of-record links. Permission tags propagate from your SSO / IAM — retrieval filters by ACL at query time so users never see what they shouldn't.
Streaming chat with token-level SSE, citations in the response object, retrieval-traces logged for audit. Lightweight orchestration (LangGraph, LlamaIndex Workflows, or a thin in-house layer) — agentic only where the value justifies the complexity.
Golden set of 100–500 representative questions with curated answers. Nightly evaluation against retrieval recall + generation faithfulness. Prometheus + Grafana on the inference path; alert on first-token latency drift.
Concurrency is the dominant sizing axis, with model size second. Use this table as a starting point — refine with the full sizing guide.
| Users (concurrent) | ≤ 5 | 5–100 | 100–1,000 | 1,000+ |
| Recommended system | DGX Spark | AI PRO 500 | EMARQUE AI Server | Multi-node AI Server / DGX B200 |
| GPU memory | 128 GB unified | 192–384 GB | 1.1 TB+ HBM3e | Several TB across nodes |
| Network | 10 GbE | 10/25 GbE | 25/100 GbE | InfiniBand HDR |
| Model size sweet spot | ≤ 30B quant | 30–70B | 70B FP8 | 70B+ multi-tenant |
Architecture consult to map your corpus, concurrency, and compliance posture onto a sized system — usually within one business day of an initial brief.
Tell us about your workload. We reply within one business day with a quote sized to fit.