How to size an on-prem AI system.

Eight numbers our solutions architects ask for before quoting hardware. Bring these to the first call and we'll come back with a sized system. Skip them and we'll end up scoping over several conversations.

Target model size

Parameter count and quantization plan, together. A 70B model at Q4_K_M is ~40 GB on disk and roughly 50–60 GB resident on the GPU once KV cache and runtime overhead are added. A 70B model at FP16 is closer to 140 GB for the weights alone. A wrong assumption here breaks the whole sizing exercise.

What this drives

GPU memory, NVLink coherence, KV-cache headroom
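
The arithmetic behind those figures can be sketched in a few lines. This is a rough estimate rather than a sizing tool: the function name is ours, and the 4.8 bits-per-weight value for Q4_K_M is an assumed average (the format mixes block types), so check it against the actual model file.

```python
def weight_memory_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate size of the model weights in GB (1 GB = 1e9 bytes)."""
    # billions of parameters * bits per parameter / 8 bits per byte = GB
    return params_billion * bits_per_weight / 8


fp16_gb = weight_memory_gb(70, 16)  # 140 GB, weights only
q4_gb = weight_memory_gb(70, 4.8)   # about 42 GB; 4.8 bits/weight is an assumption
print(f"FP16: {fp16_gb:.0f} GB, Q4_K_M: {q4_gb:.0f} GB")
```

Resident GPU memory is higher than either number because KV cache, activations, and runtime buffers sit on top of the weights.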

Concurrent users / sessions

Peak concurrent active sessions, not registered users. A 70B inference engine that comfortably serves 10 concurrent users at sub-second first-token latency may queue badly at 100. Tell us the realistic peak, not the marketing number.

What this drives

GPU count, parallelism strategy, batch scheduler choice
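
To see why the peak figure matters, here is a minimal sketch that turns peak sessions into a count of model replicas. Every number in it is hypothetical: the per-session token rate and the per-replica throughput are placeholders that have to come from a benchmark of the actual model and GPU, and `replicas_needed` is an illustrative helper, not part of any tool.

```python
import math


def replicas_needed(peak_sessions: int, tokens_per_s_per_session: float,
                    replica_tokens_per_s: float) -> int:
    """Model replicas needed to sustain aggregate decode demand at peak."""
    demand = peak_sessions * tokens_per_s_per_session
    return max(1, math.ceil(demand / replica_tokens_per_s))


# Hypothetical figures: 20 tokens/s per session, 1,500 tokens/s per replica.
for sessions in (10, 100):
    print(sessions, "sessions ->", replicas_needed(sessions, 20, 1500), "replica(s)")
```

This only covers aggregate throughput. It says nothing about first-token latency or queueing under bursty load, which is why the realistic peak still needs a conversation.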

Latency budget

First-token-out and full-response-out, separately. A chat UI feels broken if first-token latency exceeds 1 s; an agent loop can tolerate 3–5 s if it's predictable. Define both, and define them per workload: RAG, agent, batch.

What this drives

GPU type (H200 vs B200/B300), tensor-parallel size, fabric

Context window

Average and tail. An 8K average is a very different sizing problem from a 200K tail, because KV cache grows with context length and dominates GPU memory at scale. Long-context workloads are where Blackwell Ultra (B300) starts paying off over plain Blackwell.

What this drives

GPU memory per-card, paged-attention support, batch size
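
A back-of-envelope sketch of how KV cache scales with context length. The default layout (80 layers, 8 KV heads, head dimension 128, 16-bit cache) is an assumption modelled on a Llama-style 70B with grouped-query attention; substitute the values from your own model's config.

```python
def kv_cache_gb(tokens: int, n_layers: int = 80, n_kv_heads: int = 8,
                head_dim: int = 128, bytes_per_value: int = 2) -> float:
    """Approximate KV-cache size in GB for one sequence of `tokens` tokens."""
    # 2x for keys and values, stored per layer and per KV head.
    bytes_per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_value
    return tokens * bytes_per_token / 1e9


for label, tokens in [("8K average", 8_192), ("200K tail", 200_000)]:
    print(f"{label}: {kv_cache_gb(tokens):.1f} GB per session")
```

Under these assumptions a single 8K session needs about 2.7 GB of cache and a single 200K session about 65.5 GB, before multiplying by concurrent sessions.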

Deployment site

Office, server room, colo, data centre, edge, or air-gapped. The site determines the power envelope, cooling (air vs liquid), acoustics, networking, and which system class you can physically install.

What this drives

Form factor (pedestal / 4U / 10U / liquid rack), power, cooling

Compliance / data residency

PDPA, GDPR, ISO 27001, sovereign-compute, classified, air-gapped, sector-specific (BNM, healthcare, public sector). Affects whether the system can connect to anything external, what audit trails you need, and what paperwork ships with it.

What this drives

Software hardening, audit logging, firewall posture, paperwork

Timeline

Shelf stock vs allocation. Hopper (H100/H200) is volume; Blackwell B200 is ramping; B300, GB300 NVL72, and Vera Rubin NVL72 are allocation-driven. Tell us the timeline before falling in love with a specific GPU.

What this drives

GPU choice, OEM variant, NVIDIA allocation queue

Operational ownership

Who runs the box after delivery? An internal HPC team can self-operate a rack-scale fabric; most enterprise IT teams want NVIDIA Mission Control and Enterprise Support to do that for them. Drives EMARQUE-built vs DGX choice.

What this drives

EMARQUE-built vs NVIDIA DGX, Care Plan tier, runbook

Three common sizing mistakes

What we see go wrong before the call.

Buying by GPU spec sheet

Picking the GPU first, then trying to find a chassis that fits. The deployment site usually constrains the choice more than the GPU does — start with the room, not the silicon.

Underestimating concurrency growth

Sizing for today's users instead of the realistic 12-month projection. The cheapest hardware is the one you don't have to replace in 18 months.
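
A 12-month projection can be as simple as compounding today's peak. The sketch below assumes steady month-over-month growth, which real adoption rarely follows exactly; the 10% rate and the starting figure are hypothetical.

```python
import math


def projected_peak_sessions(current_peak: int, monthly_growth: float,
                            months: int = 12) -> int:
    """Compound the current peak forward and round up to whole sessions."""
    return math.ceil(current_peak * (1 + monthly_growth) ** months)


# Hypothetical: 10 peak sessions today, growing 10% per month.
print(projected_peak_sessions(10, 0.10))  # 32
```

Roughly tripling in a year is enough to move a deployment from one sizing tier to the next, which is the case for quoting against the projected number.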

Ignoring the operational team

Buying a DGX SuperPOD building block when the team can't operate Mission Control — or buying EMARQUE-built when procurement actually requires NVIDIA-branded. Both happen.

Got most of these?

Drop them into the configurator — EMARQUE quotes through your Key Account Manager.

Talk to EMARQUE

Tell us about your workload.

Model size, concurrency, latency budget, deployment site. EMARQUE returns a quote in MYR within one Malaysian business day, sized to the workload — not the salesperson’s quota.

Key Account Manager: +6012 627 2280
Request for Quotation: business@emarque.co