EMARQUE.AI

How to size an on-prem AI system.

Eight numbers our solutions architects ask for before quoting hardware. Bring these to the first call and we'll come back with a sized system inside one business day. Skip them and we'll end up scoping over three.

1

Target model size

Parameter count and quantization plan together. A 70B model at Q4_K_M is ~40 GB on disk and roughly 50–60 GB resident on the GPU once KV cache and runtime overhead are included. The same 70B model at FP16 is closer to 140 GB for the weights alone. A wrong assumption here breaks the whole sizing exercise.

What this drives

GPU memory, NVLink coherence, KV-cache headroom
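
As a rough cross-check on the figures above, weight memory is approximately parameter count times bits per weight. The sketch below is illustrative only: the function name is ours, and the 4.8 bits per weight used for Q4_K_M is an assumed average, not a vendor figure. It covers weights only, so the resident footprint on the GPU will be higher.

```python
# Back-of-envelope weight-memory estimate: parameters x bits per weight.
# ASSUMPTION: 4.8 bits/weight is an illustrative average for Q4_K_M;
# the real figure varies by model and quantization tool.

def weight_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate weight size in GB (1 GB = 1e9 bytes), weights only."""
    total_bytes = params_billion * 1e9 * bits_per_weight / 8
    return total_bytes / 1e9

fp16_gb = weight_gb(70, 16)    # 140.0
q4_gb = weight_gb(70, 4.8)     # about 42

print(f"70B at FP16:   {fp16_gb:.0f} GB")
print(f"70B at Q4_K_M: {q4_gb:.0f} GB (assumed 4.8 bits/weight)")
```

Treat the result as a floor. KV cache, activations, and runtime buffers sit on top of it, which is why the resident figure quoted above is larger than the on-disk size.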

2

Concurrent users / sessions

Peak concurrent active sessions, not registered users. An inference engine that comfortably serves a 70B model to 10 concurrent users at sub-second first-token latency may queue badly at 100. Tell us the realistic peak, not the marketing number.

What this drives

GPU count, parallelism strategy, batch scheduler choice

3

Latency budget

First-token-out and full-response-out, separately. A chat UI feels broken if the first token takes longer than 1 s; an agent loop can tolerate 3–5 s if it's predictable. Define both, and define them per workload: RAG, agent, batch.

What this drives

GPU type (H200 vs B200/B300), tensor-parallel size, fabric
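
One way to bring the latency budget to the call is as a small table with both numbers per workload. The sketch below is a hypothetical shape for that table, not a format we require. Only the 1 s chat first-token figure comes from the text above; every other number is a placeholder to replace with your own.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class LatencyBudget:
    first_token_s: float    # time until the first token appears
    full_response_s: float  # time until the response is complete

    def met_by(self, first_token_s: float, full_response_s: float) -> bool:
        """True if a measured request stays inside both limits."""
        return (first_token_s <= self.first_token_s
                and full_response_s <= self.full_response_s)

# PLACEHOLDER values, except the 1 s chat first-token limit.
budgets = {
    "chat": LatencyBudget(first_token_s=1.0, full_response_s=10.0),
    "agent": LatencyBudget(first_token_s=5.0, full_response_s=30.0),
}

# Check one measured request against the chat budget.
print(budgets["chat"].met_by(first_token_s=0.8, full_response_s=9.0))
```

Keeping the two limits separate matters because a system can meet one and miss the other: a large batch may deliver a fast full response but a slow first token.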

4

Context window

Average and tail. An 8K-token average is very different from a 200K-token tail, because the KV cache grows linearly with context length and comes to dominate GPU memory at scale. Long-context workloads are where Blackwell Ultra (B300) starts paying off over plain Blackwell.

What this drives

GPU memory per-card, paged-attention support, batch size
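
To see why the tail matters, here is a minimal sketch of the standard KV-cache estimate for a transformer: two tensors (keys and values) per layer, per KV head, per token. The model shape used (80 layers, 8 KV heads, head dimension 128, FP16 cache) is an assumed Llama-style 70B configuration for illustration; substitute the values from your own model's config.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ModelShape:
    layers: int
    kv_heads: int
    head_dim: int
    bytes_per_elem: int = 2  # ASSUMPTION: FP16/BF16 KV cache

def kv_cache_bytes(shape: ModelShape, tokens: int, sessions: int = 1) -> int:
    """KV-cache size in bytes; the leading 2 covers keys and values."""
    per_token = (2 * shape.layers * shape.kv_heads
                 * shape.head_dim * shape.bytes_per_elem)
    return per_token * tokens * sessions

# ASSUMED shape for a Llama-style 70B model with grouped-query attention.
shape = ModelShape(layers=80, kv_heads=8, head_dim=128)

per_token = kv_cache_bytes(shape, 1)            # 327680 bytes (320 KiB)
avg_gb = kv_cache_bytes(shape, 8_000) / 1e9     # about 2.6 GB per session
tail_gb = kv_cache_bytes(shape, 200_000) / 1e9  # about 65.5 GB per session

print(f"8K context:   {avg_gb:.1f} GB per session")
print(f"200K context: {tail_gb:.1f} GB per session")
```

Under these assumptions a single 200K-token session needs more cache than the quantized weights themselves, and the `sessions` argument multiplies that by concurrency. This is the interaction between context window and concurrent users that the sizing has to absorb.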

5

Deployment site

Office, server room, colo, data centre, edge, or air-gapped. The site determines power envelope, cooling (air vs liquid), acoustics, networking, and which system class you can physically install.

What this drives

Form factor (pedestal / 4U / 10U / liquid rack), power, cooling

6

Compliance / data residency

PDPA, GDPR, ISO 27001, sovereign-compute, classified, air-gapped, sector-specific (BNM, healthcare, public sector). Affects whether the system can connect to anything external, what audit trails you need, and what paperwork ships with it.

What this drives

Software hardening, audit logging, firewall posture, paperwork

7

Timeline

Shelf stock vs allocation. Hopper (H100/H200) is volume; Blackwell B200 is ramping; B300 and GB300 NVL72 are allocation-driven; Rubin is roadmap. Tell us the timeline before falling in love with a specific GPU.

What this drives

GPU choice, OEM variant, NVIDIA allocation queue

8

Operational ownership

Who runs the box after delivery? An internal HPC team can self-operate a rack-scale fabric; most enterprise IT teams want NVIDIA Mission Control and Enterprise Support to do that for them. Drives EMARQUE-built vs DGX choice.

What this drives

EMARQUE-built vs NVIDIA DGX, Care Plan tier, runbook

Three common sizing mistakes

What we see go wrong before the call.

Buying by GPU spec sheet

Picking the GPU first, then trying to find a chassis that fits. The deployment site usually constrains the choice more than the GPU does — start with the room, not the silicon.

Underestimating concurrency growth

Sizing for today's users instead of the realistic 12-month projection. The cheapest hardware is the one you don't have to replace in 18 months.

Ignoring the operational team

Buying a DGX SuperPOD building block when the team can't operate Mission Control — or buying EMARQUE-built when procurement actually requires NVIDIA-branded. Both happen.

Got most of these?

Drop them into the configurator — we'll quote inside a business day.

Talk to EMARQUE

Tell us about your workload.

Model size, concurrency, latency budget, deployment site. EMARQUE returns a quote in MYR within one Malaysian business day, sized to the workload — not the salesperson’s quota.

Key Account Manager: +6012 627 2280

Request for Quotation: business@emarque.co