1
Target model size
Parameter count and quantization plan, together. A 70B model at Q4_K_M is ~40 GB on disk and roughly 50–60 GB resident on the GPU. The same model at FP16 is closer to 140 GB for the weights alone. A wrong assumption here breaks the whole sizing exercise.
What this drives
GPU memory, NVLink coherence, KV-cache headroom
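The weight figures above follow from simple arithmetic: parameter count times bits per weight. The sketch below is a rough estimate only. The function name is hypothetical, and the ~4.8 effective bits per weight used for Q4_K_M is an assumption (K-quant formats mix block sizes and scales, so the real value varies by model). It covers weights only; the resident GPU figure is higher because KV cache, activations, and runtime buffers sit on top.

```python
def weight_footprint_gb(params: float, bits_per_weight: float) -> float:
    """Approximate weight storage in decimal GB (1 GB = 1e9 bytes)."""
    return params * bits_per_weight / 8 / 1e9


PARAMS_70B = 70e9

# FP16 stores every weight in 16 bits.
fp16_gb = weight_footprint_gb(PARAMS_70B, 16.0)

# ASSUMPTION: ~4.8 effective bits per weight for Q4_K_M.
q4_gb = weight_footprint_gb(PARAMS_70B, 4.8)

print(f"FP16 weights:   ~{fp16_gb:.0f} GB")
print(f"Q4_K_M weights: ~{q4_gb:.0f} GB")
```

Under these assumptions the FP16 estimate matches the 140 GB figure, and the Q4_K_M estimate lands near the ~40 GB on-disk figure.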
2
Concurrent users / sessions
Peak concurrent active sessions, not registered users. An inference engine that comfortably serves a 70B model to 10 concurrent users at sub-second first-token latency may queue badly at 100. Tell us the realistic peak, not the marketing number.
What this drives
GPU count, parallelism strategy, batch scheduler choice
3
Latency budget
First-token-out and full-response-out, separately. A chat UI feels broken if the first token takes longer than 1 s; an agent loop can tolerate 3–5 s if it's predictable. Define both, and define them per workload: RAG, agent, batch.
What this drives
GPU type (H200 vs B200/B300), tensor-parallel size, fabric
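One way to write a latency budget down is as a small per-workload table with the two numbers kept separate. The sketch below is illustrative only: the class, dictionary, and function names are hypothetical. The 1 s chat threshold comes from the text above; the 5 s agent value takes the upper end of the 3–5 s range and treats it as a first-token bound, which is an assumption. Full-response budgets are left undefined (`None`) because they depend on the workload.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class LatencyBudget:
    first_token_s: float              # time to first token, in seconds
    full_response_s: Optional[float]  # None means "not yet defined"


# ASSUMPTION: illustrative values, to be replaced with the real budgets.
BUDGETS = {
    "chat": LatencyBudget(first_token_s=1.0, full_response_s=None),
    "agent": LatencyBudget(first_token_s=5.0, full_response_s=None),
}


def within_budget(workload: str, first_token_s: float,
                  full_response_s: float) -> bool:
    """Check one measured request against the budget for its workload."""
    budget = BUDGETS[workload]
    if first_token_s > budget.first_token_s:
        return False
    if (budget.full_response_s is not None
            and full_response_s > budget.full_response_s):
        return False
    return True
```

Keeping the two numbers in separate fields makes it hard to quote a single "latency" figure that hides which one is meant.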
4
Context window
Average and tail. An 8K average is very different from a 200K tail, because the KV cache grows linearly with context length and dominates GPU memory at scale. Long-context workloads are where Blackwell Ultra (B300) starts paying off over plain Blackwell.
What this drives
Per-card GPU memory, paged-attention support, batch size
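The gap between average and tail context can be made concrete with the standard KV-cache size estimate: two tensors (keys and values) per layer, per KV head, per token. The sketch below is a rough illustration, not a sizing tool. The function name is hypothetical, and the model shape (80 layers, 8 KV heads with grouped-query attention, head dimension 128, FP16 cache) is an assumed Llama-style 70B configuration; substitute the real values for the target model.

```python
def kv_cache_bytes(tokens: int, n_layers: int, n_kv_heads: int,
                   head_dim: int, bytes_per_elem: int = 2) -> int:
    """KV-cache size in bytes for one sequence of `tokens` tokens."""
    # The leading 2 counts one key tensor plus one value tensor per layer.
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem * tokens


# ASSUMPTION: Llama-style 70B shape with grouped-query attention, FP16 cache.
SHAPE = dict(n_layers=80, n_kv_heads=8, head_dim=128, bytes_per_elem=2)

avg_ctx = kv_cache_bytes(8_192, **SHAPE)     # 8K-token session
tail_ctx = kv_cache_bytes(200_000, **SHAPE)  # 200K-token session

print(f"8K context:   {avg_ctx / 1e9:.1f} GB per session")
print(f"200K context: {tail_ctx / 1e9:.1f} GB per session")
```

Under these assumptions a single 200K-token session needs roughly 65 GB of KV cache, against under 3 GB at 8K, which is why the tail, not the average, sets per-card memory and achievable batch size.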
5
Deployment site
Office, server room, colo, data centre, edge, or air-gapped. Determines power envelope, cooling (air vs liquid), acoustics, networking, and what kind of system class you can physically install.
What this drives
Form factor (pedestal / 4U / 10U / liquid rack), power, cooling
6
Compliance / data residency
PDPA, GDPR, ISO 27001, sovereign-compute, classified, air-gapped, sector-specific (BNM, healthcare, public sector). Affects whether the system can connect to anything external, what audit trails you need, and what paperwork ships with it.
What this drives
Software hardening, audit logging, firewall posture, paperwork
7
Timeline
Shelf stock vs allocation. Hopper (H100/H200) is volume; Blackwell B200 is ramping; B300 and GB300 NVL72 are allocation-driven; Rubin is roadmap. Tell us the timeline before falling in love with a specific GPU.
What this drives
GPU choice, OEM variant, NVIDIA allocation queue
8
Operational ownership
Who runs the box after delivery? An internal HPC team can self-operate a rack-scale fabric; most enterprise IT teams want NVIDIA Mission Control and Enterprise Support to do that for them. Drives EMARQUE-built vs DGX choice.
What this drives
EMARQUE-built vs NVIDIA DGX, Care Plan tier, runbook