Run:ai on AWS — webinar notes (inference & autoscaling)

Conference

Sep 6, 2022

I joined a Run:ai webinar (Americas) about running and autoscaling ML inference on AWS. At the time I was deep in coursework and side projects that touched GPUs and containers, but production inference was still mostly “someone SSH’d to a box.” The session was useful because it framed inference as a scheduling and capacity problem, not only a model-export problem.

Why I watched

Training gets the glamour; inference pays the bills every hour the endpoint stays up. On AWS that usually means EKS (or similar), GPU node groups, HPA/KEDA-style scaling, and a finance team asking why utilization is 12% while latency spikes. Run:ai pitches itself as the control plane on top of Kubernetes for shared GPU clusters: fair queues, visibility into who consumed what, and policies so inference workloads do not get starved by batch training jobs—or the reverse.

I did not implement Run:ai afterward; these notes are what I kept for later platform work (quotas, observability, chargeback language). If you are comparing stacks today, also see vector search / RAG notes for the application side and Snowflake Data-for-Breakfast for another 2022 conference snapshot.

What Run:ai is optimizing (one paragraph)

The recurring theme: teams buy expensive GPUs, then run Kubernetes like generic CPU clusters and wonder why inference latency jitters or nodes sit idle. Run:ai adds a layer for workload placement, prioritization, and reporting on GPU-backed jobs—training, fine-tuning, and inference services in the same fleet. On AWS, that implies integration with how you already provision nodes (instance types, autoscaling groups, spot vs on-demand) and how you expose models (containers, possibly multi-replica deployments).

Dashboard — jobs and utilization

The UI slides were the “ops brain”: what is running, what is queued, and whether GPUs are actually busy or just reserved.

Job list / status — inference replicas vs batch jobs show up as first-class objects, not only anonymous pods.
Utilization — the gap between allocated and used GPU time is where cost leaks; the dashboard is meant to make that visible to platform owners, not only to the ML engineer who submitted the job.
Second view — alternate layout emphasized cluster-level vs project/team-level slices (useful when arguing for quotas).
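
To make the allocated-vs-used gap concrete, here is a minimal Python sketch of the arithmetic. It is not Run:ai's reporting logic: the GpuSample record, the team names, and the flat rate of 1.0 USD per GPU-hour are all hypothetical, chosen only so the numbers are easy to follow.

```python
from dataclasses import dataclass


@dataclass
class GpuSample:
    """One hypothetical reporting interval for a single workload."""
    team: str
    allocated_gpu_hours: float  # GPUs reserved x hours reserved
    used_gpu_hours: float       # GPU-hours with real activity


def utilization_report(samples, hourly_rate_usd):
    """Return per-team utilization and the cost of reserved-but-idle time."""
    totals = {}
    for s in samples:
        alloc, used = totals.get(s.team, (0.0, 0.0))
        totals[s.team] = (alloc + s.allocated_gpu_hours, used + s.used_gpu_hours)

    report = {}
    for team, (alloc, used) in totals.items():
        idle = max(alloc - used, 0.0)
        report[team] = {
            "utilization": used / alloc if alloc else 0.0,
            "idle_gpu_hours": idle,
            "idle_cost_usd": idle * hourly_rate_usd,
        }
    return report


# Made-up numbers: one team at 12% utilization, one at 80%.
samples = [
    GpuSample("vision", allocated_gpu_hours=100.0, used_gpu_hours=12.0),
    GpuSample("nlp", allocated_gpu_hours=50.0, used_gpu_hours=40.0),
]
report = utilization_report(samples, hourly_rate_usd=1.0)
```

In this toy data the "vision" team holds 88 idle GPU-hours. That is the kind of number a platform owner needs per team, not only per pod.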

Run:ai dashboard — jobs and resources

Run:ai dashboard — alternate view

Takeaway for me: if I cannot answer “who is holding the GPU right now?” in under a minute, I do not have a production inference platform—I have a shared SSH host with extra steps.

CLI — automation and GitOps-shaped workflows

The CLI section mirrored the UI objects: submit workloads, inspect queues, integrate with pipelines. That matters because inference autoscaling is rarely just Horizontal Pod Autoscaler math; you also want versioned deploys, canary replicas, and teardown when a model is retired.

Typical patterns the slides pointed at:

CI builds a container → CLI (or API) registers an inference workload with CPU/GPU/memory bounds.
Autoscaling rules tie to queue depth, latency SLOs, or schedule (scale down nights/weekends).
Platform team keeps cluster policies in repo; researchers keep model artifacts in their own repos.
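
The second pattern above (queue depth, latency SLO, schedule) can be sketched as one pure function. This is my own illustration, not the Run:ai CLI or API. Every threshold here is an assumption: 10 queued requests per replica, a 200 ms p95 SLO, and off-hours defined as weekends plus 20:00 to 07:00.

```python
import math
from datetime import datetime


def desired_replicas(queue_depth, p95_latency_ms, now, *,
                     current, per_replica_queue=10, slo_ms=200.0,
                     min_replicas=1, max_replicas=8, off_hours_max=2):
    """Pick a replica count from queue depth, latency SLO, and schedule."""
    # Signal 1: enough replicas to keep the per-replica backlog bounded.
    by_queue = math.ceil(queue_depth / per_replica_queue)

    # Signal 2: add one replica when the latency SLO is being missed.
    by_latency = current + 1 if p95_latency_ms > slo_ms else min_replicas

    target = max(by_queue, by_latency, min_replicas)

    # Signal 3: lower the ceiling on weekends and at night.
    off_hours = now.weekday() >= 5 or now.hour < 7 or now.hour >= 20
    ceiling = min(max_replicas, off_hours_max) if off_hours else max_replicas
    return min(target, ceiling)


# Tuesday noon, backlog of 35 requests, latency within SLO.
weekday_target = desired_replicas(35, 150.0, datetime(2022, 9, 6, 12, 0), current=2)

# Tuesday noon, empty queue, but p95 latency above the SLO.
slo_target = desired_replicas(0, 300.0, datetime(2022, 9, 6, 12, 0), current=3)

# Saturday noon, large backlog, capped by the off-hours ceiling.
weekend_target = desired_replicas(100, 150.0, datetime(2022, 9, 10, 12, 0), current=2)
```

Keeping the decision in a pure function makes it easy to unit test and to keep in the platform repo next to the cluster policies, whatever actually applies the replica count.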

Run:ai CLI

Models — packaging and multi-instance scaling

Model-centric UI: one logical service, N replicas behind it. For inference, that is the difference between “one pod on a g5” and “a service that survives node loss.”

The multi-instance slide covered horizontal scaling as request rate grows: duplicate the inference workers, spread them across nodes, and keep scheduling aware of GPU memory headroom (large LLM weights vs smaller CV models).
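
A rough sketch of memory-aware replica placement, under my own assumptions rather than anything shown in the webinar: each replica needs its weights plus a fixed headroom, each GPU is identified by a made-up id like "node-a/gpu0", and new replicas go to whichever GPU currently has the most free memory.

```python
def place_replicas(free_gib_by_gpu, weights_gib, replicas, headroom_gib=2.0):
    """Assign replicas to GPUs one at a time, preferring the emptiest GPU.

    free_gib_by_gpu maps a GPU id to its free memory in GiB.
    Returns (placements, unplaced_count).
    """
    free = dict(free_gib_by_gpu)  # copy so the caller's view is untouched
    need = weights_gib + headroom_gib
    placements = []
    for _ in range(replicas):
        gpu = max(free, key=free.get, default=None)
        if gpu is None or free[gpu] < need:
            break  # no single GPU can hold another replica
        free[gpu] -= need
        placements.append(gpu)
    return placements, replicas - len(placements)


# Three GPUs with 16 GiB free each; each replica needs 10 + 2 = 12 GiB.
free = {"node-a/gpu0": 16.0, "node-b/gpu0": 16.0, "node-c/gpu0": 16.0}
placements, unplaced = place_replicas(free, weights_gib=10.0, replicas=4)
```

The example has 48 GiB free in total, which looks like room for four 12 GiB replicas, yet only three fit because the leftover memory is split into 4 GiB fragments. That is why "replicas help only if schedulable" keeps coming up.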

Model view

Multi-instance scaling

Questions I wrote in the margin:

How long is cold start when a new replica loads weights from S3 or EFS?
Is the maximum replica count capped by GPU fragmentation (many small jobs vs one fat job)?
Does scale-down wait for in-flight requests or hard-kill them (SLO vs cost)?

Workloads — queues, caps, and fairness

Managing workloads covered policies: per-team caps, priorities, preemption rules, and possibly time windows. That is the social contract when 10 teams share 8 GPUs.

Inference often wants steady baseline + burst; training wants long blocks. Without policy, training wins on duration and inference wins on escalation to leadership.
Good platforms expose queue position and estimated start—not only failure after six hours pending.
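
One textbook way to reason about "many teams, few GPUs" is max-min fairness. The sketch below is that generic algorithm, not Run:ai's scheduler, and the team names and demands are invented. It hands out fractional GPUs, which would need rounding unless the platform can share cards.

```python
def max_min_fair_share(demands, capacity):
    """Split `capacity` GPUs across teams by max-min fairness (water-filling)."""
    allocation = {team: 0.0 for team in demands}
    unsatisfied = {team for team, d in demands.items() if d > 0}
    remaining = float(capacity)

    while unsatisfied and remaining > 1e-9:
        # Offer every unsatisfied team an equal slice of what is left.
        share = remaining / len(unsatisfied)
        for team in sorted(unsatisfied):
            grant = min(share, demands[team] - allocation[team])
            allocation[team] += grant
            remaining -= grant
        # Teams that got their full demand drop out; their unused slice
        # is redistributed in the next round.
        unsatisfied = {t for t in unsatisfied
                       if demands[t] - allocation[t] > 1e-9}
    return allocation


# Hypothetical demands: two small inference teams and one large training team.
demands = {"search": 1, "ads": 2, "research": 10}
allocation = max_min_fair_share(demands, capacity=8)
```

With 8 GPUs, the two small teams get everything they asked for and the large team gets the remaining 5. Small inference baselines are protected without any escalation to leadership.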

Managing workloads

Infrastructure — servers and what you pay for

Servers / cluster view mapped Kubernetes nodes to instance types, GPU counts, and health. On AWS this is where you reconcile:

Node group design (single-GPU g5.xlarge vs instances with larger or more GPUs, mixed instance types).
Cluster autoscaler adding nodes vs Run:ai scheduling filling existing capacity first.
Spot for fault-tolerant batch vs on-demand for latency-sensitive inference.

Server / cluster view

Demo and challenges (the honest slide)

The demo tied the story together: deploy or scale an inference workload, watch the dashboard move, show CLI parity. The challenges slide was the part worth keeping:

Challenge	Why it hurts
Low GPU utilization	Paying for idle cards while teams queue
Noisy neighbors	Training spikes starve inference SLOs
Observability gap	Prometheus shows pods, not “model v3 latency”
Autoscaling lag	New GPU nodes slow; replicas help only if schedulable
Cost attribution	Finance asks per product; labels and quotas must exist early
MLOps glue	Model registry, monitoring, and scheduler are three different tools

Demo

Challenges

What I would do differently after this session

Separate inference node pools (or taints) before mixing with training—cheaper than heroic scheduling later.
Define SLOs first (p95 latency, max queue time), then pick autoscaler signals—not the other way around.
Chargeback labels on day one (team, model, env) so the dashboard conversation with finance is possible.
Treat GPU sharing as a product decision: some teams need dedicated cards; others can share with quotas.
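
"Define SLOs first" only works if latency is measured per model version rather than per pod. Below is a minimal sketch of that check, assuming request logs already carry hypothetical model, version, and latency_ms fields. It uses a nearest-rank p95, and the 150 ms SLO is a made-up number.

```python
from collections import defaultdict


def p95(values):
    """Nearest-rank 95th percentile of a non-empty list."""
    ordered = sorted(values)
    rank = (95 * len(ordered) + 99) // 100  # ceil(0.95 * n) in integer math
    return ordered[rank - 1]


def slo_report(requests, slo_ms):
    """Group latencies by (model, version) and flag p95 values above slo_ms."""
    by_model = defaultdict(list)
    for r in requests:
        by_model[(r["model"], r["version"])].append(r["latency_ms"])

    report = {}
    for key, latencies in by_model.items():
        value = p95(latencies)
        report[key] = {"p95_ms": value, "breach": value > slo_ms}
    return report


# Invented request log: v2 has a slow tail, v3 does not.
requests = [{"model": "ranker", "version": "v2", "latency_ms": 10 * i}
            for i in range(1, 21)]
requests += [{"model": "ranker", "version": "v3", "latency_ms": ms}
             for ms in (50, 60, 70)]
report = slo_report(requests, slo_ms=150)
```

Once a report like this exists, choosing an autoscaler signal becomes a question of which metric predicts a breach earliest, instead of a guess.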

Related notes

Snowflake Data-for-Breakfast: another 2022 data/infra morning session
AWS Cloud Practitioner journey: vocabulary for the AWS side
Vector databases / movie embeddings: ML systems teaching thread

References

Run:ai documentation (product evolves; check current NVIDIA integration notes)
AWS Marketplace — Run:ai (search “Run:ai” for listing region and install path)
Kubernetes GPU scheduling basics — useful contrast with a dedicated scheduler layer

Conference notes from September 2022; product names and AWS paths may have changed since.