
QOSCPU

Intelligent Quality-of-Service management for Ollama LLM inference workloads, purpose-built for the NVIDIA GB10 Superchip.

20 CPU Cores · NVIDIA GPU · Ollama LLM · QoS Rules · Real-time Billing

Built for the NVIDIA GB10

The GB10 Superchip combines ARM CPU cores with an integrated NVIDIA GPU, making it the ideal edge inference platform.

  • 10+10 E-cores + P-cores
  • 128 GPU SMs
  • 20 W GPU TDP
  • 0.128 kg CO2/kWh (Swiss grid)
Component               | Specification            | QOSCPU Integration
Efficiency Cores (x10)  | ARM Cortex, 2808 MHz     | Per-core heatmap, E-cycle tracking, QoS allocation
Performance Cores (x10) | ARM Cortex, 3900 MHz     | Per-core heatmap, P-cycle tracking, priority routing
NVIDIA GPU              | SM @ 2398 MHz, 20 W TDP  | GPU-cycle billing, power-based utilization, real-time charts
Memory                  | Unified CPU+GPU memory   | Memory usage monitoring per model
Networking              | Gigabit Ethernet         | Reverse proxy, WebSocket real-time streaming

Core Features

Everything you need to manage, monitor, and bill Ollama inference workloads on your GB10.

Real-time Monitoring

Live CPU and GPU metrics streamed via WebSocket. Per-core heatmaps with E/P core differentiation, updated every second.
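As an illustration, the stream can be consumed outside the dashboard with a few lines of Python; a minimal sketch, assuming a /ws/metrics path and payload fields that are not the documented API:

    import asyncio
    import json

    import websockets  # pip install websockets

    async def watch_metrics(url="ws://192.168.1.103:3001/ws/metrics"):
        # Hypothetical payload: one sample per second with global CPU/GPU
        # usage and a 20-entry per-core list (0-9 E-cores, 10-19 P-cores).
        async with websockets.connect(url) as ws:
            async for message in ws:
                sample = json.loads(message)
                print(f"cpu={sample['cpu_percent']:5.1f}%  "
                      f"gpu={sample['gpu_percent']:5.1f}%")

    asyncio.run(watch_metrics())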

QoS Rule Engine

Condition-based rules combining user, model, time, and resource thresholds. Actions include throttle, deny, allocate cores, and priority routing.

CPU Core Allocation

Assign specific E-cores and P-cores to users or models via QoS rules. Visual core allocation feedback on the heatmap with yellow ring indicators.
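On Linux, allocation of this kind is typically enforced through CPU affinity. A minimal sketch, assuming the GB10's numbering of cores 0-9 as E-cores and 10-19 as P-cores (the helper is illustrative, not the backend's actual code):

    import os

    E_CORES = set(range(0, 10))   # efficiency cores C0-C9
    P_CORES = set(range(10, 20))  # performance cores C10-C19

    def pin_to_cores(pid: int, cores: set[int]) -> None:
        # Restrict a process (e.g. an Ollama worker) to the given cores.
        os.sched_setaffinity(pid, cores)

    # Example: reserve the five fastest P-cores for this process.
    pin_to_cores(os.getpid(), {10, 11, 12, 13, 14})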

GPU Tracking

Falls back to power-based utilization estimation when NVML reports 0%. GPU cycle counting per request, with real-time charts and per-user billing integration.

Usage Billing

Per-transaction and per-CPU-cycle billing modes. E/P/GPU cycle breakdown, CO2/year estimation, and per-user invoice generation.

Multi-user Management

Role-based access (Admin, Operator, User). API key authentication, JWT tokens, concurrent request limits, and full audit trail.
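A hedged sketch of what API-key checking can look like as a FastAPI dependency; the header name, key store, and route are assumptions (QOSCPU presumably backs this with its PostgreSQL user tables):

    from fastapi import Depends, FastAPI, Header, HTTPException

    app = FastAPI()

    # Hypothetical in-memory key store standing in for the real database.
    API_KEYS = {"qc_demo_key": {"user": "alice", "role": "User"}}

    def require_api_key(x_api_key: str = Header(...)) -> dict:
        account = API_KEYS.get(x_api_key)
        if account is None:
            raise HTTPException(status_code=401, detail="invalid API key")
        return account

    @app.get("/api/me")
    def whoami(account: dict = Depends(require_api_key)):
        return account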

Smart Alerts

Threshold-based alerts for CPU, GPU, memory, and queue depth. Severity levels (info, warning, critical) with acknowledgement workflow.
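A minimal sketch of the severity ladder; the thresholds below are illustrative, not the shipped defaults:

    SEVERITIES = [("critical", 95.0), ("warning", 85.0), ("info", 70.0)]

    def classify(metric: str, value: float) -> str | None:
        # Return the highest severity whose threshold the value exceeds.
        for severity, threshold in SEVERITIES:
            if value >= threshold:
                return f"{severity}: {metric} at {value:.1f}%"
        return None  # below every threshold: no alert

    print(classify("cpu", 91.2))  # warning: cpu at 91.2%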

Go CLI Client

Compiled Go binary for direct chat with Ollama through the QoS proxy. API key auth, model selection, streaming responses with token stats.

Live Dashboard

At-a-glance system health with KPI cards, real-time charts, core heatmaps, and active request tracking.

System Overview

The dashboard provides instant visibility into your GB10 inference workload:

  • Active requests, CPU/GPU usage, and queue depth KPI cards
  • Real-time 60-second area charts for CPU and GPU
  • CPU core heatmap with E-core (teal) and P-core (purple) differentiation
  • Active request table with user, model, duration, and resource allocation
  • Ollama-allocated cores highlighted with yellow ring
[Dashboard mockup: KPI cards (Active Requests 3, CPU Usage 42.3%, GPU Usage 28.5%, Queue Length 1); 60-second CPU and GPU charts; E-core/P-core heatmap; active request table (alice, mistral:7b, 3.2s, running / bob, qwen2.5:72b, 12.7s, running / admin, mistral:7b, 1.1s, queued).]

CPU & GPU Monitoring

Dedicated detail pages for CPU and GPU with historical charts, per-core breakdown, and circular gauges.

CPU Detail Page

Deep dive into CPU performance with per-core resolution:

  • Physical/logical core count, global usage, active requests
  • Real-time + historical area charts with selectable time range (1h, 6h, 24h, 7d)
  • Per-core multi-series chart: teal for E-cores, purple for P-cores (a sampling sketch follows the mockup below)
  • Circular gauge with color-coded thresholds (green/yellow/red)
  • Core heatmap with allocation indicators
[CPU detail mockup: 20 physical and 20 logical cores, 42.3% global usage, 3 active requests; circular gauge, real-time chart with 1h/6h/24h/7d ranges, and the E-core/P-core per-core chart.]
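The per-core data behind this view can be reproduced with psutil; a sketch, assuming the GB10's 0-9 E-core / 10-19 P-core numbering (the collector itself is illustrative, not the backend's):

    import psutil  # pip install psutil

    def sample_cores(interval: float = 1.0) -> dict:
        per_core = psutil.cpu_percent(interval=interval, percpu=True)
        return {
            "e_cores": per_core[:10],    # efficiency cores C0-C9
            "p_cores": per_core[10:20],  # performance cores C10-C19
            "global": sum(per_core) / len(per_core),
        }

    print(sample_cores())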

GPU Detail Page

Full GPU monitoring with power-based utilization fallback:

  • GPU name, utilization, temperature, and power consumption
  • Circular utilization gauge with threshold colors
  • Memory usage progress bar (used/total)
  • Real-time + historical GPU utilization charts
  • Power-based estimation when NVML returns 0% (GB10 specific; see the sketch below)
[GPU detail mockup: NVIDIA GB10, 28.5% utilization, 45 °C, 14.3 W power draw; VRAM 4,218 MB / 7,831 MB (53.9%); real-time utilization chart.]
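The fallback itself is simple: when NVML reports 0%, utilization is estimated from power draw relative to the 20 W TDP. A sketch with pynvml (the heuristic follows the description above; the exact backend formula may differ):

    import pynvml  # pip install nvidia-ml-py

    GB10_TDP_W = 20.0

    pynvml.nvmlInit()
    handle = pynvml.nvmlDeviceGetHandleByIndex(0)

    util = pynvml.nvmlDeviceGetUtilizationRates(handle).gpu  # percent
    if util == 0:
        power_w = pynvml.nvmlDeviceGetPowerUsage(handle) / 1000.0  # mW -> W
        util = min(100.0, 100.0 * power_w / GB10_TDP_W)

    print(f"GPU utilization: {util:.1f}%")
    pynvml.nvmlShutdown()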

QoS Rule Engine

Define conditions and actions to intelligently manage inference workloads: prioritize users, allocate cores, and throttle or deny requests.

Rules Management

The rule engine evaluates conditions in priority order and triggers actions:

  • Conditions: user, model, time window, CPU/GPU threshold, queue depth, concurrent requests
  • Actions: allow, deny, throttle, set priority, allocate E/P-cores, rate limit
  • Toggle rules active/inactive with instant effect
  • Priority-based evaluation (0-100, higher = first); see the evaluation sketch below the table
  • Per-user dropdown for user-specific rules
Name                    | Priority | Conditions                    | Actions                | Active
Limit Large Models      | 90       | model contains "72b"          | allocate P-cores 10-14 | yes
Night Throttle          | 70       | time 22:00-06:00              | throttle 50%           | yes
Guest Rate Limit        | 30       | role = "user", concurrent > 2 | deny                   | yes
CPU Overload Protection | 95       | cpuUsage > 90%                | deny, priority low     | yes
Admin Priority          | 99       | user = "admin"                | allow, priority high   | yes
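A condensed sketch of that evaluation order, with the rule shape inferred from the table above (not the actual schema):

    from dataclasses import dataclass
    from typing import Callable

    @dataclass
    class Rule:
        name: str
        priority: int                      # 0-100, higher evaluated first
        condition: Callable[[dict], bool]  # request/system context -> match?
        action: str
        active: bool = True

    RULES = [
        Rule("Admin Priority", 99, lambda ctx: ctx["user"] == "admin", "allow, priority high"),
        Rule("CPU Overload Protection", 95, lambda ctx: ctx["cpu_usage"] > 90, "deny"),
        Rule("Limit Large Models", 90, lambda ctx: "72b" in ctx["model"], "allocate P-cores 10-14"),
    ]

    def evaluate(ctx: dict) -> str:
        for rule in sorted(RULES, key=lambda r: r.priority, reverse=True):
            if rule.active and rule.condition(ctx):
                return rule.action
        return "allow"  # default when nothing matches

    print(evaluate({"user": "bob", "model": "qwen2.5:72b", "cpu_usage": 40}))
    # -> allocate P-cores 10-14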

Usage Billing

Per-user invoicing with CPU/GPU cycle tracking, CO2 estimation, and configurable pricing modes.

Billing Dashboard

Comprehensive cost tracking and environmental impact:

  • Three billing modes: per-transaction, per-CPU-cycle, combined
  • E-core, P-core, and GPU cycle breakdown per user
  • CO2/year estimation using the Swiss grid factor (128 g CO2/kWh); a worked sketch follows below
  • Cost distribution charts (bar + pie)
  • Per-user invoice table with override editing
  • Selectable period: 24h, 7 days, 30 days, all time
[Billing mockup: period selector (24h / 7d / 30d / all), mode toggle (Transaction / CPU Cycle / Combined); totals: invoice CHF 4.82 (transactions CHF 2.10 + CPU cycles CHF 2.72), 42 requests, 1.2B GPU cycles, 2.4 kg CO2/year; cost-per-user bar chart and distribution pie (admin 40%, alice 25%, bob 20%, charlie 15%).]

User  | Requests | E-cycles | P-cycles | GPU-cycles | CO2/yr | Transaction | Total
admin | 18       | 2.1B     | 4.8B     | 820M       | 1.1 kg | CHF 0.90    | CHF 1.93
alice | 12       | 1.4B     | 3.2B     | 310M       | 0.7 kg | CHF 0.60    | CHF 1.28
bob   | 8        | 0.9B     | 2.1B     | 90M        | 0.5 kg | CHF 0.40    | CHF 0.97
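For the combined mode, the arithmetic looks roughly like this; the prices and the 2 W average draw are illustrative placeholders, and only the 0.128 kg CO2/kWh grid factor comes from the text above:

    SWISS_GRID_KG_PER_KWH = 0.128

    PRICE_PER_TX_CHF = 0.05       # hypothetical per-transaction price
    PRICE_PER_GCYCLE_CHF = 0.15   # hypothetical price per 1e9 CPU cycles

    def invoice_chf(tx: int, e_cycles: float, p_cycles: float) -> float:
        cycles_g = (e_cycles + p_cycles) / 1e9
        return tx * PRICE_PER_TX_CHF + cycles_g * PRICE_PER_GCYCLE_CHF

    def co2_per_year_kg(avg_power_w: float) -> float:
        # Annualized emissions for a sustained average power draw.
        kwh_per_year = avg_power_w / 1000.0 * 24 * 365
        return kwh_per_year * SWISS_GRID_KG_PER_KWH

    print(f"CHF {invoice_chf(18, 2.1e9, 4.8e9):.2f}")      # admin-like row
    print(f"{co2_per_year_kg(2.0):.1f} kg CO2/yr at 2 W")  # ~2.2 kg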

System Architecture

A containerized microservice stack running on the GB10, orchestrated via Podman Compose.

[Architecture diagram: browser and API clients enter through NGINX on :3001, which routes /api/ and /ws/ to the FastAPI backend and / to the React SPA (Vite); the backend uses PostgreSQL, Redis (cache), and Ollama (Podman + CUDA) on the NVIDIA GB10 GPU; the Go CLI client connects with an API key; all containers share the qoscpu-net Podman bridge.]
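A hypothetical podman-compose sketch of this topology; service names, images, and build paths are illustrative, not the project's actual compose file:

    version: "3.8"
    services:
      nginx:                       # single public entry point on :3001
        image: docker.io/library/nginx:alpine
        ports: ["3001:80"]
        depends_on: [backend, frontend]
      backend:                     # FastAPI, mounted at /api/ and /ws/
        build: ./backend
        environment:
          DATABASE_URL: postgresql+asyncpg://qoscpu:qoscpu@db/qoscpu
          REDIS_URL: redis://redis:6379/0
          OLLAMA_URL: http://ollama:11434
      frontend:                    # React SPA served at /
        build: ./frontend
      db:
        image: docker.io/library/postgres:16
      redis:
        image: docker.io/library/redis:7
      ollama:
        image: docker.io/ollama/ollama
        devices: ["nvidia.com/gpu=all"]  # CDI GPU passthrough under Podman

    networks:
      default:
        name: qoscpu-net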

Technology Stack

  • FastAPI + SQLAlchemy: async Python backend with Alembic migrations
  • React 18 + TypeScript: Vite 6 SPA with TailwindCSS and Recharts
  • Go CLI Client: zero-dependency binary, cross-compiled for ARM64
  • NGINX: reverse proxy with SSL termination
  • PostgreSQL 16: persistent storage with async driver
  • Redis: caching and real-time pub/sub
  • Ollama: LLM inference engine with CUDA acceleration
  • Podman Compose: multi-container orchestration

Go CLI Client

A compiled, zero-dependency binary for direct Ollama chat through the QoS proxy. Available for Linux ARM64 and Windows x64.

Terminal Chat Interface

Chat directly with any Ollama model through the QoS proxy:

  • API key or JWT authentication
  • Interactive model selection from available models
  • Real-time streaming responses with token stats
  • Conversation history management
  • Commands: /quit, /clear, /model, /history, /help
  • 5.4 MB binary, no runtime dependencies
qoscpu-chat -- GB10
$ ./qoscpu-chat --host http://192.168.1.103:8010 --user admin --pass *****

  QOSCPU Chat Client
  GB10 inference via QoS proxy

Authenticating... OK
Loading models...

Available models:

   1) mistral:7b             7.2B Q4_K_M (llama)
   2) qwen2.5:72b           72.7B Q4_K_M (qwen2)
   3) translategemma:27b    27.4B Q4_K_M (gemma3)
   4) olmo-3:latest          7.3B Q4_K_M (olmo3)

Choice [1-4]: 1
Model: mistral:7b

you > Explain what Quality of Service means for LLM inference.

mistral:7b > Quality of Service (QoS) for LLM inference refers to the management and allocation of computational resources -- CPU cores, GPU cycles, memory, and network bandwidth -- to ensure fair and predictable performance across multiple concurrent users. It encompasses request prioritization, rate limiting, core pinning, and usage tracking to prevent any single user or workload from monopolizing the shared hardware.
[82 tokens, 16.4 tok/s, 5.0s total]

you > /quit
Goodbye!