# Running a Small Language Model on an Xbox Series S

> An honest engineering report on porting a small language model to an Xbox Series S: what runs today at 71 tok/s on the Zen 2 CPU, and the constraints I have not solved yet.

Published: 2026-06-26
Canonical: https://gianlucamazza.it/en/blog/slm-on-xbox
Tags: Edge Inference, On-Device AI, ONNX Runtime, Xbox, SLM, UWP

## Why this matters

An Xbox Series S is a Zen 2 machine: eight cores at 3.6 GHz with AVX2, 10 GB of unified GDDR6, an RDNA 2 GPU, and a one-time ~$19 Dev Mode unlock that lets you sideload your own code. On paper it is a capable, cheap, and largely unexplored substrate for local inference. So I tried to run a small language model on one, with no cloud in the path.

The short version: it works. The console runs SmolLM2-360M-Instruct (INT4) at about 71 tokens per second, fully on-device, through ONNX Runtime GenAI. The longer version is more useful, because the interesting part of this port is not the demo. It is the set of constraints I hit, and which of them I have not solved. This article leads with the open problems on purpose. I would rather publish the boundary of what I know than a screenshot that hides it.

I maintain the project, `xllama`, in the open. Every number below comes from its benchmark logs and constraint notes, not from memory.

## What actually runs today

The working configuration is deliberately boring:

- **Model:** SmolLM2-360M-Instruct, INT4, 403 MB on disk, context window 2048, ChatML prompt template.
- **Runtime:** ONNX Runtime GenAI on the CPU execution provider (Zen 2).
- **Package:** a self-contained UWP MSIX with the model bundled inside it, sideloaded via the Xbox Device Portal.

The decode loop is the standard ORT GenAI token pump, with an abort flag so the gamepad B button can cancel a generation mid-stream:

```cpp
while (!OgaGenerator_IsDone(gen.get())) {
    // Cooperative cancel: stop before computing another token once the abort flag is set.
    if (params.abort_flag && params.abort_flag->load()) break;
    oga_check(OgaGenerator_GenerateNextToken(gen.get()), "GenerateNextToken");

    // Read back the token(s) produced by this step.
    const int32_t* next = nullptr;
    size_t n = 0;
    oga_check(OgaGenerator_GetNextTokens(gen.get(), &next, &n), "GetNextTokens");
    for (size_t i = 0; i < n; ++i) {
        // Stream-decode each token and forward non-empty text to the callback.
        const char* piece = nullptr;
        oga_check(OgaTokenizerStreamDecode(stream.get(), next[i], &piece), "decode");
        if (piece && *piece && params.on_token) params.on_token(piece);
    }
}
```
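
The throughput figures in this article are decode tokens per second. As a minimal sketch of how such a rate can be derived around a loop like the one above: `DecodeTimer` and `tokens_per_second` are hypothetical helpers, not ORT GenAI types, and not necessarily how the xllama benchmark harness measures it.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>

// Hypothetical helper: converts a token count and an elapsed wall-clock
// duration into tokens per second.
inline double tokens_per_second(std::size_t tokens,
                                std::chrono::duration<double> elapsed) {
    assert(elapsed.count() > 0.0 && "elapsed time must be positive");
    return static_cast<double>(tokens) / elapsed.count();
}

// Hypothetical timer: call start() just before the decode loop, on_token()
// once per generated token, and rate() after the loop has finished.
class DecodeTimer {
public:
    using Clock = std::chrono::steady_clock;  // monotonic, safe for intervals

    void start() {
        start_ = Clock::now();
        tokens_ = 0;
    }
    void on_token() { ++tokens_; }
    std::size_t tokens() const { return tokens_; }

    // Tokens per second since start(); only meaningful once time has passed.
    double rate() const {
        return tokens_per_second(tokens_, Clock::now() - start_);
    }

private:
    Clock::time_point start_{};
    std::size_t tokens_ = 0;
};
```

Whether prompt processing is counted depends on where `start()` is called, so that choice should be stated alongside any number produced this way.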

Measured on the console (ORT GenAI 0.13.2, INT4, n_ctx 2048):

| Threads | Decode tok/s | Peak working set |
| ------- | ------------ | ---------------- |
| auto    | 66.9         | 704 MB           |
| 4       | 71.4         | 771 MB           |
| 6       | 68.0         | 772 MB           |
| 8       | 28.2         | 771 MB           |

That is the baseline. Everything below is what stands between this and a result I would actually call finished.

## The open problems

### 1. I cannot prove the GPU is doing anything

This is the one that bothers me most. The whole reason to pick a console over a Raspberry Pi is the RDNA 2 GPU, and I cannot yet confirm I am using it.

SmolLM2-360M loads under the DirectML execution provider without crashing, once I disable the CPU memory arena and memory pattern planner:

```json
"session_options": {
  "provider_options": [
    { "dml": { "enable_cpu_mem_arena": "0", "enable_mem_pattern": "0" } }
  ]
}
```

It then produces output at about 71.7 tok/s, which is suspiciously close to the CPU baseline. That number is the problem, not the reassurance. ORT can silently fall back to CPU for operators the GPU path does not support, and a result indistinguishable from the CPU number is exactly what a silent fallback looks like. To tell the two apart I need a D3D profiler (PIX) or GPU hardware counters, and I have no profiling instrumentation on the console yet. Until I do, the honest finding is narrow: the 360M model fits the GPU memory pool, but whether it executes on the GPU is unconfirmed.

### 2. The GPU memory pool is small, and that caps everything

Larger models do not get this far. When `OgaCreateModel` initializes the DirectML provider for a model whose weights exceed the available GPU pool, the allocator returns null and the next use of that pointer faults:

```text
OgaCreateModel failed: SEH 0xC0000005 (STATUS_ACCESS_VIOLATION)
```

Judging by where that boundary falls, the usable GPU-accessible pool for a UWP app on the Series S looks like roughly 768 MB. Phi-3.5-mini INT4 (~2.2 GB) reliably OOMs; the 360M model (403 MB) does not. I want to be careful here: that 768 MB figure is an inference from observed out-of-memory behavior in my own tests, not a documented Xbox platform specification. I do not treat it as an authoritative claim about the console's internal memory layout. But as an engineering ceiling it is consistent, and it means any model near or above 1 GB is off the table for the GPU path regardless of whether problem 1 is ever solved.
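
One way to keep an oversized model away from the DirectML allocator is a size check before model creation. The following is a minimal sketch, assuming the roughly 768 MB ceiling inferred above and treating weight size as a lower bound on GPU memory demand. `choose_provider`, `Provider`, and the constants are hypothetical names, not part of ORT GenAI or xllama.

```cpp
#include <cassert>
#include <cstdint>

constexpr std::uint64_t kMB = 1000ull * 1000ull;  // decimal MB, an assumption

// ASSUMPTION: ceiling inferred from out-of-memory behavior in this article,
// not a documented platform constant.
constexpr std::uint64_t kObservedGpuPoolBytes = 768 * kMB;

enum class Provider { Cpu, DirectML };

// Hypothetical pre-flight check. Weight size is only a lower bound on GPU
// memory demand (KV cache and activations come on top), so passing this
// check does not guarantee a successful load; failing it predicts an OOM.
inline Provider choose_provider(
    std::uint64_t weight_bytes,
    std::uint64_t pool_bytes = kObservedGpuPoolBytes) {
    assert(pool_bytes > 0);
    return weight_bytes < pool_bytes ? Provider::DirectML : Provider::Cpu;
}
```

With the figures above, a 403 MB model selects DirectML, while a model of about 2.2 GB is routed to the CPU path before the allocator is ever asked.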

### 3. The disk budget is tighter than the RAM budget

Before a model can OOM, it has to fit on disk. A freshly activated Dev Mode partition gives roughly 2.2 to 2.5 GB of free space, and deployment briefly needs about twice the package size because the MSIX is staged before it installs. In practice that means an on-disk model budget under 600 MB if I want to bundle it in the package, and a deploy that fails with `0x80070070` (disk full) if I push past it. The 403 MB model fits with room to spare. A 1.4 GB model does not, even though the console has 10 GB of RAM. Disk, not memory, is the first wall.
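
The staging arithmetic can be written down as a quick check. This is a sketch under the 2x staging rule described above, with hypothetical names (`deploy_fits`, `kStagingFactor`) and the model size used as a lower bound on the package size.

```cpp
#include <cassert>
#include <cstdint>

constexpr std::uint64_t kMB = 1000ull * 1000ull;  // decimal MB, an assumption

// From the article: the MSIX is staged before it installs, so a deploy
// transiently needs about twice the package size.
constexpr std::uint64_t kStagingFactor = 2;

// Necessary (not sufficient) condition for a deploy to succeed:
// kStagingFactor * package_bytes must fit into the free space.
inline bool deploy_fits(std::uint64_t package_bytes,
                        std::uint64_t free_bytes) {
    // Dividing instead of multiplying avoids overflow and is equivalent
    // for unsigned integers.
    return package_bytes <= free_bytes / kStagingFactor;
}

// The article's two cases, using model size as a lower bound on package size.
inline void check_article_cases() {
    assert(deploy_fits(403 * kMB, 2200 * kMB));    // needs ~0.8 GB of 2.2 GB
    assert(!deploy_fits(1400 * kMB, 2500 * kMB));  // needs 2.8 GB of 2.5 GB
}
```

The check only captures the staging rule. The tighter budget of under 600 MB quoted above is stricter than this bound, and the sketch does not attempt to derive it.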

### 4. The in-app download path is written but unproven

The way out of the disk budget is to stop bundling the model and fetch it at first launch. I implemented that: a `ModelDownloader` that streams from a Hugging Face endpoint in chunks via `HttpClient`, with a resolution chain of LocalState, then the installed package, then a download fallback. The code exists and compiles. It has never actually run on the console, because the build that ships always finds the bundled model first and never reaches the fallback. So whether plain HTTPS to Hugging Face works from inside the Xbox AppContainer is still an open question. To test it I have to ship a build with no bundled model on purpose, which I have not done.
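
The resolution order, and the reason the fallback never fires in a bundled build, can be illustrated with a small sketch. Everything here is hypothetical: `resolve_model`, `ModelSource`, and the injected `exists` and `download` callbacks stand in for the real `ModelDownloader` plumbing, whose actual interface is not shown in this article.

```cpp
#include <cassert>
#include <functional>
#include <optional>
#include <string>

// Where a model was found, in the priority order described above.
enum class ModelSource { LocalState, InstalledPackage, Download };

struct ResolvedModel {
    ModelSource source;
    std::string path;
};

// Hypothetical resolver. `exists` reports whether a model is present at a
// path; `download` fetches the model and returns its local path, or
// std::nullopt on failure. The first hit wins, so later steps never run.
inline std::optional<ResolvedModel> resolve_model(
    const std::string& local_state_path,
    const std::string& package_path,
    const std::function<bool(const std::string&)>& exists,
    const std::function<std::optional<std::string>()>& download) {
    assert(exists && download);  // both callbacks must be set
    if (exists(local_state_path))
        return ResolvedModel{ModelSource::LocalState, local_state_path};
    if (exists(package_path))
        return ResolvedModel{ModelSource::InstalledPackage, package_path};
    if (auto fetched = download())
        return ResolvedModel{ModelSource::Download, *fetched};
    return std::nullopt;
}
```

Injecting `exists` and `download` keeps the chain testable off-console: a stub `exists` that always returns false forces the download branch without having to ship a package that lacks the bundled model.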

### 5. More threads make it slower

The thread table above hides a sharp cliff. Four threads is optimal at 71.4 tok/s. Eight threads drops to 28.2 tok/s, a regression of roughly 60%. The cores are not the constraint; memory bandwidth is. INT4 decoding on Zen 2 saturates the available bandwidth well before it saturates the eight cores, and adding threads past that point just adds contention. The fix is not clever code; it is a pinned `intra_op_num_threads=4`. Still, it is a reminder that the usual "use all the cores" instinct is actively wrong on this hardware.
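
Picking the thread count from measurements rather than from the core count is a one-liner over the sweep. The sketch below uses the explicit-thread rows from the table above; `Sample`, `best_sample`, and `regression` are hypothetical names for illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

struct Sample {
    int threads;
    double tok_s;
};

// Returns the sample with the highest decode throughput.
inline Sample best_sample(const std::vector<Sample>& sweep) {
    assert(!sweep.empty());
    return *std::max_element(
        sweep.begin(), sweep.end(),
        [](const Sample& a, const Sample& b) { return a.tok_s < b.tok_s; });
}

// Fractional slowdown of `slow_tok_s` relative to `fast_tok_s`
// (0.60 means 60% slower).
inline double regression(double fast_tok_s, double slow_tok_s) {
    assert(fast_tok_s > 0.0);
    return (fast_tok_s - slow_tok_s) / fast_tok_s;
}

// Explicit-thread rows from the article's table (the "auto" row is omitted).
inline std::vector<Sample> article_sweep() {
    return {{4, 71.4}, {6, 68.0}, {8, 28.2}};
}
```

On these rows `best_sample` selects four threads, and `regression(71.4, 28.2)` evaluates to about 0.605, which is the "roughly 60%" quoted above.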

### 6. It only runs in Dev Mode

Everything here depends on the ~$19 Dev Mode unlock. There is no path to a retail console. That is fine for a research baseline and a reproducible build, but it is a hard limit on calling this something a normal user could install. I am not going to pretend otherwise.

## The scars already paid

Two problems are solved, but only after they cost me real time, so they are worth recording.

The first was a crash inside `OgaCreateModel` on a model that loaded fine on Linux. ORT 1.24.4 calls `std::filesystem::weakly_canonical()` to validate the path of an external `.onnx.data` file, and on Windows that walks the path one segment at a time, starting from the drive root. One of the intermediate segments is the Xbox AppContainer's user-manager directory, which the sandbox cannot read, so the walk hits `ACCESS_DENIED` and throws. The fix is to merge the external data into a single self-contained `model.onnx` at build time, so the validation path is never taken. A small Python script in CI does the merge.

The second was the XAML compiler crashing (`WMC9999`) during the build of a C++/WinRT project on a current Windows SDK. Rather than fight the markup compiler, I build the entire UI programmatically in C++ with `Windows.UI.Xaml.Controls`. No `.xaml` files, no metadata provider, no compiler pass to crash.

## What it would take to close each

To stop hand-waving about the GPU, I need on-device D3D profiling so I can confirm kernel execution and measure GPU tok/s against the CPU baseline for the same model and quantization. To make the GPU path worth confirming, I want a sub-400 MB INT4 candidate such as Qwen2.5-0.5B that comfortably fits the pool. To retire the disk budget as the binding constraint, I need to validate that Hugging Face download from the AppContainer actually works, then drop the bundled model from the package. None of these are research questions. They are instrumentation and legwork, which is usually where these projects actually live.

## FAQ

### How fast is it?

About 71 tokens per second of decode for SmolLM2-360M INT4 on the CPU execution provider at four threads, with a peak working set around 771 MB. Going to eight threads drops it to about 28 tok/s because memory bandwidth, not compute, is the bottleneck.

### Does the language model run on the Xbox GPU?

I cannot confirm it yet. The 360M model loads under the DirectML execution provider without an out-of-memory crash and produces output at about the same speed as the CPU path. That similarity is exactly what a silent CPU fallback would look like, and without on-device D3D profiling I cannot tell the two apart. The confirmed result today is CPU-only inference.

### Can a normal user install this on their Xbox?

No. It requires Xbox Dev Mode, a one-time paid unlock, and there is no path to a retail console. This is a reproducible research baseline, not a consumer application.

### Why such a small model?

Disk and memory budgets. The Dev Mode partition has only a few gigabytes free, deployment needs roughly twice the package size during install, and the GPU-accessible memory pool appears to be around 768 MB. A 403 MB INT4 model fits all three; a multi-gigabyte model fails the disk check before it ever loads.

---

The full build, benchmark logs, and constraint notes are in the [xllama repository](https://github.com/gianlucamazza/xllama). "Xbox" is a Microsoft trademark; this is an independent research project and is not affiliated with Microsoft. If you have run inference on console hardware and have profiling data I do not, I would like to compare notes.