Lab · Instrument 03

Adversarial examples playground

A neural network can be confidently, catastrophically wrong on an input that looks completely normal to you. Pick or draw a digit, then add a gradient-crafted perturbation and watch the classifier flip — while the image looks unchanged. This is the family of attack at the centre of my research; here it runs live in your browser.

How it works

The classifier is a small multilayer perceptron — 784 inputs (one per pixel), a 128-unit hidden layer, ten outputs — trained on MNIST to about 98% accuracy. It is deliberately tiny and dependency-free: the weights are a ~150 KB file, and the forward pass is a few dozen lines of hand-written TypeScript. No TensorFlow.js, no ONNX runtime, nothing loaded from a CDN.

The attack is the interesting part. FGSM — the Fast Gradient Sign Method — asks a simple question: for each pixel, which way should I nudge it to make the model more wrong? That direction is the sign of the gradient of the loss with respect to the input, so the whole attack is x′ = x + ε · sign(∇ₓ loss). To compute ∇ₓ I run the same backpropagation the trainer used, but stop one layer early — at the image instead of the weights. PGD (Projected Gradient Descent) iterates that step and projects back into an ε-ball each time; it’s slower but stronger, and it’s the standard benchmark attack in the literature.

Because the perturbation is bounded by ε in the L∞ sense — no single pixel moves by more than ε — a small ε stays invisible to the eye while still crossing the model’s decision boundary. That is the whole unsettling point of adversarial examples: robustness to random noise says nothing about robustness to adversarial noise.

My MSc research applies exactly this idea outside the image domain — to intrusion-detection models on the Controller Area Network bus, where an attacker perturbs network features rather than pixels to slip malicious traffic past a deep-learning detector, and where adversarial training (folding these very examples back into training) is one of the defences. Digits are just the version you can see. Honest caveats: this is a plain MLP, not a hardened or adversarially-trained model, and a hand-drawn digit is normalised to roughly match MNIST — so treat it as an intuition pump, not a benchmark.