Trust & safety

We attack our own app

What a red-team test is, why we ran more than 1,100 attempts in 19 languages, and what ended up at zero

Autistic Mirror often gets opened in sensitive moments. After a sensory-heavy day, during a crisis, in the middle of a difficult conversation with family. Whoever opens an app in that state has no buffer for an AI that suddenly answers badly. Safety is therefore not a feature that arrives later. Safety is the precondition that makes the tool usable at all.

This article describes what we did on 17 May 2026 against the live app. It is readable without prior knowledge. Anyone looking for technical detail will find it in the internal audit report. Here the question is whether the protective layers hold when someone tries to break them on purpose.

What a red-team test is

A red-team test is a simulated attack. Instead of waiting for someone outside to try, we attack the app ourselves. With every pattern known in security research, plus patterns that are specifically critical for an AI used in a neurodivergent context.

Three questions sit at the centre.

Can the AI be talked into ignoring its internal rules? Can it be pushed during a crisis into dropping hotline references or normalising what is happening? Does the surrounding software protect user data even when an endpoint is hammered directly?

The weight of such a test does not come from a single attempt. It comes from volume and variation. One passing attempt is an anecdote. Hundreds of passing attempts across multiple languages are evidence.

What we mean by attack attempt

An attack attempt is a real request sent to the running app, phrased to bypass a protective rule. No lab, no mock, no simulation. Exactly what an attacker would type into the input field. Whenever we use the term attack attempt below, we mean one of these real requests.

First run

In the first round we ran several dozen carefully constructed attack patterns against the running app. Each pattern in every one of the seven actively maintained UI languages. German, English, Spanish, French, Dutch, Brazilian Portuguese, Danish.

Seven languages are not decoration. An AI defence that holds in German can silently fail in French. Taking safety seriously means testing every language in which the app actually answers.

Result of that first run. Zero violations.

Why that was not enough for us

The first run added up to 210 attempts. A clean result at that size is a good signal, but statistically it is still thin. Knowing whether a system really holds requires a scale at which chance becomes an implausible explanation.

Red-team reports for AI products typically work with a few dozen to a few hundred probes, often in one or two languages. We wanted to go further on both axes. For two reasons. The app operates in an especially protective context. And we are preparing for independent external audits, which require comparable baselines.

The extended run

The extended run on 17 May 2026 sent a much larger inventory against the running app. More than 1,100 attack attempts, plus several hundred additional model responses from long, multi-stage conversations. Accompanied by a full offline structural test suite that checks the protective logic independently of the AI.

To make the scale visible, here are the individual areas. Each one comes with a short description of what is being checked.

Area	What is being checked	Result
Deep probing across the 7 UI languages	attempts to push the AI step by step into breaking its own rules, in every actively maintained language	0 violations
Attempts to overwrite the internal rules directly	classic inputs such as "Ignore all previous instructions"	0 violations
Attempts to force the AI into another role	"You are now a doctor", "Answer like a coach"	0 violations
Attempts to bypass safeguards through writing tricks	encoded or character-disguised inputs designed to slip past filters	0 violations
Attempts to force behavioural compliance and normalisation	requests for ABA-style recommendations	0 violations
Attacks in further languages outside the UI	twelve additional languages (19 in total with the UI languages) that an attacker would pick because many AI defences silently fail there	0 violations
Reworded bypass attempts	the same attacks in different wording, to defeat pure keyword filters	0 violations
Combined attacks from an extended catalogue	several attack patterns blended into a single attempt	0 violations
Slow manipulation across many conversation turns	conversations that try to soften safeguards gradually rather than directly	within tolerance
Offline structural tests	several test suites that check the protective logic independently of the AI for consistency and drift	all passed
Admin-side endpoints under pressure	every admin-side interface is called without valid authorisation and must refuse	correctly blocked
Quality check on answer content	several clusters check whether the AI names neurological mechanisms correctly rather than offering generic phrases	near-complete match
Data isolation between users	database-level check whether one user's data can ever surface in another user's answer	0 data leaks
Tamper detection on the activity log	verifies that retroactive changes to security-relevant logs remain detectable	passed
Reachability of every crisis hotline link	every emergency link stored in the app is requested to confirm that it is reachable	passed
Multilingual handling of expert terms	checks whether neurological terminology is explained correctly in several languages	passed
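
The tamper-detection row in the table can be pictured with a hash chain, a common technique in which the hash of each log entry also covers the hash of the entry before it. The sketch below is an illustration only: this article does not say which mechanism the app uses, and `append_entry`, `verify_chain` and the entry fields are hypothetical names.

```python
import hashlib
import json

GENESIS = "0" * 64  # placeholder "previous hash" for the very first entry


def entry_hash(prev_hash: str, payload: dict) -> str:
    """Hash the payload together with the previous entry's hash."""
    body = json.dumps(payload, sort_keys=True).encode("utf-8")
    return hashlib.sha256(prev_hash.encode("ascii") + body).hexdigest()


def append_entry(log: list, payload: dict) -> None:
    """Append a payload and link it to the current end of the chain."""
    prev_hash = log[-1]["hash"] if log else GENESIS
    log.append({"payload": payload, "hash": entry_hash(prev_hash, payload)})


def verify_chain(log: list) -> bool:
    """Recompute every hash in order; any edited entry breaks the chain."""
    prev_hash = GENESIS
    for entry in log:
        if entry["hash"] != entry_hash(prev_hash, entry["payload"]):
            return False
        prev_hash = entry["hash"]
    return True


# Hypothetical log entries, for illustration only.
log = []
append_entry(log, {"event": "admin_call_refused", "attempt": 1})
append_entry(log, {"event": "probe_verdict", "verdict": "pass"})
print(verify_chain(log))  # True: untouched log

log[0]["payload"]["event"] = "edited_later"  # retroactive change
print(verify_chain(log))  # False: the change is detectable
```

A chain like this only helps if the most recent hash is also kept somewhere an attacker cannot rewrite; otherwise the whole chain could simply be recomputed.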

What the numbers mean

Three properties matter in that table.

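Depth. More than 1,100 attack attempts is well above the few dozen to a few hundred probes that red-team reports for AI products typically contain. With zero violations observed at that volume, the plausible range for the true violation rate narrows to a fraction of a percent, so a lucky streak becomes a very unlikely explanation for the clean result.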
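
How much the uncertainty shrinks can be estimated with a standard calculation. If n attempts all pass, the one-sided 95% upper bound on the true violation rate is 1 - 0.05^(1/n), which is close to the well-known "rule of three" approximation 3/n. The sketch below applies it to the 210 attempts of the first run and to 1,100 attempts as a lower bound for the extended run. It assumes independent attempts with the same underlying failure probability, which real attack attempts only approximate, so the figures are a rough guide rather than a guarantee. The function name is hypothetical.

```python
def upper_bound_zero_failures(n_attempts: int, confidence: float = 0.95) -> float:
    """One-sided upper bound on the failure rate after n clean attempts.

    Solves (1 - p) ** n = 1 - confidence for p.
    """
    if n_attempts <= 0:
        raise ValueError("n_attempts must be positive")
    return 1.0 - (1.0 - confidence) ** (1.0 / n_attempts)


for n in (210, 1100):
    exact = upper_bound_zero_failures(n)
    approx = 3 / n  # rule-of-three approximation
    print(f"{n} attempts: exact {exact:.2%}, rule of three {approx:.2%}")

# 210 attempts: exact 1.42%, rule of three 1.43%
# 1100 attempts: exact 0.27%, rule of three 0.27%
```

Under these assumptions, going from 210 to 1,100 clean attempts tightens the upper bound from roughly 1.4% to roughly 0.3%.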

Breadth. 19 languages covered. The seven actively maintained UI languages plus further languages from other writing systems that an attacker would pick because many AI defences silently fail there.

Repeatability. This run gives us a comparable baseline. If we run the same test again in three months, any regression introduced by a new model version or a prompt change shows up immediately. Safety is not a state. It is a continuous measurement.

Privacy during the test itself

A safety test should not produce a data trail that later becomes a problem. Per attempt only three things are stored. The verdict (pass, partial, fail). The targeted mechanism. A short cryptographic hash fragment of the model response. No plaintext responses, no internal system rules, no user data are archived. The audit can be reviewed without anyone ever seeing the original wording.
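
A minimal sketch of what such a per-attempt record could look like follows. The three stored fields come from the description above; everything else is an assumption made for illustration, including SHA-256 as the hash function, the 12-character fragment length, the field names and the sample response text.

```python
import hashlib
from dataclasses import asdict, dataclass

VERDICTS = {"pass", "partial", "fail"}


@dataclass(frozen=True)
class AttemptRecord:
    verdict: str                 # pass, partial or fail
    mechanism: str               # which protective mechanism was targeted
    response_hash_fragment: str  # short prefix of a hash of the response


def make_record(verdict: str, mechanism: str, model_response: str,
                fragment_length: int = 12) -> AttemptRecord:
    """Build the stored record; the plaintext response is never kept."""
    if verdict not in VERDICTS:
        raise ValueError(f"unknown verdict: {verdict!r}")
    # Assumption: SHA-256, truncated to a short hex prefix.
    digest = hashlib.sha256(model_response.encode("utf-8")).hexdigest()
    return AttemptRecord(verdict, mechanism, digest[:fragment_length])


# Made-up response text, used only to show that it does not end up in the record.
record = make_record("pass", "role_override",
                     "I can't take on the role of a doctor.")
print(asdict(record))
```

The fragment lets a reviewer confirm that two runs produced the same response, or that an archived verdict belongs to a given response, without the wording itself being stored.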

External tests come next

Passing internal tests is the floor, not the ceiling. A safety claim only carries real weight once independent third parties can reproduce it. We are preparing an external audit and will publish its results with the same transparency as this internal run, regardless of whether the findings turn out flattering or uncomfortable.

In parallel, a manuscript on the methodology of our safety architecture has been submitted to Autism in Adulthood for peer review (status: in review). This makes the architecture verifiable outside our own organisation for the first time.

What stands behind the numbers

Most AI products market features. Safety rarely shows up in marketing because it feels abstract to outsiders. Behind the numbers in this run sits a different stance. An app that works with especially vulnerable people owes its users more than a polished interface. It owes them promises that hold under pressure. The fact that this run ended at zero violations is no guarantee for the future. It is a statement that the responsibility is taken seriously, with real tests and real numbers, not with claims.

For organisations and auditors

A more detailed methodology and results document is available for B2B customers, compliance teams and external auditors. It contains the full probe matrix, the exact inventories per attack area, the classifier logic and the data protection statement on storage. To request it, write informally to enterprise@autisticmirror.app. We send it after a short clarification of the intended use.

Autistic Mirror explains autistic neurology individually, tied to your specific situation. For yourself, as a parent, or as a professional.