Can AI Build Accessible Websites? A New Benchmark Reveals Surprising Results

Two Ways to Think About LLM Accessibility

The conversation around LLMs and accessibility can be viewed through two distinct lenses:

1. Can disabled users access LLM tools?

This asks whether a person with disabilities can actually use interfaces like ChatGPT or Claude. Are the apps themselves built with screen readers, keyboard navigation, and other assistive technologies in mind?

2. Do LLMs generate accessible code?

This is the question Joe Devon’s benchmark addresses: when you ask an LLM to build a website, will that website be usable by everyone? This is where the AIMAC (AI Model Accessibility Checker) benchmark comes in.

The Benchmark Results

The AIMAC benchmark tests the accessibility score of websites generated by various LLMs. The findings are notable:

OpenAI models score highest across the benchmark
Gemini 3 Pro ranks last among the top 10 models tested

This matters because developers increasingly rely on AI-generated code. If the leading models produce inaccessible websites by default, we’re potentially scaling exclusion.

About Joe Devon

Joe is the co-founder of Global Accessibility Awareness Day (GAAD) and Chair of the GAAD Foundation. His path to accessibility advocacy is interesting: he started as a developer, worked on early search engines, built the backend for americanidol.com, and founded a dev shop that grew to ~100 employees. Now he’s focused on ensuring AI doesn’t leave disabled users behind.

Why This Matters

As AI-generated code becomes more prevalent, benchmarks like AIMAC give us concrete ways to measure and improve LLM performance on accessibility. The gap between top and bottom performers suggests there’s significant room for improvement, and that model choice actually matters for building an inclusive web.