Testing ChatGPT-4 for ‘UX Audits’ Reveals an 80% Error Rate & 14–26% Discoverability Rate

Key Takeaways

Baymard conducted extensive testing to assess ChatGPT-4’s capability to perform UX audits on 12 different webpages. The assessments involved comparing the AI model’s UX recommendations to those of a qualified human UX professional.

The test results unveiled an 80% false-positive error rate and a 20% accuracy rate in the UX suggestions provided by ChatGPT-4.

When pitted against human experts, ChatGPT-4 successfully identified 26% of the UX issues in the webpage screenshot but only 14% of the actual UX issues on the live webpage. This discrepancy arises from the fact that interaction-related UX issues cannot be deduced from a static image.

On average, across the 12 webpages tested, ChatGPT-4 correctly identified 2.9 UX issues, but overlooked 18.5 UX issues on the live webpage and 9.4 UX issues in the webpage screenshot. It also generated 1.3 suggestions that could potentially harm UX and 10.6 suggestions that proved unhelpful when compared to recommendations made by human UX professionals.

The testing incorporated six highly trained UX benchmarkers from Baymard who relied on over 130,000 hours of extensive UX research.

Why This Test?

OpenAI recently enabled image uploads in ChatGPT-4, allowing users to submit webpage screenshots and request recommendations for UX improvements. While this feature initially appeared promising, with responses tailored to the uploaded screenshots and conveyed with a tone of high confidence, Baymard decided to conduct a rigorous evaluation to determine the accuracy of ChatGPT-4’s UX issue detection on webpages.

Test Methodology

  • A screenshot of each of the 12 webpages was uploaded to ChatGPT-4.
  • ChatGPT-4’s response to each uploaded screenshot was recorded.
  • ChatGPT-4’s recommendations were compared to those of a human UX professional who spent 2-10 hours analyzing the same webpage.

The Results

The analysis of the 12 pages yielded the following discovery, accuracy, and error rates for ChatGPT-4:

  • 14.1% UX discovery rate overall (on the live webpage)
  • 25.5% UX discovery rate for issues visible in the screenshot
  • 19.9% accuracy rate for ChatGPT-4’s suggestions
  • 80.1% false-positive error rate for ChatGPT-4’s suggestions (overall)
  • 8.9% false positives where ChatGPT-4’s suggestions could be potentially harmful
  • 71.1% false positives where ChatGPT-4’s suggestions were unhelpful
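
As a rough illustration of how rates like these relate to the per-page averages quoted in the Key Takeaways, here is a minimal Python sketch. The formulas are an assumption, not Baymard's documented method: it treats the issues found by the human as "correct + overlooked" and the total number of suggestions as "correct + harmful + unhelpful". The names PageAverages and audit_rates are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class PageAverages:
    """Average per-page counts quoted in the Key Takeaways."""
    correct: float            # UX issues ChatGPT-4 identified correctly
    missed_live: float        # issues on the live page it overlooked
    missed_screenshot: float  # issues visible in the screenshot it overlooked
    harmful: float            # suggestions that could harm UX
    unhelpful: float          # suggestions that proved unhelpful


def audit_rates(p: PageAverages) -> dict[str, float]:
    # Assumed denominator: every suggestion is correct, harmful, or unhelpful.
    suggestions = p.correct + p.harmful + p.unhelpful
    return {
        # Share of the human-identified issues that ChatGPT-4 also found.
        "discovery_live": p.correct / (p.correct + p.missed_live),
        "discovery_screenshot": p.correct / (p.correct + p.missed_screenshot),
        # Share of ChatGPT-4's suggestions that were correct or wrong.
        "accuracy": p.correct / suggestions,
        "false_positive": (p.harmful + p.unhelpful) / suggestions,
        "harmful": p.harmful / suggestions,
        "unhelpful": p.unhelpful / suggestions,
    }


rates = audit_rates(PageAverages(2.9, 18.5, 9.4, 1.3, 10.6))
for name, value in rates.items():
    print(f"{name}: {value:.1%}")
```

Under these assumptions the results land near the reported rates but not exactly on them (for example, roughly 13.6% against the reported 14.1% live discovery rate). The summary does not say how the rates were aggregated across the 12 pages, so the gap may come from rounding of the averages or from averaging per-page rates instead.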

GPT-4 Discovers 26% of UX Issues in the Screenshot, and 14% of UX Issues on the Webpage

The tests reveal that ChatGPT-4 detected 26% of the UX issues present in the webpage screenshot when compared to a human UX professional. To understand how a human UX professional’s performance compares to the “ChatGPT-4 screenshot” approach, one must consider all UX issues identified by the human using the live webpage. In this context, ChatGPT-4 found 14% of the UX issues actually present on the live webpage because it exclusively analyzed screenshots, whereas human UX professionals interacted with the live website.

Uploading screenshots and requesting an AI model to assess them inherently limits the ability to detect interactive UX issues. Discovering many UX issues necessitates interaction with the webpage, such as clicking buttons and hovering over images. Furthermore, it requires navigating between pages and considering information from other pages when evaluating the current page.

The 80% Error Rate: 1/8 Is Harmful, and 7/8 Is a Waste of Time

ChatGPT-4 exhibited an 80% false-positive error rate, with approximately 1/8 of these erroneous suggestions potentially causing harm to UX. For example:

  • Suggesting further simplification of LEGO’s already simplified footer.
  • Recommending that Overstock, which uses pagination, adopt infinite scrolling or “load more,” which could harm UX.

The majority of ChatGPT-4’s erroneous UX suggestions (7/8) were not harmful but simply a waste of time. These suggestions were often generic or concerned elements not visible in the screenshot. Examples include recommending “mobile responsiveness” despite being given a desktop screenshot, and suggesting features that the website already had in place.

ChatGPT-4 also occasionally gave outdated suggestions: advice that no longer matches contemporary UX best practices because user behavior has since changed.

In Summary: ChatGPT-4 Is Not (Yet) Useful for UX Auditing

While large language models like ChatGPT have proven invaluable for various tasks in the UX domain, such as analyzing customer support emails, brainstorming sales copy, and transcribing videos, they fall short when it comes to performing UX audits. ChatGPT-4, in particular, exhibited limited discoverability of UX issues and low accuracy in its suggestions, making it an unsuitable choice for UX auditing.

Given ChatGPT-4’s inability to identify a significant portion of UX issues and its low accuracy rate, it cannot serve as a genuinely useful supplemental tool for UX audits. A human would still have to sift through its responses to separate the accurate suggestions from the erroneous ones.

Considering the expenses associated with implementing website changes, relying on a low-quality UX audit by ChatGPT-4 is likely to yield a poor return on investment.

See the full article here: https://baymard.com/blog/gpt-ux-audit
