KEY TAKEAWAYS

Baymard conducted extensive testing to assess ChatGPT-4’s capability to perform UX audits on 12 different webpages. The assessments compared the AI model’s UX recommendations to those of a qualified human UX professional.

The test results unveiled an 80% false-positive error rate and a 20% accuracy rate in the UX suggestions provided by ChatGPT-4.

When pitted against human experts, ChatGPT-4 successfully identified 26% of the UX issues in the webpage screenshot but only 14% of the actual UX issues on the live webpage. This discrepancy arises from the fact that interaction-related UX issues cannot be deduced from a static image.

On average, across the 12 webpages tested, ChatGPT-4 correctly identified 2.9 UX issues per page, while overlooking 18.5 UX issues on the live webpage and 9.4 UX issues in the webpage screenshot. It also generated 1.3 suggestions that could harm UX and 10.6 suggestions that proved unhelpful when compared to the recommendations made by human UX professionals.

The testing incorporated six highly trained UX benchmarkers from Baymard who relied on over 130,000 hours of extensive UX research.

Why This Test?

OpenAI recently enabled image uploads in ChatGPT-4, allowing users to submit webpage screenshots and request recommendations for UX improvements. While this feature initially appeared promising, with responses tailored to the uploaded screenshots and conveyed with a tone of high confidence, Baymard decided to conduct a rigorous evaluation to determine the accuracy of ChatGPT-4’s UX issue detection on webpages.

Test Methodology

  • For each trial, one of the 12 webpage screenshots was uploaded to ChatGPT-4 (the full versions are available in the article linked below).
  • ChatGPT-4’s response to the uploaded webpage screenshot was recorded (a sketch of how this step could be scripted follows the list).
  • ChatGPT-4’s recommendations were compared to those of a human UX professional who spent 2-10 hours analyzing the same webpage.
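
Baymard ran the test through the ChatGPT interface. For readers who want to reproduce a comparable screenshot-plus-prompt request programmatically, below is a minimal sketch against the OpenAI chat completions API; the model name, prompt wording, and file path are illustrative assumptions, not part of the original test setup.

```python
# Hypothetical sketch (not Baymard's setup): requesting a screenshot-based
# UX audit via the OpenAI API instead of the ChatGPT web UI.
import base64
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Encode the page screenshot as base64 so it can be sent inline.
with open("homepage_screenshot.png", "rb") as f:  # illustrative file path
    screenshot_b64 = base64.b64encode(f.read()).decode("utf-8")

response = client.chat.completions.create(
    model="gpt-4o",  # any vision-capable GPT-4-class model (assumption)
    messages=[{
        "role": "user",
        "content": [
            {"type": "text",
             "text": "Review this e-commerce webpage screenshot and list "
                     "specific UX issues and suggested improvements."},
            {"type": "image_url",
             "image_url": {"url": f"data:image/png;base64,{screenshot_b64}"}},
        ],
    }],
)

# The model's suggestions, to be compared against a human UX audit.
print(response.choices[0].message.content)
```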

The Results

The analysis of the 12 pages yielded the following discovery, accuracy, and error rates for ChatGPT-4 (a rough arithmetic check follows the list):

  • 14.1% UX discovery rate overall (on the live webpage)
  • 25.5% UX discovery rate for issues visible in the screenshot
  • 19.9% accuracy rate for ChatGPT’s suggestions
  • 80.1% false-positive error rate for ChatGPT’s suggestions (overall)
  • 8.9% false positives where ChatGPT’s suggestions were potentially harmful
  • 71.1% false positives where ChatGPT’s suggestions were unhelpful
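
The per-page averages from the key takeaways (2.9 issues found, 18.5 missed on the live page, 9.4 missed in the screenshot, plus 1.3 harmful and 10.6 unhelpful suggestions) roughly reproduce these rates. The small differences presumably stem from how the rates are averaged across the 12 pages; that is an assumption here, not something the article states. A minimal check:

```python
# Recompute the approximate rates from the per-page averages above.
found = 2.9           # UX issues correctly identified per page
missed_live = 18.5    # issues missed on the live webpage
missed_shot = 9.4     # issues missed in the screenshot
harmful = 1.3         # potentially harmful suggestions
unhelpful = 10.6      # unhelpful (but not harmful) suggestions

suggestions = found + harmful + unhelpful  # ~14.8 suggestions per page

print(f"discovery rate, live page:  {found / (found + missed_live):.1%}")        # ~13.6% vs 14.1%
print(f"discovery rate, screenshot: {found / (found + missed_shot):.1%}")        # ~23.6% vs 25.5%
print(f"accuracy of suggestions:    {found / suggestions:.1%}")                  # ~19.6% vs 19.9%
print(f"false-positive rate:        {(harmful + unhelpful) / suggestions:.1%}")  # ~80.4% vs 80.1%
```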

GPT-4 Discovers 26% of UX Issues in the Screenshot, and 14% of UX Issues on the Webpage

The tests reveal that ChatGPT-4 detected 26% of the UX issues that a human UX professional identified in the same webpage screenshot. However, judging performance only against issues visible in a screenshot understates the gap: a human UX professional auditing the live webpage finds issues that a static image cannot show. Measured against all UX issues the human identified on the live webpage, ChatGPT-4 found only 14%, since it analyzed screenshots exclusively while the human UX professionals interacted with the live website.

Uploading screenshots and requesting an AI model to assess them inherently limits the ability to detect interactive UX issues. Discovering many UX issues necessitates interaction with the webpage, such as clicking buttons and hovering over images. Furthermore, it requires navigating between pages and considering information from other pages when evaluating the current page.

The 80% Error Rate: 1/8 Is Harmful, and 7/8 Is a Waste of Time

ChatGPT-4 exhibited an 80% false-positive error rate, with approximately 1/8 of these erroneous suggestions potentially causing harm to UX. For example:

  • Suggesting further simplification of LEGO’s already simplified footer.
  • Recommending that Overstock replace its pagination with infinite scrolling or a “load more” button, a change that could harm UX.

The majority of ChatGPT-4’s erroneous UX suggestions (7/8) were not harmful but were simply a waste of time. These suggestions were often generic or concerned elements not visible in the screenshot. Examples include recommending “mobile responsiveness” despite being provided with a desktop screenshot and suggesting features that the website already had in place.

ChatGPT-4 also occasionally offered outdated suggestions that no longer align with current UX best practices, which have shifted as user behavior has changed.

In Summary: ChatGPT-4 Is Not (Yet) Useful for UX Auditing

While large language models like ChatGPT have proven invaluable for various tasks in the UX domain, such as analyzing customer support emails, brainstorming sales copy, and transcribing videos, they fall short when it comes to performing UX audits. ChatGPT-4, in particular, exhibited a low discovery rate for UX issues and low accuracy in its suggestions, making it an unsuitable choice for UX auditing.

Given ChatGPT-4’s inability to identify a significant portion of UX issues and its low accuracy rate, it cannot serve as a genuinely useful supplemental tool for UX audits; a human expert would still have to sift through its responses to separate the few valid suggestions from the many false positives.

Considering the expenses associated with implementing website changes, relying on a low-quality UX audit by ChatGPT-4 is likely to yield a poor return on investment.

See the full article here: https://baymard.com/blog/gpt-ux-audit