AI Grades : Codex 5.3 vs 5.4 vs 5.5 Playwright User Login Test : F by Lance Cleveland ∥ Real-World AI Authority

Recently I’ve been using Codex 5.3 to perform some basic coding tests. The current focus is on crafting new Playwright End-To-End (E2E) test scripts. Codex 5.3 is only a few months old and was once considered a fully capable AI agent for writing code. Sadly, its output never rose above mediocre. For this latest iteration of AI testing, I opted to try the newer Codex models, 5.4 (last month) and 5.5 (last week), as contenders. Codex 5.5 is supposedly the best coding agent available today based on various third-party metrics, better even than the prior gold standard set by Anthropic’s Claude updates last month.

Our models include:

GPT 5.5 – April 2026
GPT 5.4 – March 2026
GPT 5.3-Codex – February 2026
Gemini 3 Flash Preview – December 2025

Let’s give Codex 5.5 a spin and compare it to versions 5.3 and 5.4, which are horribly, badly outdated by a few weeks.

The Task

Write a Playwright test script that logs two different types of users into the Store Locator Plus® SaaS platform.

As with other AI agent grading projects, this project has a standard AGENTS.md and supporting files to prime the agent on our environment. The PhpStorm IDE provides additional context from the code and supporting files. There is also a working memory of previously completed tasks that relate to the task at hand.

In this case one prior task that the AI agent is aware of was creating an AWS Secrets entry that holds our username and password pairs. This prevents the sensitive data from being baked into our testing scripts. It also allows the exact same data to be shared across platforms including my development system and the QC team’s laptops where they run the full test suite.

The AI agent knows how to fetch the list of usernames and passwords.

For this task it will need to load that list of user credentials and loop over it: log in each user, check that the proper page loads, then log the user out. Rinse and repeat for each user on the list.

In this test we only have two users on the list. No need to add more until we know the AI can handle the basics.

The Prompt

This is the prompt that was used, supported by the existing context noted above. The same prompt was given to all three model invocations after flushing the session context, so each began from the same starting point.

The AWS setup seems to now be working.
Time to write the Playwright test specification for "existing user login" tests.
We need to fetch the secrets from AWS and make sure each user can login to the SaaS application.
Both the local and staging environment are running and they contain both users listed in our AWS secret.
The app will need to go to the main website URL (base URL) , enter the username (email) and password, login, then make sure the main page comes up.
The main page will be different for superadmin users (noted as a value in the secret JSON payload) than for all other user levels (enterprise, professional, advanced).
For superadmin level users they will see the WordPress multisite network admin dashboard with widgets for "Code Versions", "System Info", and "Debug Log".
For all other users they will see the main SaaS application page (/wp-admin/admin.php?page=csl-slplus) showing the Store Locator Plus Info page with Documentation and News subpanels.
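The two landing pages described in the prompt can be captured as data before any browser automation is written. The following is a minimal TypeScript sketch, not what any of the models produced. The `LoginUser` field names (`username`, `password`, `level`) and the `expectedLandingFor` helper are hypothetical, since the post does not show the actual secret payload; only the widget names, panel names, and the application URL come from the prompt itself.

```typescript
// Hypothetical shape of one entry in the AWS secret. The real field names
// are not shown in this post, so treat these as assumptions.
interface LoginUser {
  username: string; // email address used to sign in
  password: string;
  level: string; // e.g. "superadmin", "enterprise", "professional", "advanced"
}

interface ExpectedLanding {
  urlFragment: string | null; // null when the prompt gives no URL to match on
  visibleText: string[]; // text the test should be able to find on the page
}

// Pick the landing-page expectations for a user based on the level
// recorded in the secret payload.
function expectedLandingFor(user: LoginUser): ExpectedLanding {
  if (user.level === "superadmin") {
    // Network admin dashboard: the prompt names three widgets but no URL.
    return {
      urlFragment: null,
      visibleText: ["Code Versions", "System Info", "Debug Log"],
    };
  }
  // Every other level lands on the Store Locator Plus Info page.
  return {
    urlFragment: "/wp-admin/admin.php?page=csl-slplus",
    visibleText: ["Documentation", "News"],
  };
}
```

In a Playwright spec, each `visibleText` entry could then be checked with something like `await expect(page.getByText(text)).toBeVisible()`, though the real locators would need to come from the application markup rather than from guesses.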

The Results : Grade F for All Three Models

The final results were notable in that the provided solutions, and the way they failed, were nearly identical regardless of the AI model being used on the backend. This was surprising enough to send me digging through log files and communication logs to see if the JetBrains AI Assistant was in fact changing models. It was, and the latest GPT 5.5 fared no better than GPT 5.4 or GPT 5.3.

That truly surprised me.

In the end, NONE of the models produced a working test on the first attempt. They all provided a generally acceptable architecture for running the test. The actual Playwright test script appears to be mostly viable, though I’ve yet to test it since none of the models were able to correctly set up the test environment.

The general architecture they all chose was this:

Use a shell script to fire off the Playwright test engine in the self-contained Docker container.
In that shell script, use AWS command line tools (AWS CLI) to use stored credentials to fetch the AWS Secret that contains our username/password list.
Store that username/password list in a JSON variable on the host (my laptop).
Encode the JSON array using Base64.
Pass the encoded username/password list to the Docker container as a variable named LOGIN_USERS_JSON_B64.
Have the test loop over the array of credentials in LOGIN_USERS_JSON_B64.
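The last step of that architecture, decoding the variable inside the test runner, can be sketched as follows. This is an illustration under assumptions, not the code any model generated: `decodeLoginUsers` is a hypothetical helper, the `LoginUser` field names are guesses, and the payload is assumed to be a JSON array. The point of the sketch is that a missing or empty value raises a descriptive error instead of silently producing an empty user list.

```typescript
// Hypothetical shape of one credential entry (field names are assumptions).
interface LoginUser {
  username: string;
  password: string;
  level: string;
}

// Decode the Base64-encoded JSON array handed to the test runner.
// Throws a descriptive error rather than returning an empty list.
function decodeLoginUsers(b64: string | undefined): LoginUser[] {
  if (!b64) {
    throw new Error("LOGIN_USERS_JSON_B64 is not set inside the test runner");
  }
  // atob yields one character per byte; TextDecoder turns those bytes
  // into UTF-8 text. Buffer.from(b64, "base64") would also work in Node.
  const bytes = Uint8Array.from(atob(b64), (ch) => ch.charCodeAt(0));
  const parsed: unknown = JSON.parse(new TextDecoder().decode(bytes));
  if (!Array.isArray(parsed) || parsed.length === 0) {
    throw new Error("LOGIN_USERS_JSON_B64 did not decode to a non-empty array");
  }
  return parsed as LoginUser[];
}
```

A spec file could call `decodeLoginUsers(process.env.LOGIN_USERS_JSON_B64)` once at load time, so a broken hand-off from the launch script shows up as a clear error message before any browser starts.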

Not only did they all use the same methodology, the naming conventions of the variables were nearly identical.

In addition, every single version failed because it did not properly send the encoded user list into the Docker container.

The snippet they all created (for the most part) that broke was this shell script:

export LOGIN_USERS_JSON_B64
LOGIN_USERS_JSON_B64="$(printf '%s' "$LOGIN_USERS_JSON" | base64 | tr -d '\n')"

docker compose --env-file "$ENV_FILE" run --rm test bash -c "npm install && npm run test"

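Guess what that does NOT do for the Playwright test engine? It does NOT expose LOGIN_USERS_JSON_B64 to the runtime where the test specifications execute. The export only affects the shell on the host; docker compose run does not forward host variables into the container unless the service definition or an -e flag passes them along.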

As such the list of users is empty and the test fails immediately.
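One plausible repair, assuming the compose service does not already list the variable under its `environment:` key, is to hand the value to `docker compose run` with an explicit `-e NAME=value` option placed before the service name. The sketch below only builds that argument list; `composeRunArgs` and its parameters are hypothetical names used for illustration, and this is not the fix any of the models proposed.

```typescript
// Build the argument list for `docker compose run`, forwarding selected
// variables into the container with explicit -e NAME=value options.
function composeRunArgs(
  envFile: string,
  service: string,
  passThrough: Record<string, string>,
  command: string
): string[] {
  const envFlags: string[] = [];
  for (const name of Object.keys(passThrough)) {
    // Options must come before the service name, so collect them first.
    envFlags.push("-e", `${name}=${passThrough[name]}`);
  }
  return [
    "compose", "--env-file", envFile,
    "run", "--rm",
    ...envFlags,
    service,
    "bash", "-c", command,
  ];
}
```

In the original shell script the equivalent change would be adding `-e LOGIN_USERS_JSON_B64="$LOGIN_USERS_JSON_B64"` after `--rm`. Note that a value passed this way is visible on the command line of the host process, which is one reason to be wary of shipping credentials into the container like this at all.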

GPT 5.3, 5.4, and 5.5 all failed the same way.

What Did GPT 5.5 Do Differently?

In essence, the test specification and launch scripts were virtually identical outside of non-critical variable naming conventions and some minor logic arrangements, neither of which has an impact on the viability of the test scripts.

GPT 5.5 certainly came back with results MUCH faster, in a few minutes versus 10+ minutes for GPT 5.3.

GPT 5.5 also burned more than 5x as many tokens as 5.3.

In addition, the “feed encoded data to Docker” approach is a horrible design strategy for this type of implementation.

Gemini 3 Flash Preview : Better Idea, Poor Attention To Details : D

As a final litmus test I changed company and model to Google’s Gemini 3 Flash Preview, a model that is from LAST YEAR and almost 6 months old. How can this possibly work? It is also not highly rated on many code completion metrics for AI agents.

Starting a new session context and switching models to Gemini 3 Flash created an entirely different experience and outcome.

First of all, the model stopped several times, fully inspecting the environment and related files. The agent often had to ask permission to run special OS commands (find, mkdir, etc.) on this laptop setup, where I’ve provided minimal access to my AI agents running outside of Docker. Nothing it asked for was intrusive, so it was granted permission to read files and make some supporting directories within the project space.

After some back-and-forth on input asking for permissions it crafted an execution strategy and implemented it.

While the actual tests did not function due to some minor coding errors within the test scripts themselves, the overall approach was far superior in my opinion. The Gemini 3 approach:

Store the primary test user AWS key, secret key, and region in an offline (private) environment file.
Within the test specification harness, employ the @aws-sdk/client-secrets-manager library to talk to AWS from inside the test runner itself.
Fetch the secret directly into the test server, loading the username and password list.
Iterate over that array, testing the login process.
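The fetch-inside-the-runner approach listed above can be sketched in TypeScript. To keep the example runnable without the AWS SDK installed, the AWS call is injected as a function; in a real runner it could wrap `SecretsManagerClient` and `GetSecretValueCommand` from @aws-sdk/client-secrets-manager and return the response’s `SecretString`. The names `SecretFetcher` and `loadLoginUsers`, the `LoginUser` fields, and the assumption that the secret is a JSON array are all hypothetical; this is not Gemini’s actual output.

```typescript
// Hypothetical shape of one credential entry (field names are assumptions).
interface LoginUser {
  username: string;
  password: string;
  level: string;
}

// Returns the raw secret string for a secret id, or undefined if there is
// none. A real implementation could wrap
//   client.send(new GetSecretValueCommand({ SecretId: secretId }))
// from @aws-sdk/client-secrets-manager and return its SecretString.
type SecretFetcher = (secretId: string) => Promise<string | undefined>;

// Load the user list from inside the test runner, with no host-side
// encoding step and no fake-data fallback.
async function loadLoginUsers(
  fetchSecret: SecretFetcher,
  secretId: string
): Promise<LoginUser[]> {
  const raw = await fetchSecret(secretId);
  if (!raw) {
    throw new Error(`Secret ${secretId} returned no SecretString`);
  }
  const parsed: unknown = JSON.parse(raw);
  if (!Array.isArray(parsed) || parsed.length === 0) {
    // Fail loudly instead of iterating over users that do not exist.
    throw new Error(`Secret ${secretId} did not contain a non-empty user array`);
  }
  return parsed as LoginUser[];
}
```

Injecting the fetcher also lets the parsing and validation logic be exercised with a stub, so the only part that needs real AWS credentials is the thin wrapper around the SDK call.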

This solution is far more resilient, less prone to leaking sensitive data, and far less fragile than the GPT 5.X solutions that were presented.

Don’t get me wrong, Gemini 3 Flash Preview is not great either. It updated the wrong package.json for loading modules into the Playwright runtime environment. It created a fake data set as a fallback, wasting CPU and other resources iterating over a user list that will never exist on the app we are testing. Gemini 3 also created a bunch of guesses at HTML locators, making up identifiers versus scanning the application code for REAL WORLD identifiers (another AI “hallucination”). It DID write some comments, which is far better than any GPT 5.X version we use today.

Summary

Granted, this Playwright testing environment is not very common. As such there is limited documentation online about how to run Playwright in a self-contained environment (Docker container) that runs test specifications from a separate code repository that are mounted into the runtime container. While it does create a clean environment for running tests that is separate from the tests themselves, it is NOT something most “just get it done” shops will use. Most shops package their test scripts and runtime environment in a single package run directly on the host (laptop) with no regard for portability or stability of the environment. As such, AI has almost no training and very few design patterns to follow.

AI is awful at figuring shit out if it is not provided with thousands of examples of how to do something. If AI cannot steal the work of thousands of other developers and “change the names to protect the innocent” then it tends to flail about hoping to find a solution that works.

If you are crafting applications and writing code that is just another iteration of the same crap everyone else has already done, AI does a good job of morphing other people’s work into your own special flavor of application. If you are doing something unique or something that requires deeper complex reasoning, AI is either going to cost you a fortune using the latest agentic AI technology with insane iterative processing or you are going to need to do a lot of babysitting and patching things up.

Every so often an AI agent will surprise me by doing something useful, complex, and functional in fairly short order. Sometimes AI will solve a coding problem in hours where it would have taken me days. Most of the time, however, AI wastes a lot of time and causes a lot of aggravation as it barfs out subpar technical solutions.

For those that are worried AI is going to take over and nobody will have a job in a few years, I say “give it a minute”. I foresee some notable AI whiplash coming and companies are going to be hiring back a lot of people they fired as they try to fix the mess AI made while burning down the planet and using up all the water in the process.
