Large Language Models show great promise for automated software testing. They are particularly good at roleplay - one can write a prompt that has the LLM act as a regular customer, or as a malicious user trying to break the application. Rather than writing long integration tests that largely become change detectors, LLM-driven testing can actually operate as a test user and surface issues accordingly. However, when it comes to testing graphical user interfaces with multi-modal LLMs, the Vision Language Models are Blind paper demonstrates that vision models don't perceive images the way we do.
What I’ll demonstrate here is an alternative route to testing GUIs: rendering them as text so that LLMs can consume them directly. This should eliminate a lot of the ambiguity that comes with image processing – and it sets the stage for an even more powerful method of presenting a multi-modal LLM with both text and visual representations at the same time.
Here I interact with a simple counter application through an LLM that sends commands to the GUI:
The application is built on top of Iced, a GUI framework that implements the Elm architecture. This is a particularly good fit for this task, as the state of the application provides a single source of truth from which the view is always re-built. Updates to the state are done through message passing: a button click generates a message that the update method handles to mutate the state.
// The application state
struct Counter {
    value: i32,
    some_text: String,
}

// Messages that can update the state
#[derive(Debug, Clone)]
enum Message {
    IncrementPressed,
    DecrementPressed,
    SetText(String),
}

impl Counter {
    // The update function handles messages and mutates the state
    fn update(&mut self, message: Message) {
        match message {
            Message::IncrementPressed => self.value += 1,
            Message::DecrementPressed => self.value -= 1,
            Message::SetText(s) => self.some_text = s,
        }
    }

    fn view(&self) -> iced::Element<Message> {
        // Create the GUI based on the state.
    }

    fn text_view(&self) -> String {
        // Create the text representation of the UI based on the state.
    }
}
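As a concrete illustration, here is a minimal sketch of what text_view could produce if it formatted the state directly. This is only a sketch; the actual version in this post composes reusable widgets from the llm_widgets crate described below, and it omits the text input for brevity:

impl Counter {
    // Hand-rolled text rendering of the UI, built straight from the state.
    fn text_view(&self) -> String {
        format!(
            "[Button: \"Increment\" (ID: \"increment_button\", Status: enabled, Action: IncrementPressed)] \
             [Text: \"{}\"] \
             [Button: \"Decrement\" (ID: \"decrement_button\", Status: enabled, Action: DecrementPressed)]",
            self.value
        )
    }
}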
Driving this GUI remotely means creating an RPC server that receives messages, translates them into iced::Messages, and consumes them. To do this I created a Cap'n Proto schema that the UI implements:
enum UiMessage {
  incrementPressed @0;
  decrementPressed @1;
  textInputChanged @2;
}

struct WidgetAction {
  id @0 :Text;
  action @1 :UiMessage;
  textValue @2 :Text;
}

interface CounterService {
  getTextUi @0 () -> (uiText :Text);
  sendMessage @1 (message :WidgetAction) -> ();
}
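On the Rust side, the server's job is mostly mechanical: decode the incoming WidgetAction and turn it into the application's Message. The sketch below assumes the capnp-rpc plumbing already hands us plain Rust values; UiAction is a hand-written stand-in for the generated UiMessage type, not the real generated code:

// Stand-in for the Cap'n Proto UiMessage enum (the real project uses the
// type generated from the compiled schema).
enum UiAction {
    IncrementPressed,
    DecrementPressed,
    TextInputChanged,
}

// Translate a decoded WidgetAction into the application's Message. The `id`
// field is unused here because the counter has a single text input; a larger
// UI would use it to route the action to the right widget.
fn to_iced_message(action: UiAction, text_value: String) -> Message {
    match action {
        UiAction::IncrementPressed => Message::IncrementPressed,
        UiAction::DecrementPressed => Message::DecrementPressed,
        UiAction::TextInputChanged => Message::SetText(text_value),
    }
}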
To make the UI accessible to LLMs, I created an llm_widgets crate. These widgets wrap Iced's native widgets and add LLM-specific metadata. They have a similar interface to the native widgets, but capture all the values needed to print them as text. A Button has these methods to either represent itself as text or turn it into the native widget:
pub fn as_llm_text(&self) -> String {
    let status_text = if self.config.enabled { "enabled" } else { "disabled" };
    format!(
        "[Button: \"{}\" (ID: \"{}\", Status: {}, Action: {})]",
        self.config.label,
        self.config.id,
        status_text,
        self.config.action
    )
}

pub fn into_element(self) -> Element<'a, Message, Theme, Renderer> {
    let mut iced_button = iced::widget::Button::new(...);
    iced_button.into()
}
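The config those methods read from is not shown above; purely as a guess at its shape, it might hold something like the following, with action being the exact string the LLM is expected to echo back:

// Hypothetical shape of the button's LLM metadata; the real llm_widgets
// crate may organise this differently.
struct ButtonConfig {
    label: String,   // text shown on the button
    id: String,      // stable identifier, e.g. "increment_button"
    enabled: bool,   // rendered as Status: enabled/disabled
    action: String,  // the string the LLM must reply with, e.g. "IncrementPressed"
}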
With this in place, we can print out the state of the counter UI like this:
Current UI State: [Button: "Increment" (ID: "increment_button", Status: enabled, Action: IncrementPressed)] [Text: "5"] [Button: "Decrement" (ID: "decrement_button", Status: enabled, Action: DecrementPressed)] [TextInput: "cinq" (ID: "info_input", Placeholder: "Info", Status: enabled, OnInput: "Updates the info text", OnSubmit: "Submits the current input.")]
The final bit is instructing the LLM on how to drive the application. I used a template that looks like this:
You are an AI assistant interacting with a textual representation of a GUI.
Your goal is: {{user_task_description}}
The current UI state is represented as a string. The UI contains the following types of elements:
1. Buttons: Shown as [Button: "label_text" (ID: "widget_id", Status: enabled/disabled, Action: LLM_ACTION_STRING)]
2. Text Inputs: Shown as [TextInput: "current_value" (ID: "widget_id", Placeholder: "placeholder_text", Status: enabled/disabled, OnInput: "action_description", OnSubmit: "action_description")]
3. Text Displays: Shown as [Text: "displayed_text"]
Action History:
{{action_history}}
Current UI State:
{{current_ui_text}}
To interact with an enabled element, you must respond with its exact action string (either LLM_ACTION_STRING for buttons, or OnInput/OnSubmit action for text inputs).
If an action requires parameters, they will be provided in the action description.
If the goal is achieved, or you believe it cannot be achieved, or there are no suitable actions, respond with only the word "END".
IMPORTANT: Respond with ONLY the action string or "END". Do not include any JSON, markdown, or other formatting.
Examples of correct responses:
- IncrementPressed
- DecrementPressed
- Updates the info text: Hello World
- END
Do not include any explanations, quotes, or additional formatting in your response.
This could be improved by using function calling and/or developing a stricter grammar, but it worked for this case.
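One small step in that direction, sketched here rather than taken from the actual driver, is to reject any reply that does not correspond to an action string present in the current UI text:

// Accept only "END" or a reply that begins with one of the action strings
// extracted from the UI text (the Action:, OnInput: and OnSubmit: fields).
// The starts_with check allows parameterised replies such as
// "Updates the info text: Hello World".
fn validate_response<'a>(response: &'a str, allowed_actions: &[String]) -> Option<&'a str> {
    let response = response.trim();
    if response == "END" || allowed_actions.iter().any(|a| response.starts_with(a.as_str())) {
        Some(response)
    } else {
        None
    }
}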
A counter_driver application was built to let the user set a task, call an LLM, parse the response, and send the right message to the counter application. As you saw above, this means we can take arbitrary test actions, including ones that would be impossible to code directly, like asking for a translation at execution time.
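The core of that driver is a small loop. The sketch below assumes nothing about the LLM or RPC client beyond the closures passed in, and abbreviates the prompt template shown earlier:

// Drive the UI until the LLM replies "END". `get_ui_text` wraps the getTextUi
// RPC call, `send_action` maps the reply onto sendMessage, and `call_llm`
// sends the prompt to whichever model is being used.
fn drive_ui(
    task: &str,
    mut get_ui_text: impl FnMut() -> String,
    mut send_action: impl FnMut(&str),
    call_llm: impl Fn(&str) -> String,
) {
    let mut history: Vec<String> = Vec::new();
    loop {
        let ui_text = get_ui_text();
        let prompt = format!(
            "You are an AI assistant interacting with a textual representation of a GUI.\n\
             Your goal is: {task}\n\
             Action History:\n{}\n\
             Current UI State:\n{ui_text}\n\
             Respond with ONLY the action string or \"END\".",
            history.join("\n"),
        );
        let response = call_llm(&prompt).trim().to_string();
        if response == "END" {
            break;
        }
        send_action(&response);
        history.push(response);
    }
}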
While I did not implement it in this example, the next step is to start surfacing bug reports and other types of criticism that the LLM encounters as it works. For example, with suitable prompting we would expect an LLM to produce the following output for an improperly constructed counter example:
Perception:
There are buttons for + and -.
Tasking:
* Click the + button
Action:
IncrementPressed
// The UI changes, with the counter value going down.
Perception:
!!! Despite clicking +, the number went down.
Tasking:
* Try clicking the - Button instead.
Action:
DecrementPressed
// The UI changes, with the counter value going up.
Perception:
The number was incremented.
!!! The buttons do the opposite of what they are labeled as.
END
This feedback can be raised to the developer to fix, or even fed into LLM-based coding workflows.
This is extremely exciting to me, as it opens up the possibility of highly parallel, highly available, and high-coverage automated user testing.