SmartFlow
Robotic Process Automation using LLMs
Arushi Jain, Shubham Paliwal, Monika Sharma, Lovekesh Vig, Gautam Shroff

Overview

Robotic Process Automation (RPA) systems face challenges in handling complex processes and diverse screen layouts that require advanced human-like decision-making capabilities. These systems typically rely on pixel-level encoding through drag-and-drop or automation frameworks such as Selenium to create navigation workflows, rather than on visual understanding of screen elements. In this context, we present SmartFlow, an AI-based RPA system that uses large language models (LLMs) coupled with deep-learning-based image understanding. Our system can adapt to new scenarios, including changes in the user interface and variations in input data, without the need for human intervention. SmartFlow uses computer vision and natural language processing to perceive visible elements on the graphical user interface (GUI) and convert them into a textual representation. This information is then utilized by LLMs to generate a sequence of actions that are executed by a scripting engine to complete an assigned task. To assess the effectiveness of SmartFlow, we have developed a dataset that includes a set of generic enterprise applications with diverse layouts, which we are releasing for research use. Our evaluations on this dataset demonstrate that SmartFlow exhibits robustness across different layouts and applications. SmartFlow can automate a wide range of business processes such as form filling, customer service, invoice processing, and back-office operations, and can thus help organizations enhance productivity by automating an even larger fraction of screen-based workflows.

Proposed System Design

We begin by identifying the various task-execution responsibilities that are distributed among four different user classes, namely the End-user, Information Validation System, Admin, and the SmartFlow.

  • End-user: provides all the necessary information, such as task request data, via email or chat-bot to the application.
  • Information Validation System (IVS): ensures that the task-request received from the end-user includes all the required information for filling in the data fields necessary to complete the task in the application and adds the task-request to the incoming task directory.
  • Admin: is responsible for setting up and configuring SmartFlow for a specific application. This involves providing meta-data such as the website URL and HTML source code for all its pages. The Admin also performs layout mapping, which associates visible field names on the application screen with their respective edit-fields and data-hints. We propose two vision-based methods for automatic layout mapping, which are validated by the Admin.
  • SmartFlow API: is responsible for sequentially handling task requests from the incoming requests directory. Upon completion, the API returns the task's output status (e.g., success, failure, or errors) to the task status directory.
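To make the SmartFlow API's role concrete, the sketch below shows how sequential handling of the incoming task directory and the status queue could look. This is a minimal illustration, not the released implementation: the directory layout, JSON request format, and the `run_task` callback are assumptions for the example.

```python
import json
from pathlib import Path

def process_incoming(incoming_dir: Path, status_dir: Path, run_task) -> int:
    """Sequentially handle task requests from the incoming directory and
    write each task's outcome (success/failure) to the status directory."""
    status_dir.mkdir(parents=True, exist_ok=True)
    handled = 0
    for request_file in sorted(incoming_dir.glob("*.json")):
        request = json.loads(request_file.read_text())
        try:
            run_task(request)  # execute the navigation workflow for this request
            status = {"id": request_file.stem, "status": "success"}
        except Exception as exc:  # surface failures to the status queue
            status = {"id": request_file.stem, "status": "failure", "error": str(exc)}
        (status_dir / f"{request_file.stem}.status.json").write_text(json.dumps(status))
        handled += 1
    return handled
```

Writing one status file per request keeps task outcomes independently auditable, matching the API's contract of returning success, failure, or error per task.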

SmartFlow Algorithm

The SmartFlow algorithm processes requests by sequentially extracting and handling them from the incoming request queue, as follows.

  • Pre-processing: It cleans the HTML source code provided in the input metadata (application URL, HTML source code and layout mapping), ensuring it meets the size limit of Large Language Models. This involves removing unnecessary attributes and classes that could hinder LLM processing.
  • Form Elements Extraction: The cleaned HTML source code is input into Large Language Models (LLMs) like GPT-3 and ChatGPT to extract field names and types, using the prompt shown in the figure above. Meanwhile, a screenshot image of the application is captured, and text-regions are extracted using EasyOCR. The extracted information is merged with the layout mapping to create a Mapping List, which includes field names, types, and coordinates.
  • Navigation Workflow Generation: In this step, the Mapping List and task-request are given as input to the LLM with a prompt to generate PyAutoGUI code. This code determines the sequence of actions, including clicking on the correct form-field, to complete the task-request accurately. Precision is crucial to avoid incorrect form submission. The algorithm executes micro-level steps with high accuracy and handles different field types such as date pickers, dropdown menus, upload buttons, radio buttons, and checkboxes using vision-based algorithms invoked by the LLM.
  • Handling Multi-page Form Submission: After executing the navigation workflow, SmartFlow captures another screenshot to handle multi-page forms effectively. By leveraging visual cues from the website's layout, it recognizes the continuation of the form and sequentially processes the user's requests to fill in any remaining fields.
  • Determining the status of executed task-requests: SmartFlow employs a frame difference technique to extract feedback messages related to the success, failure, or errors encountered during form submission. These messages, obtained using a text-extractor, can address network connectivity, missing fields, or successful submissions. By logging these messages into a status queue, SmartFlow facilitates analysis and improves the user experience.
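The pre-processing step above can be sketched as a small HTML-cleaning pass. This is a simplified, hedged illustration: the attribute list and regex-based stripping are assumptions for the example (a production system would likely use a proper HTML parser), but it conveys the idea of shedding presentational noise so the page source fits an LLM's context window.

```python
import re

# Attributes that rarely affect form semantics but inflate token counts
# (an illustrative list, not SmartFlow's exact configuration).
_NOISE_ATTRS = r'\s+(?:class|style|onclick|data-[\w-]+)="[^"]*"'

def clean_html(html: str) -> str:
    """Strip comments and presentational attributes so the HTML source
    stays within the size limit of a Large Language Model."""
    html = re.sub(r"<!--.*?-->", "", html, flags=re.DOTALL)  # drop HTML comments
    html = re.sub(_NOISE_ATTRS, "", html)                    # drop noisy attributes
    return re.sub(r"\s+", " ", html).strip()                 # collapse whitespace
```

Semantic attributes such as `id`, `name`, and `type` are kept, since the LLM relies on them to identify form fields.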
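The form-elements extraction step merges two sources of information: field names and types from the LLM, and text-region coordinates from EasyOCR. A minimal sketch of that merge, assuming simple substring matching between field labels and OCR text (the real system uses the Admin-validated layout mapping with data-hints):

```python
def build_mapping_list(llm_fields, ocr_regions):
    """Merge LLM-extracted form fields with OCR text regions into a
    Mapping List of field name, type, and screen coordinates.

    llm_fields:  list of (field_name, field_type) tuples from the LLM
    ocr_regions: list of (text, (x, y)) tuples from the text extractor
    """
    mapping = []
    for name, field_type in llm_fields:
        for text, (x, y) in ocr_regions:
            if name.lower() in text.lower():  # naive label match for illustration
                mapping.append({"field": name, "type": field_type, "coords": (x, y)})
                break
    return mapping
```

Each entry ties a semantic field (what to fill) to a screen location (where to act), which is exactly what the navigation-workflow step consumes.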
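The navigation-workflow step can be pictured as turning the Mapping List and a task request into an ordered action plan; an executor would then replay the plan with calls like `pyautogui.click(x, y)` and `pyautogui.typewrite(value)`. The planner below is a hedged sketch (the action-tuple format and field-type handling are assumptions for the example, and complex widgets like date pickers are omitted):

```python
def plan_actions(mapping, task_request):
    """Produce an ordered list of (action, ...) tuples that fill the form
    fields in the Mapping List with values from the task request."""
    actions = []
    for entry in mapping:
        value = task_request.get(entry["field"])
        if value is None:  # the request does not set this field
            continue
        x, y = entry["coords"]
        actions.append(("click", x, y))       # focus the form field
        if entry["type"] == "text":
            actions.append(("type", value))   # fill text-like fields
    return actions
```

Keeping the plan as data before execution makes the micro-level steps inspectable, which matters given that a single mis-click can cause an incorrect form submission.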
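The status-determination step rests on a frame difference between screenshots taken before and after submission. The sketch below shows only the diff: it finds the bounding box of changed pixels in two equal-sized frames, and that crop would then be passed to a text extractor such as EasyOCR to read the feedback message. Representing frames as 2D arrays of pixel values is a simplification for illustration.

```python
def changed_region(before, after):
    """Return the bounding box (top, left, bottom, right) of pixels that
    differ between two equal-sized frames, or None if nothing changed."""
    rows = [i for i, (r1, r2) in enumerate(zip(before, after)) if r1 != r2]
    if not rows:
        return None  # no feedback message appeared
    cols = [j for r1, r2 in zip(before, after)
            for j, (a, b) in enumerate(zip(r1, r2)) if a != b]
    return (min(rows), min(cols), max(rows) + 1, max(cols) + 1)
```

Restricting OCR to the changed region keeps the extracted text focused on the new feedback message (e.g., a success banner or a missing-field error) rather than the whole screen.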

Dataset Details

To evaluate the effectiveness of integrating vision and large language models (LLMs) in RPA systems, we created the RPA-Dataset. It consists of five web applications, each with five diverse layouts and five user-task requests. The RPA-Dataset includes the source HTML code of the applications, along with ground-truth annotations for tasks such as OCR (Optical Character Recognition), layout mapping, filling data fields, and handling complex fields like dropdowns, datepickers, and radio-buttons/checkboxes. The applications cover the following domains:

Demo Video