When generative AI (GenAI) tools were first introduced, they were seen primarily as useful for generating textual first drafts and graphics, but less useful for the quantitative work done by most accountants. Many accountants breathed a collective sigh of relief when OpenAI’s GenAI tool ChatGPT 3.5 failed to pass a simulated CPA (Certified Public Accountant) exam. Once ChatGPT 4 was released, however, a follow-up study showed that it could pass a simulated CPA exam as well as the CMA® (Certified Management Accountant), Certified Internal Auditor (CIA), and Enrolled Agent (EA) exams.

As we await the upcoming release of the next version of ChatGPT—ChatGPT 5—it’s useful to explore the strengths and weaknesses of GenAI and examine the training techniques used to improve its accounting exam accuracy.

AI Terminology Glossary

Generative AI: a type of AI technology that allows users to formulate a written or spoken query in a human language and receive a response in the form of text, audio, images, or computer code.

Large language model (LLM): a type of AI model that has been trained on large amounts of data and is able to understand questions and produce responses using human languages.

Hallucinations: plausible-sounding but incorrect responses generated by AI.

Temperature: a GenAI setting that controls how much randomness is used when generating a response, on a scale from most precise (lowest) to most creative (highest).

Zero-shot training: using a GenAI tool as is, without providing any examples or additional training to the AI model.

Few-shot training: using a GenAI tool after providing it with a small number of examples or limited additional training.

Training Tools

Out-of-the-box GenAI tools come pre-trained; that is, they have been fed large volumes of documents or text to create their large language model (LLM). These LLMs answer questions (prompts) by prediction: they generate a response one word at a time, each time choosing a likely next word given the question asked and the text produced so far. We don’t know exactly what information the base LLM has been pre-trained with, but, based on the documented depth of pre-training for LLMs, we can infer that some accounting knowledge has been sourced from existing internet sources and/or accounting textbooks.

It’s important to note that neither ChatGPT 3.5 nor ChatGPT 4 was able to pass any of the professional exams straight out of the box. In the terminology of LLMs, this is what is meant by zero-shot training: no training of the model has been done prior to use. According to the study, the average zero-shot training scores for ChatGPT 4 on both the CPA and CMA exams were less than 70%, although 82% of the CPA Auditing and Attestation section questions were answered correctly in this testing scenario. It should also be noted that the questions were taken from vendor exam preparation guides and that only multiple-choice questions without images embedded in their text were used. Consequently, no working problems that might be closer to real-world scenarios were included.
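In practice, the difference between a zero-shot and a few-shot setup often comes down to what is placed in front of the question. The sketch below is one minimal way to picture it, not the study’s actual procedure: the function names and both sample questions are hypothetical and were written for this illustration only.

```python
def format_question(stem, choices):
    """Render one multiple-choice question as plain text."""
    lines = [stem]
    for label, text in choices:
        lines.append(f"{label}. {text}")
    return "\n".join(lines)


def build_prompt(question, solved_examples=()):
    """Build a prompt; with no solved examples it is a zero-shot prompt."""
    parts = []
    for example_question, answer in solved_examples:
        # Each solved example shows the model the expected answer format.
        parts.append(f"{example_question}\nAnswer: {answer}")
    # The new question is left open for the model to complete.
    parts.append(f"{question}\nAnswer:")
    return "\n\n".join(parts)


# Hypothetical questions invented for this sketch, not taken from any exam.
new_q = format_question(
    "Which financial statement reports revenues and expenses for a period?",
    [("A", "Balance sheet"), ("B", "Income statement"),
     ("C", "Statement of cash flows")],
)
solved_q = format_question(
    "Which account normally carries a credit balance?",
    [("A", "Cash"), ("B", "Accounts payable"), ("C", "Inventory")],
)

zero_shot_prompt = build_prompt(new_q)
few_shot_prompt = build_prompt(new_q, [(solved_q, "B")])
print(few_shot_prompt)
```

The only difference between the two prompts is the solved example placed ahead of the new question; the underlying model is the same in both cases.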

In the next iteration, few-shot training was used, in which the model was trained with 10 random multiple-choice problems and the ChatGPT temperature was set to zero. Setting the temperature to zero can be likened to an extreme version of selecting the more precise conversation style in Microsoft Copilot, as opposed to more creative or more balanced. By default, ChatGPT attempts to return a balanced response, so it usually generates different answers each time even when queried with the same question. Setting the ChatGPT temperature to zero significantly reduces the randomness of responses and increases the probability of receiving consistent answers to the same question. Few-shot training with 10 sample questions was applied across the four exams; on the CMA exam, scores on both parts improved by 6.6% on average, with passing scores averaging approximately 72%.
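To see why a temperature of zero makes answers more consistent, it helps to look at how temperature is commonly described as working: the model’s raw scores for each candidate word are divided by the temperature before being converted to probabilities. The sketch below follows that textbook description and is not a view into ChatGPT’s internals; the scores for answer choices A through D are hypothetical.

```python
import math


def token_probabilities(logits, temperature):
    """Convert raw model scores into probabilities at a given temperature."""
    if temperature <= 0:
        # Treat zero as "always pick the top-scoring option" (greedy choice).
        best = max(logits, key=logits.get)
        return {token: 1.0 if token == best else 0.0 for token in logits}
    scaled = {token: score / temperature for token, score in logits.items()}
    highest = max(scaled.values())  # subtract the max for numerical stability
    weights = {token: math.exp(value - highest)
               for token, value in scaled.items()}
    total = sum(weights.values())
    return {token: weight / total for token, weight in weights.items()}


# Hypothetical scores for answer choices A-D, invented for illustration.
scores = {"A": 2.0, "B": 1.5, "C": 0.5, "D": 0.1}

for temperature in (1.0, 0.2, 0.0):
    probs = token_probabilities(scores, temperature)
    print(temperature, {token: round(p, 3) for token, p in probs.items()})
```

As the temperature falls, more of the probability concentrates on the top-scoring choice; at zero the same choice is selected every time, which is why repeated queries tend to return the same answer.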

Finally, additional training was done on the LLM using the ReAct training technique. The “Re” in ReAct stands for reasoning. By taking larger problems and interacting with the model to break the problem into smaller logical steps, the LLM learns how to reason and solve problems that it couldn’t before. The “Act” in ReAct stands for acting. At certain points in the smaller logical steps, actions may be needed to obtain intermediate data, results, or information to pass to the next step. Most GenAI tools offer the option to interact with application programming interfaces (APIs). Thus, external interfaces may be built in to interact with external tools such as a Google search for current exchange rates or a calculator tool to convert U.S. dollars to euros. Using ReAct has been shown to be an effective way to improve the accuracy of results and reduce hallucinations, or incorrect results. By using the ReAct training technique on top of few-shot training, ChatGPT 4 was able to pass all four of the tested professional exams.
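The reason-then-act cycle described above can be sketched as a simple loop. This is a minimal illustration under stated assumptions, not the study’s implementation: the LLM is replaced by a scripted stand-in (`scripted_model`), the two tools and their names are hypothetical, and the exchange rate is hard-coded rather than looked up.

```python
import re

# Hypothetical exchange rate, hard-coded so the sketch runs offline.
USD_TO_EUR = 0.90


def lookup_rate(pair):
    """Stand-in for an external lookup such as a search or rates API."""
    if pair != "USD/EUR":
        raise ValueError(f"no rate available for {pair}")
    return str(USD_TO_EUR)


def calculator(expression):
    """Multiply two numbers written as 'a * b'."""
    left, right = expression.split("*")
    return str(round(float(left) * float(right), 2))


TOOLS = {"lookup_rate": lookup_rate, "calculator": calculator}


def scripted_model(transcript):
    """Stand-in for the LLM: replies based on how many observations exist."""
    observations = re.findall(r"Observation: (.+)", transcript)
    if len(observations) == 0:
        return "Thought: I need the current rate.\nAction: lookup_rate[USD/EUR]"
    if len(observations) == 1:
        return ("Thought: Now multiply.\n"
                f"Action: calculator[1500 * {observations[0]}]")
    return f"Thought: I have the result.\nFinal Answer: {observations[1]} EUR"


def react(question, model, max_steps=5):
    """Alternate model reasoning with tool calls until a final answer appears."""
    transcript = f"Question: {question}"
    for _ in range(max_steps):
        reply = model(transcript)
        transcript += "\n" + reply
        if "Final Answer:" in reply:
            return reply.split("Final Answer:")[1].strip(), transcript
        match = re.search(r"Action: (\w+)\[(.+)\]", reply)
        if match is None:
            break
        tool_name, tool_input = match.groups()
        # Run the requested tool and feed its result back as an observation.
        transcript += "\nObservation: " + TOOLS[tool_name](tool_input)
    raise RuntimeError("no final answer within the step limit")


answer, trace = react("Convert 1,500 U.S. dollars to euros.", scripted_model)
print(trace)
```

The arithmetic is done by the calculator tool rather than by the model itself, which is the part of the pattern that helps reduce incorrect results on numerical questions.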

Evaluating the study from a more real-world perspective demonstrates the gap between the headlines and what it actually takes to develop an accounting GenAI model that can assist us day to day. While some day-to-day questions may be multiple choice, most are more likely to resemble the word problems that weren’t tested by this study. Vanilla, out-of-the-box solutions still fall short of successful exam candidate performance even on multiple-choice questions. Creating an advanced model takes an understanding of model training techniques and perhaps integration with external systems and tools. These things are all possible but take time and resources.

Taking on ChatGPT

If you’re still ready to take on the challenge, here are some basic tips. The first step in delivering good results with any GenAI tool is to set the context as clearly as possible. Start by identifying who you are (e.g., I’m a management accountant in the United States working for a large global life insurance company tasked with…). Then state specifically what you want the tool to do (e.g., please create a loan amortization schedule based on loan amount, life, annual interest rate, and frequency of loan repayments). It may take rephrasing the question or asking follow-up questions to get the best response from any model, especially straight out of the box.
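One way to review a loan amortization schedule returned by a GenAI tool is to compute it independently. The sketch below uses the standard level-payment formula, payment = P × r / (1 − (1 + r)^−n), where r is the periodic rate and n the number of payments. The function name and the sample loan are hypothetical, and the sketch assumes a fixed rate, level payments, and a whole number of years.

```python
def amortization_schedule(principal, annual_rate, years, payments_per_year=12):
    """Return rows of (period, payment, interest, principal_paid, balance)."""
    n = years * payments_per_year          # total number of payments
    r = annual_rate / payments_per_year    # periodic interest rate
    if r == 0:
        payment = principal / n
    else:
        payment = principal * r / (1 - (1 + r) ** -n)

    rows = []
    balance = principal
    for period in range(1, n + 1):
        interest = balance * r
        principal_paid = payment - interest
        balance -= principal_paid
        # Round only for display; the running balance stays unrounded.
        rows.append((period, round(payment, 2), round(interest, 2),
                     round(principal_paid, 2), round(max(balance, 0.0), 2)))
    return rows


# Hypothetical loan: 10,000 at 6% annual interest, repaid monthly over 1 year.
schedule = amortization_schedule(10_000, 0.06, 1)
for row in schedule:
    print(row)
```

Comparing the tool’s figures with an independent calculation like this one is a simple form of the human review that any GenAI output should receive before it is shared.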

There may be models that have been pre-trained with data more specific to your needs. For example, ChatGPT now offers a store that allows you to choose more specific models that others have built. Searching the store for “accounting” already returns many models built for accounting purposes. However, be sure to check reviews, because anyone with a ChatGPT Plus subscription can post their model to the store.

Finally, customizing your model by importing your corporate documentation and/or interacting with internal and external tools can greatly improve your model’s value to you personally. However, it’s important to understand how your model is using this data and to make sure that proprietary information isn’t exposed to those who shouldn’t have access (e.g., don’t train your model with proprietary documents and release your updated model to a GenAI store for the world to access). Read the fine print on how your GenAI tool uses any data you supply.

By examining this study, we can see that GenAI tools, and ChatGPT in particular, continue to advance. There are many excellent uses for these tools out of the box, such as idea generation, summarization of large documents, and creating first drafts of reports, assuming that a further layer of human review and critical thinking is applied before sharing the end product.

However, it’s evident from this study of accounting exam question proficiency that it takes additional skill, knowledge, and effort to elevate these tools to the level needed for many accounting functions today. As GenAI matures, directed accounting models are being developed that improve the base functionality available to management accountants. Nevertheless, management accountants need to continue to skeptically examine overly broad claims for these models and engage in the personal learning required to successfully train and interact with these models in order to truly enhance their productivity.

About the Authors