Precision is everything in accounting and finance, but that tenet is being tested by the growing use of generative AI (GenAI). One big risk is AI hallucination, where AI systems generate incorrect or misleading information that can look entirely plausible. Mitigating that risk will become paramount as financial professionals increasingly rely on AI for data analysis, forecasting, and decision making. So, let's explore how hallucinations happen, what you can do to reduce the risk, and how to ensure that using GenAI won't compromise accuracy and reliability.

 

What Is AI Hallucination?

 

AI hallucination can occur across various GenAI applications, including natural language processing (NLP) based on large language models (LLMs) and image generation. GenAI models predict outputs based on patterns learned from training data. When a model hallucinates, the content it generates might seem plausible even though it's actually nonsensical or factually incorrect. Unfortunately, AI hallucinations can be difficult for many users to detect.

 

Sometimes hallucination can be helpful for creative work and generating new ideas. One example is developing mnemonics for students to use when learning accounting (see "ChatGPT and AI in Accounting Education and Research"). Another is finding a creative way to write code or completing a task in a novel way. In these examples, the goal is to do something different rather than replicate the past.

 

However, for many business tasks, GenAI output must be grounded in facts or existing knowledge, and hallucination can lead to inaccuracies that result in faulty decision making, damaged reputations, regulatory penalties, and lost opportunities. Take the now-infamous case of a legal brief submitted to a federal judge that cited six fake cases: ChatGPT, which generated the text, fabricated cases that looked relevant and plausible enough to support the brief. Similarly, a 2023 study published in the Cureus Journal of Medical Science found that only 7% of the medical article references ChatGPT provided were authentic and accurate.

 

Ensuring AI reliability and accuracy is crucial to maintaining the integrity of business operations and services. AI hallucinations can spread misinformation, causing confusion, poor decisions, and real harm. And that's a major problem for any industry or organization leaning into AI capabilities.

 

Needless to say, hallucinations can be especially damaging in accounting and finance applications because they can impact the precision and accuracy of data and conflict with standards set by centralized authorities. For example, financial statements must be prepared in accordance with U.S. Generally Accepted Accounting Principles (GAAP) or International Financial Reporting Standards (IFRS), tax planning and filings must adhere to tax laws and opinions, and audits must follow Generally Accepted Auditing Standards (GAAS). Any GenAI output not grounded in these rules is unreliable. Similarly, most managerial accounting analyses must be grounded in accurate historical financial data.

 

From prompt engineering to retrieval-augmented generation (RAG), we’ll delve into strategies that harness the power of GenAI while safeguarding against potential pitfalls.

 

How Users Can Mitigate Hallucination Risk

 

GenAI tools already offer several ways to mitigate hallucination risk. However, users aren’t always familiar with the best methods for interacting with LLMs, despite GenAI collaboration quickly becoming a must-have skill. The following approaches can help users get ahead of hallucination risk.

 

Setting the precision parameters. Some LLM chatbots allow users to toggle settings or specify a preference between creativity and precision to tailor responses as needed. In creativity-focused settings, GenAI is more likely to produce original, imaginative, or unconventional responses, content ideal for brainstorming or creative writing tasks. When precision is prioritized, GenAI instead favors a technical writing style that emphasizes factual accuracy and detail-oriented responses.

 

To mitigate the risk of hallucinations, especially when factual accuracy is crucial, users can choose the LLM’s precision mode. This instructs the GenAI to rely more on verified information and logical reasoning to help it prioritize accuracy and reliability over creativity. By selecting precision, users guide the GenAI to use data and facts it has been trained on, thereby reducing responses based on less reliable, synthesized, or imaginative content. Accounting and finance professionals can particularly benefit from this choice.
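
Under the hood, these creativity and precision settings generally map to sampling parameters such as temperature. For professionals who reach an LLM through an API rather than a chat window, here is a minimal sketch, assuming the OpenAI Python SDK, an API key in the environment, and an illustrative model name, of requesting more conservative output by lowering the temperature.

# Minimal sketch: lowering the sampling temperature to favor conservative,
# repeatable answers. Assumes the openai package is installed and the
# OPENAI_API_KEY environment variable is set; the model name is illustrative.
from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    model="gpt-4o",   # illustrative model name
    temperature=0.1,  # low temperature reduces randomness in word choice
    messages=[
        {"role": "system",
         "content": "You are a careful accounting assistant. Answer only from "
                    "established standards and say so when you are unsure."},
        {"role": "user",
         "content": "Summarize the lessee classification criteria under ASC 842."},
    ],
)

print(response.choices[0].message.content)

Keep in mind that a low temperature makes responses more predictable, not necessarily more truthful, so the verification habits described below still apply.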

 

Prompt engineering. Simply asking for accuracy while prompting an LLM can decrease hallucination risk. Researchers at Johns Hopkins University found that "according-to" prompts significantly reduced hallucination. For example, adding "according to IFRS" or "according to the FASB Accounting Standards Codification" for financial reporting tasks, or "according to the U.S. tax code" for tax planning or preparation tasks, can help ground the LLM's output in the relevant information sources.

 

When a specific source isn't known to the user, including phrases like "based on verified information" or "provide sources for your claims" can help steer the model toward accuracy and reliability. Structuring prompts to ask for data-supported answers can also encourage the model to stick to known facts. These approaches reduce the chance of generating misleading or incorrect information.
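
To make this concrete, here is a small, hedged sketch of how these grounding phrases might be combined with a question before it's submitted to an LLM; the wording is illustrative, not a tested formula.

# Illustrative grounding prompts only; the exact phrasing is an assumption.
question = "How should a company account for a five-year equipment lease?"

# "According-to" grounding, tied to an authoritative source
according_to_prompt = (
    "According to the FASB Accounting Standards Codification, "
    "how should a company account for a five-year equipment lease? "
    "Cite the specific guidance you rely on."
)

# Source-requesting variant for when no single source is known
verified_sources_prompt = (
    question + " Base your answer only on verified information and "
    "provide sources for each claim."
)

Either prompt can then be submitted to the model exactly as in the temperature example above.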

 

Checking the references provided. Another way users can mitigate the risk of hallucinations is by checking the references an LLM provides. When a model like ChatGPT cites sources, users should follow these steps (a simple scripted spot-check, sketched after the list, can help with the first pass):

1. Verify the sources: Check the credibility and relevance of any sources cited to ensure the information is accurate and up to date.

2. Cross-reference information: Look for additional reputable sources that report consistent information to confirm that what the model provided is reliable.

3. Consider the context: Understand the context in which the information was presented to ensure that it's applied correctly in your situation.
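
For the first two steps, a short script can provide a quick first pass by confirming that cited links at least resolve; a working link still needs a human read to confirm it supports the claim. Here is a minimal sketch, assuming the requests package and illustrative URLs.

# First-pass check on chatbot-cited URLs: a link that fails to resolve is a
# strong hallucination signal; a link that resolves still requires review.
# Assumes the requests package; the URLs are illustrative.
import requests

cited_urls = [
    "https://www.ifrs.org/issued-standards/list-of-standards/ifrs-16-leases/",
    "https://www.example.com/made-up-citation",
]

for url in cited_urls:
    try:
        status = requests.head(url, allow_redirects=True, timeout=10).status_code
        print(f"{url} -> HTTP {status}")
    except requests.RequestException as exc:
        print(f"{url} -> could not be reached ({exc})")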

 

Uploading documents and files. Users can significantly reduce the risk of hallucinations by using the document and file upload feature offered by most LLMs to provide specific factual context. Uploading documents with detailed background information or specific data allows the GenAI to tailor its responses based on the provided context, enhancing accuracy. For example, while doing tax planning or preparation, users can upload a document containing a portion of the Internal Revenue Code (IRC) to ensure the LLM uses accurate information. Similarly, they might upload a white paper released by an accounting firm that contains information about implementing a new accounting standard when deciding what accounting method is appropriate for a specific transaction. However, take note that any proprietary data should only be uploaded to an LLM hosted in a secure, private environment. Additionally, most LLMs have limits on the number and size of files that can be uploaded. For example, OpenAI limits users to 20 files that are at most 512 MB each.
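
For teams that work with an LLM through an API rather than a chat window, the same grounding effect can be approximated by passing an excerpt of the relevant document directly into the prompt. Below is a minimal sketch, assuming the OpenAI Python SDK, an illustrative model name, and a hypothetical local text file containing the relevant IRC excerpt.

# Grounding a response in a specific document by supplying its text as
# context. The file name and model name are illustrative; assumes the openai
# package and an OPENAI_API_KEY environment variable.
from openai import OpenAI

client = OpenAI()

with open("irc_section_179_excerpt.txt", "r", encoding="utf-8") as f:
    irc_excerpt = f.read()

response = client.chat.completions.create(
    model="gpt-4o",  # illustrative
    temperature=0.1,
    messages=[
        {"role": "system",
         "content": "Answer using only the provided excerpt. If the excerpt "
                    "does not cover the question, say so instead of guessing."},
        {"role": "user",
         "content": f"Excerpt:\n{irc_excerpt}\n\nQuestion: What expensing "
                    "limit does this excerpt describe?"},
    ],
)

print(response.choices[0].message.content)

Telling the model to decline when the excerpt doesn't cover the question is a simple but effective guard against it filling gaps with invented detail.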

 

How Organizations Can Help Mitigate Hallucination Risk

 

Other methods for reducing hallucination risk require an organizational investment in developing LLM technology. Once that infrastructure is in place, users can interact with the technology in a more trustworthy way. Here are two organizational-level solutions that can mitigate hallucination risk.

 

Fine-tuning. This is a process where an AI model, pre-trained on a large data set, is further trained on a smaller, specialized data set, adapting it to specific tasks or improving its performance in certain areas. Fine-tuning can reduce hallucinations because the additional training is targeted and domain-specific. When the model is fine-tuned on high-quality, accurate, and relevant data, it learns to produce outputs more aligned with the factual nuances of that domain, helping to mitigate the generation of inaccurate or irrelevant information. Additionally, for applications in critical domains like medicine, law, or finance, fine-tuning with domain-specific data helps ensure the model understands and adheres to the factual standards required, making it more reliable and accurate.
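
To illustrate the mechanics, here is a hedged sketch of launching a fine-tuning job through the OpenAI API; other providers expose similar workflows. It assumes a prepared JSONL file of chat-formatted examples (a data-format sketch appears later in this section), and the file and model names are illustrative.

# Hedged sketch of starting a fine-tuning job with the OpenAI API. Assumes
# accounting_examples.jsonl already contains chat-formatted training examples;
# the file and model names are illustrative.
from openai import OpenAI

client = OpenAI()

training_file = client.files.create(
    file=open("accounting_examples.jsonl", "rb"),
    purpose="fine-tune",
)

job = client.fine_tuning.jobs.create(
    training_file=training_file.id,
    model="gpt-4o-mini",  # illustrative; use a model your provider allows for fine-tuning
)

print(job.id, job.status)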

 

An example of a fine-tuned AI model in accounting is ChatCPA. The platform is a GenAI model that was fine-tuned specifically for tasks relevant to CPAs (Certified Public Accountants) by training it further on a data set composed of accounting principles, tax regulations, and financial reporting standards. This specialized training equips the chatbot to accurately interpret and apply complex accounting terminology and regulations, making it adept at assisting with tax preparation, financial analysis, and compliance checks. The result is a highly specialized tool designed for the precise needs of CPAs, offering services that reflect the latest accounting industry standards.

 

Organizations can fine-tune an LLM on any text they want the model to draw on when generating output, including internally generated documents (e.g., emails or training materials) or external sources of information (e.g., the IRC), once that text has been converted into a suitable training format.
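
For illustration, the snippet below converts a made-up internal policy Q&A into the chat-formatted JSONL commonly used for fine-tuning; the policy details and field layout are illustrative assumptions.

# Illustrative sketch: converting internal Q&A material into chat-formatted
# fine-tuning examples, one JSON object per line. The policy details are
# invented for illustration.
import json

examples = [
    {
        "messages": [
            {"role": "system",
             "content": "You are the firm's internal accounting policy assistant."},
            {"role": "user",
             "content": "What is our capitalization threshold for equipment purchases?"},
            {"role": "assistant",
             "content": "Per the firm's fixed-asset policy, equipment purchases of "
                        "$5,000 or more are capitalized; smaller purchases are expensed."},
        ]
    },
]

with open("accounting_examples.jsonl", "w", encoding="utf-8") as f:
    for example in examples:
        f.write(json.dumps(example) + "\n")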

 

RAG. This combines the power of information retrieval with natural language generation to enhance an LLM's output quality. It works by first retrieving relevant documents or data from a knowledge base using the input query. Then the AI uses this retrieved information to generate responses that are more informed, accurate, and contextually relevant. (For a more technical explanation of RAG, see "What are AI hallucinations & how to mitigate them in LLMs.") Essentially, RAG transforms text into ordered lists of numbers (i.e., embedding vectors) that the system can compare very quickly to judge relevance, allowing the AI to search and retrieve information before responding. Not only does this give the AI access to information that isn't in its training data, but it also constrains the information source the AI uses when responding to users' prompts.
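
To make those mechanics concrete, here is a minimal, hedged RAG sketch: it embeds a few passages, retrieves the one most similar to the user's question, and asks the model to answer only from that passage. It assumes the OpenAI Python SDK and numpy; the model names and passages are illustrative, and a production system would use a proper vector database over a much larger document set.

# Minimal RAG sketch: embed passages, retrieve the closest match to the
# question by cosine similarity, and ground the answer in it. Assumes the
# openai and numpy packages; model names and passages are illustrative.
import numpy as np
from openai import OpenAI

client = OpenAI()

passages = [
    "ASC 842 requires lessees to recognize right-of-use assets and lease "
    "liabilities for most leases longer than 12 months.",
    "The company's travel policy reimburses economy airfare and standard lodging.",
]

def embed(texts):
    resp = client.embeddings.create(model="text-embedding-3-small", input=texts)
    return np.array([item.embedding for item in resp.data])

passage_vectors = embed(passages)

question = "How should we account for the new five-year office lease?"
question_vector = embed([question])[0]

# Cosine similarity between the question and each stored passage
scores = passage_vectors @ question_vector / (
    np.linalg.norm(passage_vectors, axis=1) * np.linalg.norm(question_vector)
)
best_passage = passages[int(np.argmax(scores))]

answer = client.chat.completions.create(
    model="gpt-4o",  # illustrative
    temperature=0.1,
    messages=[
        {"role": "system", "content": "Answer only from the provided context."},
        {"role": "user",
         "content": f"Context: {best_passage}\n\nQuestion: {question}"},
    ],
)

print(answer.choices[0].message.content)

Because the model sees only the retrieved passage, its answer is constrained to that source, which is exactly the behavior described above.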

Although creating and implementing a RAG system requires some technical knowledge, it's easier to do than fine-tuning. Retrieving information before answering significantly improves GenAI's ability to provide detailed and accurate answers. For instance, research published by Cornell University found that a RAG system can reduce hallucination when using an LLM to extract information from earnings report transcripts.

 

In the context of managerial accounting, a RAG system can reduce hallucinations by pulling in current, relevant financial data, regulations, and industry standards from a large database before generating its response. When tasked with queries about cost estimation, budget planning, or financial analysis, the RAG system first retrieves the most up-to-date and pertinent information. Then it synthesizes this information to provide answers that aren't only grounded in current practices and data but also tailored to the specific context of the query. This process makes the generated responses far more likely to be accurate, reducing the likelihood of incorrect or misleading information based on outdated knowledge or misconceptions. Ongoing research suggests that RAG can be combined with fine-tuning for a potentially synergistic effect.

 

As the adoption of GenAI in accounting and finance accelerates, it's natural for professionals to marvel at its remarkable capabilities. But AI hallucination remains a prominent hurdle that threatens the precision and reliability of financial decision making, forecasting, and reporting. By adopting a multifaceted approach to mitigating these risks, businesses can harness the benefits of GenAI. Combining the user-level and organizational methods proposed here can help businesses ensure that they're making safe, strategic choices without compromising accuracy and trust.

About the Authors