Over the past year or so, the large language model (LLM) ChatGPT has demonstrated an uncanny ability to best humans at some of the tests that are cornerstones of our early professional lives.
It has passed all three parts of the notoriously difficult medical licensing exam, got through the bar exam that law school graduates must clear, and passed an MBA exam from the Wharton School of Business at the University of Pennsylvania.
The scores posted by the LLM were modest passing grades. But its later avatar, GPT-4, is supposedly an even better student than its parent, having sailed through the bar exam at the 90th percentile and earned near-perfect marks on the GRE Verbal test.
So, it must come as an immense source of both satisfaction and relief for us humans that there is at least one thing that LLMs like ChatGPT are not good at, and are in fact terrible at: accounting.
Many users of ChatGPT have commented publicly on how even the simplest math functions have stumped it. But beyond the anecdotes, there is a sizeable and rigorously executed study of ChatGPT's accounting capabilities, undertaken several months ago by Brigham Young University (BYU) professor of accounting David Wood.
Wood decided to harness the power of the global accounting fraternity via a pitch on social media, soliciting help to put ChatGPT through its paces on a global accounting exam of sorts.
There was a deluge of takers: 327 co-authors from 186 educational institutions in 14 countries participated in the study. They collectively pooled 25,181 classroom accounting exam questions, as well as 2,000-plus questions from Wood's own department at BYU, to pose to ChatGPT.
As is typical of a comprehensive accounting examination, the questions ranged across all the major topics, including financial accounting, auditing, managerial accounting, and tax, and came in different formats (multiple choice, short answer, true/false) and difficulty levels.
The results were unequivocal: ChatGPT clocked 47.4%, which, in and of itself, was not that bad. Students, however, scored an overall average of 76.7% and easily bested the machine.
According to Wood's paper, the LLM did fine on topics like auditing but had trouble getting its artificial neurons around tax, financial, and managerial assessment questions, the very sections that involved a lot of math.
A lot of people can't quite reconcile AI's inability to do even simple math with AI's fearsome reputation as a potential killer of humanity. Yet the fact is that ChatGPT is essentially a glorified predictive-text program: it has been fed vast amounts of data and then trained to identify right and wrong answers.
Its uncanny ability to spit out humanlike, conversational answers to questions comes from being built to understand the patterns inherent in language and the connections between words, not numbers. (This is why it is called a 'language' model.) The output of these LLMs hinges on probability, not accuracy: by design, it is architected to be the answer with the statistically highest probability for the question asked.
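To make "statistically highest probability" concrete, here is a minimal sketch in Python of how such a predictor picks its next word. The probability table is invented purely for illustration; a real LLM computes these distributions with a neural network over billions of learned parameters.

```python
# A toy next-word predictor. The probabilities below are made up for
# illustration; a real LLM derives them from a trained neural network.
NEXT_WORD_PROBS = {
    ("debits", "must", "equal"): {"credits": 0.92, "assets": 0.05, "7": 0.03},
    ("two", "plus", "two", "is"): {"four": 0.60, "equals": 0.25, "five": 0.15},
}

def predict_next_word(context: tuple) -> str:
    """Return the word with the statistically highest probability."""
    probs = NEXT_WORD_PROBS[context]
    return max(probs, key=probs.get)

print(predict_next_word(("debits", "must", "equal")))   # -> credits
print(predict_next_word(("two", "plus", "two", "is")))  # -> four
```

Note that even the arithmetic answer here is simply the likeliest word given the context, not the result of a calculation, which is how fluent prose and wrong sums can come out of the same machine.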
Paulo Shakarian, an associate professor in Arizona State University's engineering department who runs a lab exploring challenges confronting AI, completed a study that measured ChatGPT's performance on mathematical word problems.
Solving these word problems involves multiple steps: the words must first be translated into mathematical equations, which are then solved in sequence. This sort of multi-step process requires logical reasoning, which is something the algorithm is not engineered to do; the hypothetical example below shows how many steps even a simple word problem hides.
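Consider a word problem invented here for illustration: "Alice has three times as many apples as Bob, and together they have 24 apples. How many does Bob have?" Answering it means first translating the sentences into the equations a = 3b and a + b = 24, and only then doing the algebra and arithmetic:

```python
# Hypothetical word problem: "Alice has three times as many apples as
# Bob; together they have 24 apples. How many does Bob have?"
#
# Step 1 (translate):  a = 3 * b  and  a + b = 24
# Step 2 (algebra):    substitute: 3*b + b = 24, so 4*b = 24
# Step 3 (arithmetic): b = 24 / 4 = 6, hence a = 18

total = 24
ratio = 3  # Alice has `ratio` times as many apples as Bob

bob = total / (ratio + 1)  # 4*b = 24  ->  b = 6
alice = ratio * bob        # a = 3*b   ->  a = 18

assert alice + bob == total
print(f"Bob has {bob:.0f} apples; Alice has {alice:.0f}.")
```

Each step depends on the one before it, so a single slip anywhere derails the whole chain; a program that merely predicts plausible next words has no built-in mechanism for carrying that kind of state from step to step.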
“Our initial tests on ChatGPT, done in early January, indicate that performance is significantly below the 60% accuracy of state-of-the-art algorithms for math word problem-solving,” says Shakarian.