Generative artificial intelligence (AI) is notoriously prone to factual errors. So, what do you do when you’ve asked ChatGPT to generate 150 presumed facts and you don’t want to spend an entire weekend confirming each by hand?
Well, in my case, I turned to other AIs. In this article, I’ll explain the project, consider how each AI performed in a fact-checking showdown, and provide some final thoughts and cautions if you also want to venture down this maze of twisty little passages, all alike.
So here’s the thing. If GPT-4, the OpenAI large language model (LLM) used by ChatGPT Plus, generated the fact statements, I wasn’t entirely convinced it should be the one checking them. That’s like asking high school students to write a history paper without using any references and then letting them grade their own work. They’re already starting with suspect information, and then you’re letting them correct themselves? No, that doesn’t sound right to me.
But what if we fed those facts to other LLMs inside other AIs? Both Google’s Bard and Anthropic’s Claude have their own LLMs. Bing uses GPT-4, but I figured I’d test its responses just to be a completionist. As you’ll see, I got the best feedback from Bard, so I fed its responses back into ChatGPT in a round-robin perversion of the natural order of the universe.
Anthropic Claude
Claude runs on the Claude 2 LLM, which also powers Notion’s AI implementation. I provided it with a PDF containing the full set of facts (without pictures). Overall, Claude found the fact list to be mostly accurate, but it had clarifications for three items. Because the facts ChatGPT generated were kept deliberately short, there wasn’t much room for nuance in the descriptions, and Claude took issue with some of that missing nuance. In general, it was an encouraging response.
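For the record, I did this through Claude’s web interface, but the same check could be scripted. Below is a minimal sketch, assuming the facts live in a local facts.pdf, an Anthropic API key is set in the environment, and claude-2.1 is a reasonable stand-in for whatever Claude 2 snapshot the chat interface was running (none of those details are part of the original test):

# A rough sketch of running the same fact-check through Anthropic's API.
# Assumptions (mine, not part of the original test): the facts are in a local
# file called facts.pdf, ANTHROPIC_API_KEY is set in the environment, and
# claude-2.1 stands in for the Claude 2 model behind the chat interface.
import anthropic
from pypdf import PdfReader

# Pull the plain text out of the PDF; the pictures don't matter here anyway.
reader = PdfReader("facts.pdf")
facts = "\n".join(page.extract_text() for page in reader.pages)

client = anthropic.Anthropic()
message = client.messages.create(
    model="claude-2.1",
    max_tokens=1500,
    messages=[{
        "role": "user",
        "content": (
            "The following text contains state names followed by three facts "
            "for each state. Please examine the facts and identify any that "
            "are in error for that state.\n\n" + facts
        ),
    }],
)

print(message.content[0].text)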
Copilot… or nopilot?
Moving on to Microsoft’s Copilot, the renamed Bing Chat AI. Copilot doesn’t allow PDFs to be uploaded, so I attempted to paste in the text of all 50 states’ facts. That approach failed immediately because Copilot only accepts prompts of up to 2,000 characters.
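If you want to work around a cap like that, a rough sketch of chunking the list state by state follows. The file name and the paragraph-per-state layout are assumptions of mine, not something from the original test; the 2,000-character budget has to cover the instructions as well as the facts:

# A sketch of splitting a long fact list into prompt-sized pieces for a chat
# box with a hard character cap. Assumptions (mine): the facts are in a plain
# text file, facts.txt, with one state's facts per blank-line-separated
# paragraph.

INSTRUCTION = (
    "The following text contains state names followed by three facts for "
    "each state. Please examine the facts and identify any that are in "
    "error for that state.\n\n"
)
LIMIT = 2000


def chunk_facts(path):
    paragraphs = open(path, encoding="utf-8").read().split("\n\n")
    chunks, current = [], INSTRUCTION
    for para in paragraphs:
        # Start a new chunk when adding this state would blow past the cap.
        if len(current) + len(para) + 2 > LIMIT:
            chunks.append(current.rstrip())
            current = INSTRUCTION
        current += para + "\n\n"
    chunks.append(current.rstrip())
    return chunks


for i, chunk in enumerate(chunk_facts("facts.txt"), start=1):
    print("--- paste #%d (%d characters) ---" % (i, len(chunk)))
    print(chunk)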
I asked Copilot the following: “The following text contains state names followed by three facts for each state. Please examine the facts and identify any that are in error for that state.” Here’s what I got back:
It essentially repeated the fact data I asked it to check. Attempts to guide it with a more forceful prompt just produced the same data I’d asked it to verify. That output seemed odd, considering Copilot uses the same LLM as ChatGPT; clearly, Microsoft has tuned it differently. I gave up and moved on to Bard.
Bard
Google recently announced its new Gemini LLM. Since I don’t yet have access to Gemini, I ran these tests against Google’s PaLM 2 model. Compared to Claude and Copilot, Bard excelled. Or, to put it in a more Shakespearian manner, it “doth bestride the narrow world like a Colossus.” Check out the results below:
However, Bard’s answers weren’t flawless. I fed this list back to ChatGPT, which flagged two discrepancies, in the Alaska and Ohio answers. Bard’s fact-checking appears impressive, but it often misses the point and gets things just as wrong as any other AI.
Let’s consider Nevada and Area 51 as an example. ChatGPT wrote, “Top-secret military base, rumoured UFO sightings.” Bard attempted to clarify, asserting, “Area 51 isn’t merely rumoured to have UFO sightings. It’s a genuine top-secret military facility, and its purpose is unknown.” Essentially, the two statements convey the same information; the nuance Bard complained about was lost to the tight word limit on ChatGPT’s facts, not to any factual error.
Another instance where Bard criticized ChatGPT without grasping the context concerned Minnesota. Yes, Wisconsin boasts numerous lakes, but ChatGPT never asserted that Minnesota has the most lakes. It simply labelled Minnesota the “Land of 10,000 lakes,” a common slogan for the state.
Kansas became another point of contention for Bard. ChatGPT stated, “Home to the geographic centre of the contiguous US.” Bard argued the centre is in South Dakota, which is true only if you include Alaska and Hawaii. However, ChatGPT specified “contiguous,” and in that context, the honour goes to a spot near Lebanon, Kansas.
ChatGPT
Right away, I could tell Bard got one of its facts wrong: Alaska is far bigger than Texas. So I wanted to see whether ChatGPT could fact-check Bard’s claims. For context, it’s commonly accepted that Wilbur and Orville Wright flew the first aircraft, at Kitty Hawk, North Carolina, although they built their Wright Flyer in Dayton, Ohio.
As you can see, ChatGPT took issue with Bard’s erroneous claim that Texas is the biggest state. It also had a bit of a tizzy over Ohio vs. North Carolina as the birthplace of aviation, a topic more controversial than most schools let on.
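If you’d rather script that round-robin step than shuttle text between browser tabs, a minimal sketch using the OpenAI Python SDK follows. The file name and prompt wording are mine, not part of the original test, and it assumes GPT-4 API access with a key in the environment:

# A sketch of scripting the round-robin step: hand Bard's critique to GPT-4
# for a second opinion. Assumptions (mine): OPENAI_API_KEY is set and Bard's
# output has been saved to a local file called bard_critique.txt.
from openai import OpenAI

client = OpenAI()

critique = open("bard_critique.txt", encoding="utf-8").read()

response = client.chat.completions.create(
    model="gpt-4",
    messages=[
        {"role": "system",
         "content": "You are a careful fact-checker. Point out any claims in "
                    "the following text that are factually wrong, and explain why."},
        {"role": "user", "content": critique},
    ],
)

print(response.choices[0].message.content)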
Conclusions and cautions
Let’s address something upfront: if you’re submitting a paper or a document where factual accuracy is crucial, it’s imperative to conduct your own fact-checking. Otherwise, your aspirations, akin to Texas, might find themselves overshadowed by an Alaska-sized problem.
As evidenced in our tests, the outcomes, much like Bard’s, might appear impressive but can be entirely or partially inaccurate. On the whole, having various AIs cross-check each other was an intriguing exercise, and it’s one I intend to explore further. However, the results only firmly established how inconclusive they were.
Copilot threw in the towel entirely, expressing a desire to return to its nap. Claude raised concerns about the nuance in a few responses. Bard came down hard on a multitude of answers, proving that, apparently, to err is not confined to human nature but extends to AI as well.
In conclusion, let me borrow the words of the venerable Bard himself and proclaim, “Confusion now hath made his masterpiece!”