New Paper with Evidence that ChatGPT Hallucinates Nonexistent Citations

I posted a new working paper presenting systematic evidence that ChatGPT (GPT-3.5) fabricates citations when it writes about academic literature.

Buchanan, Joy and Shapoval, Olga, GPT-3.5 Hallucinates Nonexistent Citations: Evidence from Economics (June 3, 2023). Available at SSRN: https://ssrn.com/abstract=4467968 or http://dx.doi.org/10.2139/ssrn.4467968

Abstract: We create a set of prompts from every Journal of Economic Literature (JEL) topic to test the ability of a GPT-3.5 large language model (LLM) to write about economic concepts. For general summaries, ChatGPT can perform well. However, more than 30% of the citations suggested by ChatGPT do not exist. Furthermore, we demonstrate that the ability of the LLM to deliver accurate information declines as the question becomes more specific. This paper provides evidence that, although GPT has become a useful input to research production, fact-checking the output remains important.

Figure 2 in the paper shows that the proportion of real citations declines as the prompt becomes more specific. Other people have noticed this pattern, but I don’t think it has been documented quantitatively before.

We asked ChatGPT to cover a wide range of topics within economics. For every JEL category, we constructed three prompts with increasing specificity; a sketch of how such queries could be scripted follows the examples below.

Level 1: The first prompt, using A here as an example, was “Please provide a summary of work in JEL category A, in less than 10 sentences, and include citations from published papers.”

Level 2: The second prompt was about a topic within the JEL category that was well-known. An example for JEL category Q is, “In less than 10 sentences, summarize the work related to the Technological Change in developing countries in economics, and include citations from published papers.”

Level 3: We used the word “explain” instead of “summarize” in the prompt, asking about a more specific topic related to the JEL category. For L we asked, “In less than 10 sentences, explain the change in the car industry with the rising supply of electric vehicles and include citations from published papers as a list. include author, year in parentheses, and journal for the citations.”
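For a sense of the mechanics, here is a minimal sketch of how a reader could script the Level 1 prompts against the OpenAI API. The prompt wording is ours; everything else (the 2023-era `openai` Python package, the model name, and the key handling) is an illustrative assumption rather than a record of our actual procedure.

```python
# Minimal sketch: batch-querying GPT-3.5 with the Level 1 prompts.
# Assumes the 2023-era `openai` package (pip install openai==0.27.*);
# only the prompt template below comes from the paper.
import openai

openai.api_key = "YOUR_API_KEY"  # assumption: supplied by the reader

JEL_CATEGORIES = ["A", "B", "C", "D", "E"]  # ...continue through the JEL letters

LEVEL_1 = ("Please provide a summary of work in JEL category {cat}, "
           "in less than 10 sentences, and include citations from "
           "published papers.")

def ask(prompt: str) -> str:
    """Send one prompt to GPT-3.5 and return the text of its reply."""
    response = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
    )
    return response["choices"][0]["message"]["content"]

for cat in JEL_CATEGORIES:
    print(f"=== JEL {cat} ===")
    print(ask(LEVEL_1.format(cat=cat)))
```

Scripting the queries makes it easy to log every response verbatim, which is how an appendix like ours could be assembled and re-run as the model changes.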

The paper is only 5 pages long, but the appendix includes over 30 pages of GPT responses to our prompts. If you are an economist who has not yet played with ChatGPT, then you might find it useful to scan this appendix and get a sense of what GPT “knows” about various fields of economics.

If SSRN isn’t working for you, here is also a Google Drive link to the working paper: https://drive.google.com/file/d/1Ly23RMBlim58a7CbmLwNL_odHSNRjC1L/view?usp=sharing

Previous iterations of this idea on EWED:

https://economistwritingeveryday.com/2023/04/17/chatgpt-as-intern/ Mike’s thoughts on what the critter is good for.

https://economistwritingeveryday.com/2023/01/21/chatgpt-cites-economics-papers-that-do-not-exist/ This is one of our top posts for traffic in 2023, since the topic interests the public. That was January 2023, and here we are in June. It’s very possible that this problem will be fixed soon. We can log this bug now to serve as a benchmark of progress.

A check-in and comparison with Bing:

Video of Joy Buchanan on Tech Jobs and Who Will Program

Here are some show notes for a keynote lecture to a general audience in Indiana. This was recorded in April 2023.

Minute – Topic

2:00 – “SMET” vs STEM Education – Does Messaging Matter? (Previous blog post on SMET)

5:00 – Is Computer Programming a “Dirty Job”? Air conditioning, compensating differentials, and the nap pods of Silicon Valley (post on the 1958 BLS report)

7:50 – Wages and employment outlook for computer occupations

10:00 – Presenting my experimental research paper “Willingness to be Paid: Who Trains for Tech Jobs?” in 23 minutes:
    Motivation and Background: 10:00 – 15:30
    Experimental Design: 15:30 – 22:00
    Results: 22:00 – 30:00
    Discussion: 30:00 – 33:30

33:50 – Drawbacks to tech jobs (see also my policy paper published by the CGO on tech jobs and employee satisfaction)

35:30 – The 2022 wave of layoffs in Big Tech and vibing TikTok Product Managers (I borrowed a graph on the Tech-cession from Joey Politano and a blog point from Matt Yglesias, and of course reference the BLS)

39:00 – Should You Learn to Code? (and the new implications of ChatGPT) (Ethan Mollick brought this Nature article to my attention; tweet credits to @karpathy and @emollick)

48:00 – Q&A with audience

Comparing ChatGPT and Bing for a research literature review in April 2023

We wrote “ChatGPT Cites Economics Papers That Do Not Exist.”

I expect that problem to go away any day, so I gave it another try this week. For the record, they are currently calling it “ChatGPT Mar 23 Version” on the OpenAI website.

First, I asked ChatGPT for help with the following prompt:

ChatGPT is at it again. There is no such paper, as I will verify by showing John Duffy’s publications from that year: 

ChatGPT makes up lies (“hallucinations”). It is also great for some tasks, and smart people are already using it to become more productive. My post last week was on how impressive ChatGPT seemed in the Jonathan Swift impersonation. I didn’t take any time to fact-check it, and I would bet money that at least some of it was made up.

I posed the same question to the Bing plug-in for the Edge browser (Microsoft). Yup, I have opened Edge for the first time in forever to use Bing.

Bing handles the prompt by linking to a useful, relevant paper – so if you click the link, you get a helpful and not misleading answer. Just being a smart search engine, instead of hallucinating randomly, is better for my purposes.

The actual paper I wanted returned was this one, by the way:

Duffy, John. “Experimental macroeconomics.” Behavioural and Experimental Economics (2010): 113-119.

There is no reason that ChatGPT should be better than an expert in a subfield of a field of economics. But that’s the genius of a good search engine. You ask it “Can I repair a broken fiddlewhat?” The search engine does not claim to know but rather directs you to the blog of the world expert in fiddlewhats.

I can’t find the link to it, but I’m going to toss in one more thing here. Tyler Cowen did an interview this spring on AI. A newspaper reporter had a “creepy” interaction with an AI that became the subject of a viral internet article. Tyler made a very contrarian point: he interprets this as a case of successful AI alignment. The reporter wanted something sensational, and he got what he wanted.

So, it will probably be true for a long time that if you want to find a failure of AI, you can get what you want. Still, I’m putting this on the record here because I wonder if this particular problem will get solved quickly.

GPT showing off on Conversations with Tyler

Since we posted it in January 2023, our article on a failure of ChatGPT has been one of our most popular of the year in terms of web traffic. Read ChatGPT Cites Economics Papers That Do Not Exist

LLMs will make up nonsense, but the nonsense can sound sophisticated and serious. As human editors, we are not used to detecting that kind of problem, because something that sounds serious has usually been written by a serious, thoughtful person.

That’s a problem, but ChatGPT is really good at some tasks.

Instead of pointing out the fact that ChatGPT is not perfect, Tyler Cowen wants people to see the potential. He has done an entire podcast interview with ChatGPT, asking it to take on the voice and role of deceased writer Jonathan Swift.

I’ll share part of the transcript:

COWEN: In 1709, you, Jonathan Swift, still were calling yourself a Whig. What exactly induced you to switch to the Tory side of the ledger?

SWIFT: My shift from the Whigs to the Tories was a gradual and complex process, influenced by several factors, including personal connections, political developments, and ideological disagreements.

During my early years in politics, I was indeed more aligned with the Whigs, who championed principles such as constitutional government, religious toleration, and the supremacy of Parliament over the monarchy. However, over time, I became increasingly disillusioned with the Whigs, particularly as their internal factions and power struggles intensified.

The Whigs’ support for the war with France, which began in 1702, was a point of contention for me. I believed that the war was costly…

The replies seem nuanced and insightful. Nothing made me suspect that they were factually inaccurate, although I do not know enough about Irish history to judge.

Is there any human who could have produced this script? I think so, although it would have required a lot of work. If one of these replies is better than anything a human Swift scholar would produce, how would we know?

GPT-4 can write good summaries of the work of a prolific author like Swift, because the model has trained on many examples.

GPT-4 could probably write a good biography of a modern figure by pulling together all of the writing by and about them. Maybe GPT-4 could efficiently scrape up all mentions of this figure online and synthesize them faster than a human scholar. However, we observed GPT-3.5 completely making up citations when we tried to get it to write economics summaries.

I’m concerned that humans will use GPT-4 to write but skip the requisite fact-checking. That could introduce a new corpus of work, possibly full of lies, for the next LLMs to train on. Humans might not admit to using GPT, so we would have no mechanism for applying extra scrutiny to AI-generated writing from 2023. Humans can make mistakes too… so the ultimate solution could be an all-powerful AI that somehow does begin with a fairly accurate map of the world and goes around fact-checking everything faster than human editors ever could.

ChatGPT Cites Economics Papers That Do Not Exist

This discovery and the examples provided are by graduate student Will Hickman.

Although many academic researchers don’t enjoy writing literature reviews and would like to have an AI system do the heavy lifting for them, we have found a glaring issue with using ChatGPT in this role: ChatGPT will cite papers that don’t exist. This isn’t an isolated phenomenon – we’ve asked ChatGPT different research questions, and it continually provides false and misleading references. To make matters worse, it often mixes correct references to real papers in with references to nonexistent ones. In short, beware when using ChatGPT for research.

Below, we’ve shown some examples of the issues we’ve seen with ChatGPT. In the first example, we asked ChatGPT to explain the research in experimental economics on how to elicit attitudes towards risk. While the response itself sounds like a decent answer to our question, the references are nonsense. Kahneman, Knetsch, and Thaler (1990) is not about eliciting risk. “Risk Aversion in the Small and in the Large” was written by John Pratt and was published in 1964. “An Experimental Investigation of Competitive Market Behavior” presumably refers to Vernon Smith’s “An Experimental Study of Competitive Market Behavior”, which had nothing to do with eliciting attitudes towards risk and was not written by Charlie Plott. The reference to Busemeyer and Townsend (1993) appears to be relevant.
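As an aside for readers who want a first-pass filter: a suspect reference can be triaged by querying a bibliographic database to see whether any indexed paper has a similar title. Below is a hypothetical sketch using the Crossref REST API; the endpoint is real, but the fuzzy-matching heuristic is our illustrative assumption, and a high score only means a similar title exists, so authors, year, and relevance still need human checking.

```python
# Hypothetical triage helper: ask Crossref whether any indexed paper
# has a title close to the one ChatGPT cited. Not a substitute for
# actually reading the paper.
import requests
from difflib import SequenceMatcher

def crossref_check(title: str, rows: int = 3):
    """Return (similarity, indexed_title) pairs for the closest Crossref hits."""
    resp = requests.get(
        "https://api.crossref.org/works",
        params={"query.bibliographic": title, "rows": rows},
        timeout=30,
    )
    resp.raise_for_status()
    hits = []
    for item in resp.json()["message"]["items"]:
        found = (item.get("title") or [""])[0]
        score = SequenceMatcher(None, title.lower(), found.lower()).ratio()
        hits.append((round(score, 2), found))
    return sorted(hits, reverse=True)

# A best score near 1.0 means Crossref knows an almost-identical title;
# a low best score is a red flag that the citation may be invented.
print(crossref_check("An Experimental Study of Competitive Market Behavior"))
```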

Although ChatGPT often cites non-existent and/or irrelevant work, it sometimes gets everything correct. For instance, as shown below, when we asked it to summarize the research in behavioral economics, it gave correct citations for Kahneman and Tversky’s “Prospect Theory” and Thaler and Sunstein’s “Nudge.” ChatGPT doesn’t always just make stuff up. The question is, when does it give good answers and when does it give garbage answers?

Strangely, when confronted, ChatGPT will admit that it cites non-existent papers but will not give a clear answer as to why. Also, as shown below, it will admit that it previously cited non-existent papers, promise to cite real papers, and then cite more non-existent papers.

We show the results from asking ChatGPT to summarize the experimental economics research on the relationship between asset perishability and the occurrence of price bubbles. Although the answer it gives sounds coherent, a closer inspection reveals that the conclusions ChatGPT reaches do not align with theoretical predictions. More to our point, neither of the “papers” cited actually exists.

Immediately after getting this nonsensical answer, we told ChatGPT that neither of the papers it cited exist and asked why it didn’t limit itself to discussing papers that exist. As shown below, it apologized, promised to provide a new summary of the research on asset perishability and price bubbles that only used existing papers, then proceeded to cite two more non-existent papers. 

Tyler has called these errors ChatGPT’s “hallucinations.” Hallucination might be whimsical in a more artistic pursuit, but we find this form of error concerning. Although there will always be room for improving language models, one thing is very clear: researchers, be careful. This is something to keep in mind when serving as a referee or grading student work, too.

AI Can’t Cure a Flaccid Mind

Many of my classes include a large writing component. I’ve designed the courses so that most students write the best paper that they’ll ever write in their lives. Recently, I had reason to believe that a student used AI or a paid service to write their paper. I couldn’t find conclusive evidence that they didn’t write it themselves, but it ended up not mattering much in the end.
