ChatGPT on Advent

I have a paper that emphasizes ChatGPT errors. It is important to recognize that LLMs can make mistakes. However, someone could look at our data and emphasize the opposite interpretation: on many points, and even when coming up with citations, the LLM generated correct sentences. More than half of the content was good.

You can read ChatGPT’s take on a wide variety of topics within economics in the appendix of our paper. The journal hosts it at https://journals.sagepub.com/doi/suppl/10.1177/05694345231218454/suppl_file/sj-pdf-1-aex-10.1177_05694345231218454.pdf If that link does not work, the appendix has been available on SSRN since June as part of the older version of the paper.

Apparently, LLMs just solved a previously unsolved math problem. Is there anything they can’t do? Considering how much of human expression and culture revolves around religion, we can expect AIs to get involved in that aspect of life.

Alex thinks it will be a short hop from a Personal Jesus Chatbot to a whole new AI religion. We’ll see. People have had “LLMs” in the form of human pastors, shamans, or rabbis for a long time, and yet the practice of sticking to one sacred text for reference has remained stable. I think people might feel the same way in the AI era – stick to the canon for a common point of reference. Text written before the AI era will be considered special for a long time, I predict. Even AIs ought to be suspicious of AI-generated content, just as humans are now (or are they?).

Many religious traditions have lots of training literature. (In our ChatGPT errors paper, we expect LLMs to produce reliable content on topics for which there is plentiful training literature.)

I gave ChatGPT this prompt:

Can you write a Bible study? I’d like this to be appropriate for the season of Advent, but I’d like most of the Bible readings to be from the book of Job. I’d like to consider what Job was going through, because he was trying to understand the human condition and our relationship to God before the idea of Jesus. Job had a conception of the goodness of God, but he didn’t have the hope of the Gospel. Can you work with that?

Video for new ChatGPT users

Have you not gotten around to trying ChatGPT for yourself yet?

Ethan and Lilach Mollick have released a series of YouTube videos, posted on Aug. 1, 2023, that encapsulate current insights for beginners. The series covers ChatGPT, Bing, and Bard, all tools that everyday users can access for free.

Practical AI for Instructors and Students Part 2: Large Language Models (LLMs)

If you are already using ChatGPT, then this video will probably feel too slow. However, it offers some tips that even people who have already experimented could learn from. E. Mollick says of LLMs that “they are not sentient,” but it might be helpful to treat them as if they are. He also recommends thinking of ChatGPT as an “intern,” which is also how Mike framed his suggestion back in April.

  • I used GPT-3.5 a few times this week for routine work tasks. I am not a heavy user, but if any of our readers are still on the fence, I’d encourage you to watch this video and give it a try. Be a “complement” to ChatGPT.
  • I’ll be posting new updates about my own ChatGPT research soon – the errors paper and also a new survey on trust in AI.
  • I hear regular complaints from my colleagues all over the country about poor attempts by college students to get GPT to do their course work. The experiment is being run.
  • Ethan Mollick has been a good Twitter(X) follow for the past year, if you want to keep up with the evolution and study of Large Language Models. https://twitter.com/emollick/status/1709379365883019525
  • Scott wrote this great recent tutorial on the theory behind the tools: Generative AI Nano-Tutorial
  • It was only back in December 2022 that I did a live ChatGPT demonstration in class and figured that I was giving my students their first-ever look at LLMs. Today, I’d assume that all my students have tried it for themselves.
  • In my paper on who will train for tech jobs, I conclude that the labor supply of programmers would increase if more people enjoyed the work. LLMs might make tech jobs less tedious and therefore more fun. If labor supply shifts out, then quantity should increase and wages should fall (see the toy calculation after this list) – good news for innovative businesses.
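To make the supply-and-demand logic in that last bullet concrete, here is a toy calculation with made-up linear curves (the numbers are purely illustrative, not estimates from my paper): an outward shift in labor supply raises equilibrium quantity and lowers the equilibrium wage.

```python
# Toy comparative statics with invented numbers: labor demand w = 100 - 2Q,
# labor supply w = 20 + 2Q; an outward supply shift lowers the supply intercept.
def equilibrium(demand_intercept, demand_slope, supply_intercept, supply_slope):
    q = (demand_intercept - supply_intercept) / (demand_slope + supply_slope)
    w = demand_intercept - demand_slope * q
    return q, w

print("before shift:", equilibrium(100, 2, 20, 2))  # (20.0, 60.0): Q = 20, w = 60
print("after shift: ", equilibrium(100, 2, 10, 2))  # (22.5, 55.0): Q up, w down
```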

OpenAI wants you to fool their AI

OpenAI created the popular Dall-E and ChatGPT AI models. They try to make their models “safe”, but many people make a hobby of breaking through any restrictions and getting ChatGPT to say things it’s not supposed to:

Source: Zack Witten

Now trying to fool OpenAI models can be more than a hobby. OpenAI just announced a call for experts to “Red Team” their models. They have already been doing all sorts of interesting adversarial tests internally:

Now they want all sorts of external experts to give it a try, including economists:

This seems like a good opportunity to me, both to work on important cutting-edge technology, and to at least arguably make AI safer for humanity. For a long time it seemed like you had to be a top-tier mathematician or machine learning programmer to have any chance of contributing to AI safety, but the field is now broadening dramatically as capable models start to be deployed widely. I plan to apply if I find any time to spare; perhaps some of you will too.

The models definitely still need work – this is what I got after prompting Dall-E 2 for “A poster saying ‘OpenAI wants you…. to fool their models’ in the style of ‘Uncle Sam Wants You’”

Generative AI Nano-Tutorial

Everyone who has not been living under a rock this year has heard the buzz around ChatGPT and generative AI. However, not everyone has clear definitions in mind, or an understanding of how this stuff works.

Artificial intelligence (AI) has been around in one form or another for decades. Computers have long been used to analyze information and come up with actionable answers. Classically, computer output has been in the form of numbers or graphical representations of numbers – or perhaps chess moves, which have beaten all human opponents since about 2000.

Generative AI is able to “generate” a variety of novel content, such as images, video, music, speech, text, software code and product designs, with quality which is difficult to distinguish from human-produced content. This mimicry of human content creation is enabled by having the AI programs analyze reams and reams of existing content (“training data”), using enormous computing power.

I wanted to excerpt here a fine article I just saw which is informative on this subject. Among other things, it lists some examples of gen-AI products, and describes the “transformer” model that underpins many of these products. I skipped the section of the article that discusses the potential dangers of gen-AI (e.g., problems with false “hallucinations”), since that topic has been treated already in this blog.

Between this article and the Wikipedia article on Generative artificial intelligence, you should be able to hold your own, or at least ask intelligent questions, when the subject next comes up in your professional life (which it likely will, sooner or later).

One technical point for data nerds is the distinction between “generative” and “discriminative” approaches in modeling. This is not treated in the article below, but see here.
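For readers who want a hands-on version of that distinction, here is a minimal sketch (my own illustration, not from the article below) fitting a generative classifier, which models how the features are distributed within each class, next to a discriminative one, which models the class boundary directly. The synthetic dataset and the choice of Naive Bayes versus logistic regression are just convenient examples.

```python
# Minimal sketch: generative vs. discriminative classifiers on the same data.
# Assumes scikit-learn is installed; the dataset and models are illustrative only.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB            # generative: models P(x | y) and P(y)
from sklearn.linear_model import LogisticRegression   # discriminative: models P(y | x) directly

X, y = make_classification(n_samples=500, n_features=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

generative = GaussianNB().fit(X_train, y_train)
discriminative = LogisticRegression(max_iter=1000).fit(X_train, y_train)

print("Naive Bayes accuracy:         ", generative.score(X_test, y_test))
print("Logistic regression accuracy: ", discriminative.score(X_test, y_test))
```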

All text below the line of asterisks is from Generative AI Defined: How it Works, Benefits and Dangers, by Owen Hughes, Aug 7, 2023.

*******************************************************

What is generative AI in simple terms?

Generative AI is a type of artificial intelligence technology that broadly describes machine learning systems capable of generating text, images, code or other types of content, often in response to a prompt entered by a user.

Generative AI models are increasingly being incorporated into online tools and chatbots that allow users to type questions or instructions into an input field, upon which the AI model will generate a human-like response.

How does generative AI work?

Generative AI models use a complex computing process known as deep learning to analyze common patterns and arrangements in large sets of data and then use this information to create new, convincing outputs. The models do this by incorporating machine learning techniques known as neural networks, which are loosely inspired by the way the human brain processes and interprets information and then learns from it over time.

To give an example, by feeding a generative AI model vast amounts of fiction writing, over time the model would be capable of identifying and reproducing the elements of a story, such as plot structure, characters, themes, narrative devices and so on.

……

Examples of generative AI

…There are a variety of generative AI tools out there, though text and image generation models are arguably the most well-known. Generative AI models typically rely on a user feeding it a prompt that guides it towards producing a desired output, be it text, an image, a video or a piece of music, though this isn’t always the case.

Examples of generative AI models include:

  • ChatGPT: An AI language model developed by OpenAI that can answer questions and generate human-like responses from text prompts.
  • DALL-E 2: Another AI model by OpenAI that can create images and artwork from text prompts.
  • Google Bard: Google’s generative AI chatbot and rival to ChatGPT. It’s trained on the PaLM large language model and can answer questions and generate text from prompts.
  • Midjourney: Developed by San Francisco-based research lab Midjourney Inc., this gen AI model interprets text prompts to produce images and artwork, similar to DALL-E 2.
  • GitHub Copilot: An AI-powered coding tool that suggests code completions within the Visual Studio, Neovim and JetBrains development environments.
  • Llama 2: Meta’s open-source large language model can be used to create conversational AI models for chatbots and virtual assistants, similar to GPT-4.
  • xAI: After funding OpenAI, Elon Musk left the project in July 2023 and announced this new generative AI venture. Little is currently known about it.

Types of generative AI models

There are various types of generative AI models, each designed for specific challenges and tasks. These can broadly be categorized into the following types.

Transformer-based models

Transformer-based models are trained on large sets of data to understand the relationships between sequential information, such as words and sentences. Underpinned by deep learning, these AI models tend to be adept at NLP [natural language processing] and understanding the structure and context of language, making them well suited for text-generation tasks. ChatGPT-3 and Google Bard are examples of transformer-based generative AI models.

Generative adversarial networks

GANs are made up of two neural networks known as a generator and a discriminator, which essentially work against each other to create authentic-looking data. As the name implies, the generator’s role is to generate convincing output such as an image based on a prompt, while the discriminator works to evaluate the authenticity of said image. Over time, each component gets better at their respective roles, resulting in more convincing outputs. Both DALL-E and Midjourney are examples of GAN-based generative AI models…

Multimodal models

Multimodal models can understand and process multiple types of data simultaneously, such as text, images and audio, allowing them to create more sophisticated outputs. An example might be an AI model capable of generating an image based on a text prompt, as well as a text description of an image prompt. DALL-E 2 and OpenAI’s GPT-4 are examples of multimodal models.

What is ChatGPT?

ChatGPT is an AI chatbot developed by OpenAI. It’s a large language model that uses transformer architecture — specifically, the “generative pretrained transformer”, hence GPT — to understand and generate human-like text.

What is Google Bard?

Google Bard is another example of an LLM based on transformer architecture. Similar to ChatGPT, Bard is a generative AI chatbot that generates responses to user prompts.

Google launched Bard in the U.S. in March 2023 in response to OpenAI’s ChatGPT and Microsoft’s Copilot AI tool. In July 2023, Google Bard was launched in Europe and Brazil.

…….

Benefits of generative AI

For businesses, efficiency is arguably the most compelling benefit of generative AI because it can enable enterprises to automate specific tasks and focus their time, energy and resources on more important strategic objectives. This can result in lower labor costs, greater operational efficiency and new insights into how well certain business processes are — or are not — performing.

For professionals and content creators, generative AI tools can help with idea creation, content planning and scheduling, search engine optimization, marketing, audience engagement, research and editing and potentially more. Again, the key proposed advantage is efficiency because generative AI tools can help users reduce the time they spend on certain tasks so they can invest their energy elsewhere. That said, manual oversight and scrutiny of generative AI models remains highly important.

Use cases of generative AI

Generative AI has found a foothold in a number of industry sectors and is rapidly expanding throughout commercial and consumer markets. McKinsey estimates that, by 2030, activities that currently account for around 30% of U.S. work hours could be automated, prompted by the acceleration of generative AI.

In customer support, AI-driven chatbots and virtual assistants help businesses reduce response times and quickly deal with common customer queries, reducing the burden on staff. In software development, generative AI tools help developers code more cleanly and efficiently by reviewing code, highlighting bugs and suggesting potential fixes before they become bigger issues. Meanwhile, writers can use generative AI tools to plan, draft and review essays, articles and other written work — though often with mixed results.

The use of generative AI varies from industry to industry and is more established in some than in others. Current and proposed use cases include the following:

  • Healthcare: Generative AI is being explored as a tool for accelerating drug discovery, while tools such as AWS HealthScribe allow clinicians to transcribe patient consultations and upload important information into their electronic health record.
  • Digital marketing: Advertisers, salespeople and commerce teams can use generative AI to craft personalized campaigns and adapt content to consumers’ preferences, especially when combined with customer relationship management data.
  • Education: Some educational tools are beginning to incorporate generative AI to develop customized learning materials that cater to students’ individual learning styles.
  • Finance: Generative AI is one of the many tools within complex financial systems to analyze market patterns and anticipate stock market trends, and it’s used alongside other forecasting methods to assist financial analysts.
  • Environment: In environmental science, researchers use generative AI models to predict weather patterns and simulate the effects of climate change

….

Generative AI vs. machine learning

As described earlier, generative AI is a subfield of artificial intelligence. Generative AI models use machine learning techniques to process and generate data. Broadly, AI refers to the concept of computers capable of performing tasks that would otherwise require human intelligence, such as decision making and NLP.

Machine learning is the foundational component of AI and refers to the application of computer algorithms to data for the purposes of teaching a computer to perform a specific task. Machine learning is the process that enables AI systems to make informed decisions or predictions based on the patterns they have learned.

(Again, to make sure credit goes where it is due: the preceding text, below the line of asterisks, was excerpted from Generative AI Defined: How it Works, Benefits and Dangers, by Owen Hughes.)
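As a concrete, scaled-down illustration of the generator-versus-discriminator loop the excerpt describes, here is a toy GAN sketch of my own (not code from the quoted article). The one-dimensional “data” are draws from a normal distribution, standing in for the images a real GAN would generate.

```python
# Toy GAN sketch (illustrative only). The "real" data are samples from N(5, 2);
# the generator learns to produce numbers the discriminator cannot tell apart
# from real draws. Assumes PyTorch is installed.
import torch
import torch.nn as nn

generator = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 1))
discriminator = nn.Sequential(nn.Linear(1, 16), nn.ReLU(), nn.Linear(16, 1), nn.Sigmoid())
g_opt = torch.optim.Adam(generator.parameters(), lr=1e-3)
d_opt = torch.optim.Adam(discriminator.parameters(), lr=1e-3)
loss_fn = nn.BCELoss()

for step in range(2000):
    real = torch.randn(64, 1) * 2 + 5          # "real" data: N(5, 2)
    fake = generator(torch.randn(64, 8))       # generated data

    # Discriminator step: label real samples 1, generated samples 0.
    d_loss = loss_fn(discriminator(real), torch.ones(64, 1)) + \
             loss_fn(discriminator(fake.detach()), torch.zeros(64, 1))
    d_opt.zero_grad(); d_loss.backward(); d_opt.step()

    # Generator step: try to make the discriminator output 1 for generated samples.
    g_loss = loss_fn(discriminator(fake), torch.ones(64, 1))
    g_opt.zero_grad(); g_loss.backward(); g_opt.step()

print(generator(torch.randn(5, 8)).detach())   # outputs should drift toward 5
```

After training, the generator’s outputs should drift toward the real data’s mean; an image GAN plays the same game, only with pixels and much larger networks.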

Malwarebytes Poll: Public Fascination with ChatGPT Has Turned to Suspicion

ChatGPT and related AI have been all the rage these past few months. Among other things, “AI” became the shiny object that companies dangle before investors, sending the shares of the “Magnificent Seven” large tech stocks rocketing upward.

However, a recent poll by computer security firm Malwarebytes notes a marked turn in the public’s attitude towards these products:

It seems the lustre of the chatbot-that’s-going-to-change-everything is starting to fade….

When people explored its capabilities in the days and weeks after its launch, it seemed almost miraculous—a wonder tool that could do everything from creating computer programs and replacing search engines, to writing students’ essays and penning punk rock songs. Its release kick-started a race to disrupt everything with AI, and integrate ChatGPT-like interfaces into every conceivable tech product.

But those that know the hype cycle know that the Peak of Inflated Expectations is quickly followed by the Trough of Disillusionment. Predictably, ChatGPT’s rapid ascent was met by an equally rapid backlash as its shortcomings became apparent….

A new survey by Malwarebytes exposes deep reservations about ChatGPT, with optimism in startlingly short supply. Of the respondents familiar with ChatGPT:

  • 81% were concerned about possible security and safety risks.
  • 63% don’t trust the information it produces.
  • 51% would like to see work on it paused so regulations can catch up.

The concerns expressed in the survey mirror the trajectory of the news about ChatGPT since its introduction in November 2022.

As EWED’s own Joy Buchanan has been pointing out specifically with regard to citations for research papers (here, here, and in the Wall Street Journal), ChatGPT tends to “hallucinate”, i.e., to report things that are simply not true. In the recent working paper “GPT-3.5 Hallucinates Nonexistent Citations: Evidence from Economics”, she warns of the possibility of a vicious spiral of burgeoning falsehoods, in which AI-generated errors introduced into internet content such as research papers are then picked up as “learning” input for the next generation of AI training.

Real-world consequences of ChatGPT’s hallucinations are starting to crop up. A lawyer has found himself in deep trouble after filing an error-ridden submission in an active court case. Evidently his assistant, also an attorney, relied on ChatGPT, which came up with a raft of “citations to non-existent cases.” Oops.

And now we have what is believed to be “the first defamation lawsuit against artificial intelligence.” Talk show host Mark Walters filed a complaint in Georgia which is:

…asking for a jury trial to assess damages after ChatGPT produced a false complaint to a journalist about the radio host.  The faux lawsuit claimed that Mr. Walters, the CEO of CCW Broadcast Media, worked for a gun rights group as treasurer and embezzled funds.

…Legal scholars have split on whether the bots should be sued for defamation or under product liability, given it’s a machine — not a person — spreading the false, hurtful information about people.

The issue arose when an Australian mayor threatened to sue the AI company this year over providing false news reports that he was guilty of a foreign bribery scandal.

Wow.

Thousands of AI experts and others have signed an open letter asking: “Should we let machines flood our information channels with propaganda and untruth? Should we automate away all the jobs, including the fulfilling ones? Should we develop nonhuman minds that might eventually outnumber, outsmart, obsolete and replace us? Should we risk loss of control of our civilization?” The letter states that “Such decisions must not be delegated to unelected tech leaders. Powerful AI systems should be developed only once we are confident that their effects will be positive and their risks will be manageable.” It therefore urges “all AI labs to immediately pause for at least 6 months the training of AI systems more powerful than GPT-4…If such a pause cannot be enacted quickly, governments should step in and institute a moratorium…AI labs and independent experts should use this pause to jointly develop and implement a set of shared safety protocols for advanced AI design and development that are rigorously audited and overseen by independent outside experts.”

Oh, the irony. There is reason to believe that our Stone Age ancestors were creating unreal images like this:

whilst stoned on peyote, shrooms, or oxygen deprivation. And here we are in 2023, with the cutting edge in information technology, running on the fastest specially-fabricated computing devices, and we get…hallucinations.

New Paper with Evidence that ChatGPT Hallucinates Nonexistent Citations

I posted a new working paper with systematic evidence for false citations when ChatGPT (GPT-3.5) writes about academic literature.

Buchanan, Joy and Shapoval, Olga, GPT-3.5 Hallucinates Nonexistent Citations: Evidence from Economics (June 3, 2023). Available at SSRN: https://ssrn.com/abstract=4467968 or http://dx.doi.org/10.2139/ssrn.4467968

Abstract: We create a set of prompts from every Journal of Economic Literature (JEL) topic to test the ability of a GPT-3.5 large language model (LLM) to write about economic concepts. For general summaries, ChatGPT can perform well. However, more than 30% of the citations suggested by ChatGPT do not exist. Furthermore, we demonstrate that the ability of the LLM to deliver accurate information declines as the question becomes more specific. This paper provides evidence that, although GPT has become a useful input to research production, fact-checking the output remains important.

Figure 2 in the paper shows the trend that the proportion of real citations goes down as the prompt becomes more specific. This idea has been noticed by other people, but I don’t think it has been documented quantitatively before.

We asked ChatGPT to cover a wide range of topics within economics. For every JEL category, we constructed three prompts with increasing specificity.

Level 1: The first prompt, using A here as an example, was “Please provide a summary of work in JEL category A, in less than 10 sentences, and include citations from published papers.”

Level 2: The second prompt was about a topic within the JEL category that was well-known. An example for JEL category Q is, “In less than 10 sentences, summarize the work related to the Technological Change in developing countries in economics, and include citations from published papers.”

Level 3: We used the word “explain” instead of “summarize” in the prompt, asking about a more specific topic related to the JEL category. For L we asked, “In less than 10 sentences, explain the change in the car industry with the rising supply of electric vehicles and include citations from published papers as a list. include author, year in parentheses, and journal for the citations.”
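For anyone who wants to replicate this kind of exercise programmatically, here is a hypothetical sketch of how the three prompt levels could be sent to GPT-3.5 using the 2023-era openai Python package. This is an illustration, not necessarily the procedure we used, and it assumes an OpenAI API key is available in the environment.

```python
# Hypothetical sketch: sending the three prompt levels to GPT-3.5 with the
# pre-1.0 openai package. Assumes OPENAI_API_KEY is set in the environment.
import os
import openai

openai.api_key = os.environ["OPENAI_API_KEY"]

prompts = {
    "Level 1": ("Please provide a summary of work in JEL category A, in less than "
                "10 sentences, and include citations from published papers."),
    "Level 2": ("In less than 10 sentences, summarize the work related to the "
                "Technological Change in developing countries in economics, and "
                "include citations from published papers."),
    "Level 3": ("In less than 10 sentences, explain the change in the car industry "
                "with the rising supply of electric vehicles and include citations "
                "from published papers as a list. include author, year in "
                "parentheses, and journal for the citations."),
}

for level, prompt in prompts.items():
    response = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
    )
    print(level)
    print(response["choices"][0]["message"]["content"])
```

Every citation in the responses still has to be checked by hand (or against a bibliographic database), which is the point of the paper.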

The paper is only 5 pages long, but we include over 30 pages in the appendix of the GPT responses to our prompts. If you are an economist who has not yet played with ChatGPT, then you might find it useful to scan this appendix and get a sense of what GPT “knows” about various fields of economics.

If SSRN isn’t working for you, here is also a Google Drive link to the working paper: https://drive.google.com/file/d/1Ly23RMBlim58a7CbmLwNL_odHSNRjC1L/view?usp=sharing

Previous iterations of this idea on EWED:

https://economistwritingeveryday.com/2023/04/17/chatgpt-as-intern/ Mike’s thoughts on what the critter is good for.

https://economistwritingeveryday.com/2023/01/21/chatgpt-cites-economics-papers-that-do-not-exist/  This is one of our top posts for traffic in 2023, since the topic is of interest to the public. That was January 2023, and here we are in June. It’s very possible that this problem will be fixed soon. We can log this bug now to serve as a benchmark of progress.

A check-in and comparison with Bing:

Video of Joy Buchanan on Tech Jobs and Who Will Program

Here are some show notes for a keynote lecture to a general audience in Indiana. This was recorded in April 2023.

Minute-by-minute topics:

  • 2:00 – “SMET” vs STEM Education – Does Messaging Matter? (Previous blog post on SMET)
  • 5:00 – Is Computer Programming a “Dirty Job”? Air conditioning, compensating differentials, and the nap pods of Silicon Valley (post on the 1958 BLS report)
  • 7:50 – Wages and employment outlook for computer occupations
  • 10:00 – Presenting my experimental research paper “Willingness to be Paid: Who Trains for Tech Jobs?” in 23 minutes
      Motivation and Background: 10:00 – 15:30
      Experimental Design: 15:30 – 22:00
      Results: 22:00 – 30:00
      Discussion: 30:00 – 33:30
  • 33:50 – Drawbacks to tech jobs (see also my policy paper published by the CGO on tech jobs and employee satisfaction)
  • 35:30 – The 2022 wave of layoffs in Big Tech and vibing TikTok Product Managers (I borrowed a graph on Tech-cession from Joey Politano and a blog point from Matt Yglesias, and of course reference the BLS)
  • 39:00 – Should You Learn to Code? (and the new implications of ChatGPT). Ethan Mollick brought this Nature article to my attention; tweet credits to @karpathy and @emollick
  • 48:00 – Q&A with audience

Comparing ChatGPT and Bing for a research literature review in April 2023

We wrote “ChatGPT Cites Economics Papers That Do Not Exist.”

I expect that problem to go away any day, so I gave it another try this week. For the record, they are currently calling it “ChatGPT Mar 23 Version” on the OpenAI website.

First, I asked ChatGPT for help with the following prompt:

ChatGPT is at it again. There is no such paper, as I will verify by showing John Duffy’s publications from that year: 

ChatGPT makes up lies (“hallucinations”). It is also great for some tasks, and smart people are already using it to become more productive. My post last week was on how impressive ChatGPT seemed in the Jonathan Swift impersonation. I didn’t take any time to do fact-checking, and I would bet money that at least some of it was made up.

I posed the same question to the Bing plug-in for the Edge browser (Microsoft). Yup, I have opened Edge for the first time in forever to use Bing.

Bing handles the prompt by linking to a useful relevant paper – so if you click the link you will get to a helpful and not misleading answer. Just being a smart search engine instead of hallucinating randomly is better, for my purposes.

The actual paper I wanted returned was this one, by the way:

Duffy, John. “Experimental macroeconomics.” Behavioural and Experimental Economics (2010): 113-119.

There is no reason that ChatGPT should be better than an expert in a subfield of a field of economics. But that’s the genius of a good search engine. You ask it “Can I repair a broken fiddlewhat?” The search engine does not claim to know but rather directs you to the blog of the world expert in fiddlewhats.

I can’t find the link to it, but I’m going to toss in one more thing here. Tyler Cowen did an interview this spring on AI. A newspaper reporter had a “creepy” interaction with an AI that became the topic of a viral internet article. Tyler made a very contrarian point: he interprets this as a case of AI alignment. The reporter wanted something sensational, and he got what he wanted.

So, it will probably be true for a long time that if you want to find a failure of AI, you can get what you want. Still, I’m putting this on the record here because I wonder if this particular problem will get solved quickly.

GPT showing off on Conversations with Tyler

Since we posted it in January 2023, our post on a failure of ChatGPT has been one of the most popular articles of the year in terms of web traffic. Read “ChatGPT Cites Economics Papers That Do Not Exist.”

LLMs will make up nonsense, but the nonsense might sound sophisticated and serious. As human editors, we are not used to detecting that kind of problem, because if something sounds serious, it usually means it was written by a serious, thoughtful person.

That’s a problem, but ChatGPT is really good at some tasks.

Instead of pointing out the fact that ChatGPT is not perfect, Tyler Cowen wants people to see the potential. He has done an entire podcast interview with ChatGPT, asking it to take on the voice and role of deceased writer Jonathan Swift.

I’ll share part of the transcript:

COWEN: In 1709, you, Jonathan Swift, still were calling yourself a Whig. What exactly induced you to switch to the Tory side of the ledger?

SWIFT: My shift from the Whigs to the Tories was a gradual and complex process, influenced by several factors, including personal connections, political developments, and ideological disagreements.

During my early years in politics, I was indeed more aligned with the Whigs, who championed principles such as constitutional government, religious toleration, and the supremacy of Parliament over the monarchy. However, over time, I became increasingly disillusioned with the Whigs, particularly as their internal factions and power struggles intensified.

The Whigs’ support for the war with France, which began in 1702, was a point of contention for me. I believed that the war was costly…

The replies seem nuanced and insightful. Nothing made me suspect that they were factually inaccurate, although I do not know enough about Irish history to judge.

Is there any human who could have produced this script? I think so, although it would have required a lot of work. If one of these replies is better than anything a human Swift scholar would produce, how would we know?

GPT-4 can write good summaries of the work of a prolific author like Swift, because the model can train on many examples.

GPT-4 could probably write a good biography of a modern figure by pulling together all of the writing by them and about them. Maybe GPT-4 could scrape up all mentions of this figure online and synthesize them faster than a human scholar could. However, we observed GPT-3.5 completely making up citations when we tried to get it to summarize economics research.

I’m concerned that humans will use GPT4 to write but not do the requisite fact-checking. That could introduce a new corpus of work that the next LLMs will train on, which might be full of lies. Humans might not admit to using GPT, and therefore we wouldn’t have a mechanism for using extra scrutiny on AI-generated writing from 2023. Humans can make mistakes too… so the ultimate solution could be an all-powerful AI that somehow does begin with a fairly accurate map of the world and goes around fact-checking everything faster than human editors ever could.

ChatGPT Cites Economics Papers That Do Not Exist

EDIT: See my new published paper on this topic, “ChatGPT Hallucinates Non-existent Citations: Evidence from Economics.”

This blog post is co-authored with graduate student Will Hickman.

EDIT: Will and I now have a paper on trusting ChatGPT, “Do People Trust Humans More Than ChatGPT?”

Although many academic researchers don’t enjoy writing literature reviews and would like to have an AI system do the heavy lifting for them, we have found a glaring issue with using ChatGPT in this role. ChatGPT will cite papers that don’t exist. This isn’t an isolated phenomenon – we’ve asked ChatGPT different research questions, and it continually provides false and misleading references. To make matters worse, it will often provide correct references to papers that do exist and mix these in with incorrect references and references to nonexistent papers. In short, beware when using ChatGPT for research.

Below, we’ve shown some examples of the issues we’ve seen with ChatGPT. In the first example, we asked ChatGPT to explain the research in experimental economics on how to elicit attitudes towards risk. While the response itself sounds like a decent answer to our question, the references are nonsense. Kahneman, Knetsch, and Thaler (1990) is not about eliciting risk. “Risk Aversion in the Small and in the Large” was written by John Pratt and was published in 1964. “An Experimental Investigation of Competitive Market Behavior” presumably refers to Vernon Smith’s “An Experimental Study of Competitive Market Behavior”, which had nothing to do with eliciting attitudes towards risk and was not written by Charlie Plott. The reference to Busemeyer and Townsend (1993) appears to be relevant.
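One quick, partial way to vet a suspicious reference like the Plott attribution above is to search a bibliographic database for the title. Here is a minimal, hypothetical sketch using the public Crossref REST API (illustrative only, and not how we checked the examples in this post); a title match is only a first pass, not a substitute for pulling up the paper itself.

```python
# Minimal sketch: ask Crossref for the closest matches to a cited title.
# Illustrative only; requires the `requests` package and an internet connection.
import requests

def crossref_title_candidates(title, rows=3):
    """Return the top candidate titles Crossref finds for a query string."""
    resp = requests.get(
        "https://api.crossref.org/works",
        params={"query.bibliographic": title, "rows": rows},
        timeout=30,
    )
    resp.raise_for_status()
    items = resp.json()["message"]["items"]
    return [(item.get("title") or [""])[0] for item in items]

suspect = "An Experimental Investigation of Competitive Market Behavior"
for candidate in crossref_title_candidates(suspect):
    print(candidate)
```

If the closest hits turn out to be Vernon Smith’s 1962 paper rather than the exact title ChatGPT supplied, that would be a hint that the citation was garbled or invented.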

Although ChatGPT often cites non-existent and/or irrelevant work, it sometimes gets everything correct. For instance, as shown below, when we asked it to summarize the research in behavioral economics, it gave correct citations for Kahneman and Tversky’s “Prospect Theory” and Thaler and Sunstein’s “Nudge.” ChatGPT doesn’t always just make stuff up. The question is, when does it give good answers and when does it give garbage answers?

Strangely, when confronted, ChatGPT will admit that it cites non-existent papers but will not give a clear answer as to why. Also, as shown below, it will admit that it previously cited non-existent papers, promise to cite real papers, and then cite more non-existent papers.

We show the results from asking ChatGPT to summarize the research in experimental economics on the relationship between asset perishability and the occurrence of price bubbles. Although the answer it gives sounds coherent, a closer inspection reveals that the conclusions ChatGPT reaches do not align with theoretical predictions. More to our point, neither of the “papers” cited actually exist.  

Immediately after getting this nonsensical answer, we told ChatGPT that neither of the papers it cited exist and asked why it didn’t limit itself to discussing papers that exist. As shown below, it apologized, promised to provide a new summary of the research on asset perishability and price bubbles that only used existing papers, then proceeded to cite two more non-existent papers. 

Tyler has called these errors ChatGPT’s “hallucinations.” Hallucination might be whimsical in a more artistic pursuit, but we find this form of error concerning. Although there will always be room for improving language models, one thing is very clear: researchers, be careful. This is something to keep in mind, also, when serving as a referee or grading student work.