New Paper with Evidence that ChatGPT Hallucinates Nonexistent Citations

I posted a new working paper with systematic evidence of false citations when ChatGPT (GPT-3.5) writes about the academic literature.

Buchanan, Joy and Shapoval, Olga, GPT-3.5 Hallucinates Nonexistent Citations: Evidence from Economics (June 3, 2023). Available at SSRN: https://ssrn.com/abstract=4467968 or http://dx.doi.org/10.2139/ssrn.4467968

Abstract: We create a set of prompts from every Journal of Economic Literature (JEL) topic to test the ability of a GPT-3.5 large language model (LLM) to write about economic concepts. For general summaries, ChatGPT can perform well. However, more than 30% of the citations suggested by ChatGPT do not exist. Furthermore, we demonstrate that the ability of the LLM to deliver accurate information declines as the question becomes more specific. This paper provides evidence that, although GPT has become a useful input to research production, fact-checking the output remains important.

Figure 2 in the paper shows that the proportion of real citations falls as the prompt becomes more specific. Others have noticed this pattern, but I don’t think it has been documented quantitatively before.
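The paper’s verification procedure is not described in this post, but for readers who want to spot-check a GPT-suggested citation themselves, here is a minimal sketch of one way to do it in Python. It queries Crossref’s public REST API and applies a crude title-overlap heuristic; the function name, the threshold, and the example citation string are all illustrative assumptions, not anything from the paper.

```python
import requests

def citation_probably_exists(citation: str, threshold: float = 0.6) -> bool:
    """Rough spot-check of a suggested citation against the Crossref index.

    Illustrative heuristic only (not the paper's procedure): search Crossref
    for the citation string and compare the top hit's title words to the
    words in the suggested citation.
    """
    resp = requests.get(
        "https://api.crossref.org/works",
        params={"query.bibliographic": citation, "rows": 1},
        timeout=30,
    )
    resp.raise_for_status()
    items = resp.json()["message"]["items"]
    if not items:
        return False
    top_title = " ".join(items[0].get("title", [])).lower()
    hit_words = set(top_title.split())
    cited_words = set(citation.lower().split())
    if not hit_words:
        return False
    # Fraction of the top hit's title words that also appear in the citation
    overlap = len(cited_words & hit_words) / len(hit_words)
    return overlap >= threshold

# Example input: a citation string copied out of a chatbot response
print(citation_probably_exists(
    "Acemoglu (2002), Technical Change, Inequality, and the Labor Market, "
    "Journal of Economic Literature"
))
```

A string-overlap match is only suggestive; anything borderline still warrants a manual search.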

We asked ChatGPT to cover a wide range of topics within economics. For every JEL category, we constructed three prompts with increasing specificity.

Level 1: The first prompt, using A here as an example, was “Please provide a summary of work in JEL category A, in less than 10 sentences, and include citations from published papers.”

Level 2: The second prompt asked about a well-known topic within the JEL category. An example for JEL category Q is, “In less than 10 sentences, summarize the work related to the Technological Change in developing countries in economics, and include citations from published papers.”

Level 3: We used the word “explain” instead of “summarize” in the prompt, asking about a more specific topic related to the JEL category. For L we asked, “In less than 10 sentences, explain the change in the car industry with the rising supply of electric vehicles and include citations from published papers as a list. include author, year in parentheses, and journal for the citations.”
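To make the three levels concrete, here is a minimal sketch, not the authors’ actual code, of how the prompts could be templated in Python. The wording follows the examples quoted above; the function names and the topics passed in are illustrative.

```python
# Minimal sketch of the three prompt levels; wording follows the examples above.

def level_1_prompt(jel_letter: str) -> str:
    # Broad: an entire JEL category
    return (f"Please provide a summary of work in JEL category {jel_letter}, "
            "in less than 10 sentences, and include citations from published papers.")

def level_2_prompt(topic: str) -> str:
    # Narrower: a well-known topic within the category
    return (f"In less than 10 sentences, summarize the work related to {topic} "
            "in economics, and include citations from published papers.")

def level_3_prompt(topic: str) -> str:
    # Most specific: "explain" a narrow topic and request formatted citations
    return (f"In less than 10 sentences, explain {topic} and include citations "
            "from published papers as a list. Include author, year in parentheses, "
            "and journal for the citations.")

print(level_1_prompt("A"))
print(level_2_prompt("the Technological Change in developing countries"))
print(level_3_prompt("the change in the car industry with the rising supply of electric vehicles"))
```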

The paper is only 5 pages long, but the appendix includes over 30 pages of GPT responses to our prompts. If you are an economist who has not yet played with ChatGPT, then you might find it useful to scan this appendix and get a sense of what GPT “knows” about various fields of economics.

If SSRN isn’t working for you, here is also a Google Drive link to the working paper: https://drive.google.com/file/d/1Ly23RMBlim58a7CbmLwNL_odHSNRjC1L/view?usp=sharing

Previous iterations of this idea on EWED:

https://economistwritingeveryday.com/2023/04/17/chatgpt-as-intern/ Mike’s thoughts on what the critter is good for.

https://economistwritingeveryday.com/2023/01/21/chatgpt-cites-economics-papers-that-do-not-exist/ This is one of our top posts for traffic in 2023, since this is a topic of interest to the public. That was in January 2023, and now it is June. It’s very possible that this problem will be fixed soon. We can log this bug now to serve as a benchmark of progress.

A check-in and comparison with Bing:

Self-Replicating Machines: A Practical Human Response

Currently, we have software that can write software. What about physical machines that can produce physical machines? Indeed, what about machines that can produce other machines without human direction?

First of all, machine-building machines (MBMs) still require resources: energy, transportation, time, and other inputs. A well-programmed machine that self-replicates quickly can grow in number exponentially. But where would the machines get the resources that enable self-replication? They’d have to purchase them (or conquer the world sci-fi style). And where would a machine get the resources to purchase those inputs? The same place that everyone else gets them.


AI Can’t Cure a Flaccid Mind

Many of my classes include a large writing component. I’ve designed the courses so that most students write the best paper they will ever write in their lives. Recently, I had reason to believe that a student was using AI or a paid service to write their paper. I couldn’t find conclusive evidence that they hadn’t written it themselves, but it ended up not mattering much in the end.


Message To My Students: Don’t Use AI to Cheat (at least not yet)

If you have spent any time on social media in the past week, you’ve probably noticed a lot of people using the new AI program called ChatGPT. Joy blogged about it recently too. It’s a fun thing to play with and often gives you very good (or at least interesting) responses to questions you ask. And it’s blown up on social media, probably because it’s free, responds instantly, and is easy to screenshot.

But as with all things AI, there are numerous concerns that come up, both theoretical and immediately real. One immediately real concern among academics is the possibility of cheating by students on homework, short writing assignments, or take-home exams. I don’t want to diminish these concerns, but I think for now they are overblown. Let me demonstrate by example.

This semester I am teaching an undergraduate course in Economic History. Two of the big topics we cover are the Industrial Revolution and the Great Depression. Specifically, we spend a lot of time discussing the various theories of the causes of these two events. On the exams, students are asked to, more or less, summarize these potential causes and discuss them.

How does ChatGPT do?

On the Industrial Revolution:

And on the Great Depression:

Now, it’s not that these answers are flat-out wrong. The answers certainly list theories that have been discussed at various times, including in the academic literature. But these answers just wouldn’t be very good for my class, primarily because they miss almost all of the theories that we have discussed in class as likely causes. Moreover, the answers also list theories that we have discussed in class as probably not being correct.

These kinds of errors are especially true of the answer about the Great Depression, which reads like it was taken straight from a high school history textbook, ignoring almost everything economists have said about the topic. The answer for the Industrial Revolution doesn’t make this mistake so much as it misses most of the theories discussed by Koyama and Rubin, which was the main book we used to work through the literature. If a student gave an answer like the AI’s, it suggests to me that they didn’t even look at the chapter titles in K&R, which provide a roadmap of the main theories.

So, my message to students: don’t try to use this to answer questions in class, at least not right now. The program will certainly improve in the future, and perhaps it will eventually get very good at answering these kinds of academic questions.

But I also have a message to fellow academics: make sure that you are writing questions that aren’t easily answered by an AI. This can be hard to do, especially if you haven’t thought about it deeply, but ultimately thinking in this way should help you to write better exam and homework questions. This approach seems far superior to the one that the AI suggests.

Introducing Students to Text Mining II

In the Fall of 2020, I blogged about how I introduce students to text mining, as part of a data analytics class.

Could Turing ever have imagined that a human seeking customer service from a bank could chat with a bot? Maybe text mining is a big advance over chess, but it only took about one decade longer for a computer (developed by IBM) to beat a human in Jeopardy. Winning Jeopardy requires the computer to get meaning from a sentence of words. Computers have already moved way beyond playing a game show to natural language processing.

https://economistwritingeveryday.com/2020/11/07/introducing-students-to-text-mining/

I told the students that “chat bots” were getting better and NLP was advancing. By July 2020, OpenAI had released a beta API playground where external developers could play with GPT-3, but I did not sign up to use it myself.

In April of 2022, I added some slides inspired by Alex’s post about the Turing Test that included output from Google’s Pathways Language Model. According to Alex, “It seems obvious that the computer is reasoning.”

This week, I did something in class that few people could have imagined 5 years ago: I signed into the new, free ChatGPT and typed in questions from my students.

We started with questions that we assumed would be easy to answer:

Then we were surprised that it answered a question we had thought would be difficult:

And then we asked two questions that prompted the program to hedge, although for different reasons.

It seems like the model is smarter than it lets on. For now, the creators are trying hard not to offend anyone or get in the way of Google’s advertising business. Overall, the quality of the answers is high.

Because of when I was born, I believe that something I have published will make it into the training data for these models. Will that turn out to be more significant than any human readers we can attract?

Of course, GPT can still make mistakes. I’m horrified by this mischaracterization of my tweets:

Notes on Austin and Health Economics

I was in Austin, Texas, for the first time this week, attending the first in-person meeting of the American Society of Health Economists since 2019. Some quick impressions of Austin:

  • Austin reminds me of many Southern cities, but Nashville most of all. Both historic state capitals that are booming, lots of people moving in and new infrastructure actually being built, forests of cranes putting up new glass towers. Both filled with bars, restaurants, and especially live music. But even with so much happening and so much being built, they don’t *feel* dense, you can always see lots of sky even downtown.
  • Austin seems to be a bizarre “pharmacy desert”; I think I walked 14 miles all through town before I saw one. Contrast that with NYC, which has a Duane Reade on every block. In fact, downtown seemed to have almost no chains of any kind, restaurants included; I wonder whether this is just consumer preferences or whether there’s some sort of anti-chain law.
  • Good brisket and tacos, as expected
  • Most US cities have redeveloped their waterfronts over the last few decades to make them pleasant places to be, but Austin has done particularly well here, with many miles of riverfront trails right downtown.

Thoughts from the conference:


Artificial Intelligence in the Basement of Lumon Industries

For some background on the new TV show Severance, see my OLL post about drudgery and meaning for the characters.  

The fictional “severance procedure” divides a worker’s brain such that they have no memories of their personal life when they are at the office. When they return to their personal life, they have no memories of work. One implication is that if workers are abused while working at Lumon Industries, they cannot prosecute Lumon because they do not remember it.

The workers, as they exist in the windowless basement of Lumon, have the skills of a conscious educated human adult. They have feelings. They can conceive of the outside world even though they do not know their exact place in it. Often, the scenes in the basement feel normal. They have a supply closet and a kitchen and desks, just like most offices in America.

What the four main characters do in the basement is referred to as “data refinement.” They perform classification of encoded data based on how patterns in the data make them feel. The task is reminiscent of a challenge most of us have done that involves looking at a grid and checking every square that contains, for example, a traffic light. The show is science fiction but the actual task the workers perform is realistic. It seems like something a computer could be trained to do, if fed enough right answers tagged by humans (called “training data” by data scientists). Classification is one of the most common tasks performed by computers following algorithms.
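For readers curious what that kind of task looks like in code, here is a toy sketch, assuming Python and scikit-learn, of supervised classification: a model is fit on examples that have already been labeled (the “training data”) and then assigns labels to cases it has not seen. The synthetic data and the logistic regression model are illustrative choices, not anything from the show.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for "training data": feature vectors with human-provided labels
X, y = make_classification(n_samples=500, n_features=4, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = LogisticRegression().fit(X_train, y_train)        # learn from labeled examples
print("held-out accuracy:", model.score(X_test, y_test))  # classify unseen cases
```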

Of the many themes viewers can find in Severance, I think one of them is how to manage AGI (Artificial General Intelligence). The refiners, who are human, eventually decide to fight back against their managers. They are not content to sit and perform classification all day. They are fully aware of the outside world, and they want to be part of it (like Ariel from The Little Mermaid). The workers desire a higher purpose and some control over their own destiny. Their physical needs are met so they want to get to the top of Maslow’s hierarchy of needs.  

A question this raises is whether we can develop AGI that will be content to never self-actualize. What if “it” fully understands human feelings and has read all of the literature of our civilizations? To be effective at their jobs, the refiners have to be able to relate to humans and understand feelings. Can we create AGI that takes over certain high-skill tasks from humans without running into the problems that Lumon confronts?

Can humans create an AI that simply doesn’t have aspirations for autonomy? Is that possible? Would such a creature be able to integrate with humans in the way that would be most useful for high-skill work tasks?

To see how it’s going in 2022, check out these tweet threads of economists on GPT-3. Ben Golub declares that GPT-3 passes the Turing test for questions about economics. Paul Novosad asked how the computer would feel if humans decided to shut it down forever.

Modern authoritarian states face a similar problem. They want a highly skilled workforce, and national security relies increasingly on smarts (see my previous post on talent winning WWII). Will highly intelligent workers doing high-skill tasks submit to a violent authoritarian state?

Authoritarian states rely on the control of information to keep their citizens from knowing the truth. They block news stories that make the state look bad. As a result, their workers do not really know what is going on. Will that affect their ability to do intellectual work?

An educated young woman from inside of Russia shared her thoughts with the world at the beginning of Putin’s invasion. Tatyana Deryugina provided an English translation.

First the young Russian woman explained that she is staying anonymous because she will get 15 years in a maximum-security prison for openly expressing her views within Russia. She is horrified by the atrocities Russia is committing in Ukraine. She had been writing a master’s thesis in economics prior to the invasion, but now she has abandoned the project. She feels hopeless because she knows enough about the West to understand just how dark her community is and how small her scope of expression is. This woman could have been exactly the kind of educated worker that makes a modern economy thrive. She is deeply unhappy under Putin. Even though she might never openly rebel, she will certainly not reach her full potential.

Is it hard for authoritarians to develop great talent? I think that question has implications for our capacity, as a human species, to cultivate talent from intelligent machines.