ChatGPT Cites Economics Papers That Do Not Exist

EDIT: See my new published paper on this topic, “ChatGPT Hallucinates Non-existent Citations: Evidence from Economics.”

This blog post is co-authored with graduate student Will Hickman.

EDIT: Will and I now have a paper on trusting ChatGPT, “Do People Trust Humans More Than ChatGPT?”

Although many academic researchers don’t enjoy writing literature reviews and would like to have an AI system do the heavy lifting for them, we have found a glaring issue with using ChatGPT in this role. ChatGPT will cite papers that don’t exist. This isn’t an isolated phenomenon – we’ve asked ChatGPT different research questions, and it continually provides false and misleading references. To make matters worse, it will often provide correct references to papers that do exist and mix these in with incorrect references and references to nonexistent papers. In short, beware when using ChatGPT for research.

Below, we’ve shown some examples of the issues we’ve seen with ChatGPT. In the first example, we asked ChatGPT to explain the research in experimental economics on how to elicit attitudes towards risk. While the response itself sounds like a decent answer to our question, the references are nonsense. Kahneman, Knetsch, and Thaler (1990) is not about eliciting risk. “Risk Aversion in the Small and in the Large” was written by John Pratt and was published in 1964. “An Experimental Investigation of Competitive Market Behavior” presumably refers to Vernon Smith’s “An Experimental Study of Competitive Market Behavior”, which had nothing to do with eliciting attitudes towards risk and was not written by Charlie Plott. The reference to Busemeyer and Townsend (1993) appears to be relevant.

Although ChatGPT often cites non-existent and/or irrelevant work, it sometimes gets everything correct. For instance, as shown below, when we asked it to summarize the research in behavioral economics, it gave correct citations for Kahneman and Tversky’s “Prospect Theory” and Thaler and Sunstein’s “Nudge.” ChatGPT doesn’t always just make stuff up. The question is, when does it give good answers and when does it give garbage answers?

Strangely, when confronted, ChatGPT will admit that it cites non-existent papers but will not give a clear answer as to why it cites non-existent papers. Also, as shown below, it will admit that it previously cited non-existent papers, promise to cite real papers, and then cite more non-existent papers. 

We show the results from asking ChatGPT to summarize the research in experimental economics on the relationship between asset perishability and the occurrence of price bubbles. Although the answer it gives sounds coherent, a closer inspection reveals that the conclusions ChatGPT reaches do not align with theoretical predictions. More to our point, neither of the “papers” cited actually exists.

Immediately after getting this nonsensical answer, we told ChatGPT that neither of the papers it cited exists and asked why it didn’t limit itself to discussing papers that exist. As shown below, it apologized, promised to provide a new summary of the research on asset perishability and price bubbles that only used existing papers, then proceeded to cite two more non-existent papers.

Tyler has called these errors “hallucinations” of ChatGPT. They might be whimsical in a more artistic pursuit, but we find this form of error concerning. Although there will always be room for improving language models, one thing is very clear: researchers, be careful. This is also something to keep in mind when serving as a referee or grading student work.

Waxing Crescent: New Orleans 2013-2023

The scars of Hurricane Katrina were still obvious eight years afterward when I moved to New Orleans in 2013. Where I lived in Mid-City, it seemed like every block had an abandoned house or an empty lot, and the poorer neighborhoods had more than one per block. Even many larger buildings were left abandoned, including high-rises.

Since then, recovery has continued at a steady pace. The rebuilding was especially noticeable when I spent a few days there recently for the first time since moving away in 2017. The airport has been redone, with shining new connected terminals and new shops. The abandoned high-rise at the prime location where Canal St meets the Mississippi has been renovated into a Four Seasons. Tulane Ave is now home to a nearly mile-long medical complex, stretching from the old Tulane hospital to the new VA and University Medical Center complex. There are several new mid-sized health care facilities, but most striking is that Tulane claims to finally be renovating the huge abandoned Charity Hospital:

Old Charity Hospital, January 2023

The new VA hospital opened in 2016 as mostly new construction, but they’ve now managed to fully incorporate the remnants of the abandoned Dixie Beer brewery:

VA Hospital incorporating old Dixie Beer tower, January 2023

Dixie Beer itself opened a new beer garden in New Orleans East, and just renamed itself Faubourg Brewery. Some streets named for Confederates have also been renamed, though you can still see plenty of signs of the past, like the “Jeff Davis Properties” building on the street renamed from Jefferson Davis Parkway to Norman C Francis Parkway.

Other big additions I noticed are the new Children’s Museum and the greatly expanded sculpture garden in City Park:

Of course, even with all the improvements, many problems remain, both in terms of things that still haven’t recovered from the hurricane, and the kind of problems that were there even before Katrina. The one remaining abandoned high-rise, Plaza Tower, was actually abandoned even before Katrina.

My overall impression is that large institutions (university medical centers, the VA, the airport, museums, major hotels) have been driving this phase of the recovery. The neighborhoods are also recovering, but more slowly, particularly small business. Population is still well below 2005 levels. I generally think inequality has been overrated in national discussions of the last 15 years relative to concerns about poverty and overall prosperity, but even to me New Orleans is a strikingly unequal city; there’s so much wealth alongside so many people seeming to get very little benefit from it.

The most persistent problems are the ones that remain from before Katrina: the roads, the schools, and the crime; taken together, the dysfunctional public sector. Everywhere I’ve lived people complain about the roads, but I’ve lived a lot of places and New Orleans roads were objectively the worst, even in the nice parts of town, and it isn’t close. The New Orleans Police Department is still subject to a federal consent decree, as it has been since 2012. The murder rate in 2022 was the highest in the nation. Building an effective public sector seems to be much harder than rebuilding from a hurricane.

As much as things have changed since 2013, my overall assessment of the city remains the same: it’s unlike anywhere else in America. It is unparalleled in both its strengths and its weaknesses. If you care about food, drink, music, and having a good time, it’s the place to be. If you’re focused more on career, health, or safety, it isn’t. People who fled Katrina and stayed in other cities like Houston or Atlanta wound up richer and healthier. But not necessarily happier.

Highlights from ASSA 2023

I expected the meetings would shrink, but I was still surprised by how much they did:

That said, I mostly didn’t notice the smaller numbers on the ground, because most of the missing people are those on the job market, who used to spend most of their time shut away doing interviews anyway. There was still a huge variety of sessions and most seemed well-attended. ASSA is also still unparalleled for pulling in top names to give talks; I got to talk to Nobel laureate Roger Myerson at a reception. But there may be a trend of the big names being more likely to stay remote:

The big problem with attendance falling to 6k is that they’ve planned years’ worth of meetings on the assumption of 12k+ attendance. Getting one year further from Covid and dropping mask and vaccine mandates might help some, but the core issue is that first-round job interviews have gone remote and aren’t coming back. The best solution I can think of is raising the acceptance rate for papers, which in recent history has been well under 20%.

In terms of the actual economic research, two sessions stood out to me:

How many factors are there in the stock market? Classic work by Fama and French argues for 3 (size, value, and market risk), but the finance literature as a whole has identified a “zoo” of over 500. Two papers presented one after the other at ASSA argued for two extremes. “Time Series Variation in the Factor Zoo” argues that the number of factors varies over time, but is quite high, typically over 20 and sometimes over 100:

In contrast, “Three Common Factors” argues that there really are just 3 factors, though they are latent and not the same as the Fama-French 3 factors. In this case, the whole zoo of factors in the literature is mostly non-robust results driven by p-hacking and a desire to find more factors (fortune and fame potentially await those who do). Overall these asset pricing papers make me want to look into all this myself; when reading them I’m always struck by an odd mix of reactions: “I don’t understand that”, “why would you do it that way, it seems wrong and unnecessarily complicated”, and “why didn’t the field settle such a seemingly basic question decades ago?”.

Hayek: A Life. This session covered the new book by Bruce Caldwell (who taught me much of what I know of the history of economic thought) and Hansjoerg Klausinger. Discussants Emily Skarbek and Stephen Durlauf agreed it is surprisingly readable for a long work of original scholarship, calling it a beautifully written 800-page page-turner. Vernon Smith asked Caldwell if Hayek read The Theory of Moral Sentiments. Caldwell: “he cited it.” Smith: “but did he read it? Seems like he didn’t understand it very well.” Caldwell agreed he may not have, or if he did it was a German translation.

Vernon Smith’s own talk featured great comments on market instability: instability in markets comes from retrading. Markets are stable when consumers just value goods for their use, like haircuts and hamburgers. The craziness and potential for bubbles and crashes comes in when people are thinking about reselling something, whether it be tulips, stocks, houses, or crypto.

I asked Bruce Caldwell at a reception how he was able to finish writing such a big book that involved lots of archival work and original research. He said “one chapter at a time”, and noted that it’s fine to write the easiest chapters first to get the ball rolling.

Overall, while ASSA is diminished from the pre-Covid days and I often disagree with the AEA’s decisions, it’s still a top-tier conference, especially when in New Orleans.

College Major, Marriage, and Children Update

In a May post I described a paper my student had written on how college majors predict the likelihood of being married and having children later in life.

Since then I joined the paper as a coauthor and rewrote it to send to academic journals. I’m now revising it to resubmit to a journal after referee comments. The best referee suggestion was to move our huge tables to an appendix and replace them with figures. I just figured out how to do this in Stata using coefplot, and wanted to share some of the results:

Points represent marginal effects of coefficient estimates from Logit regressions estimating the effect of college major on marriage rates relative to non-college-graduates. All regressions control for sex, race, ethnicity, age, and state of residence. MarriedControls additionally controls for personal income, family income, employment status, and number of children. Married (blue points) includes all adults, others include only 40-49 year-olds. Lines through points represent 95% confidence intervals.

Points represent coefficient estimates from Poisson regressions estimating the effect of college major on the number of children in the household relative to non-college-graduates. All regressions control for sex, race, ethnicity, age, and state of residence. ChildrenControls additionally controls for personal income, family income, employment status, and number of children. Children (blue points) includes all adults, others include only 40-49 year-olds. Lines through points represent 95% confidence intervals.
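For readers who have not worked with coefficient plots before, here is a minimal Python sketch of the arithmetic behind each point and the line through it. It is only an illustration under stated assumptions: the labels, estimates, and standard errors are made-up placeholders (not results from our paper), the interval uses the usual normal approximation, and the actual figures were produced with coefplot in Stata rather than with this code.

```python
# Sketch: turn (estimate, standard error) pairs into the 95% intervals
# that a coefficient plot draws as a line through each point.
# All numbers below are hypothetical placeholders, not results from the paper.

Z_95 = 1.96  # normal critical value for a two-sided 95% interval

ESTIMATES = [
    # (label, point estimate, standard error)
    ("Major A", 0.10, 0.02),
    ("Major B", -0.03, 0.04),
]


def confidence_interval(estimate, std_error, z=Z_95):
    """Return (lower, upper) bounds of a normal-approximation interval."""
    half_width = z * std_error
    return estimate - half_width, estimate + half_width


def excludes_zero(lower, upper):
    """True when the whole interval lies on one side of zero."""
    return lower > 0 or upper < 0


def summarize(rows):
    """Return (label, estimate, lower, upper, excludes_zero) for each row."""
    summary = []
    for label, estimate, std_error in rows:
        lower, upper = confidence_interval(estimate, std_error)
        summary.append((label, estimate, lower, upper, excludes_zero(lower, upper)))
    return summary


if __name__ == "__main__":
    for label, estimate, lower, upper, clear in summarize(ESTIMATES):
        note = "interval excludes zero" if clear else "interval includes zero"
        print(f"{label}: {estimate:+.3f} [{lower:+.3f}, {upper:+.3f}] ({note})")
```

When reading a plot like this, a line that crosses zero means the estimate for that major cannot be distinguished from the reference group (here, non-college-graduates) at the 5% level.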

Many details have changed since Hannah’s original version, and a lot depends on the exact specification used. But 3 big points from the original paper still stand:

  1. Almost all majors are more likely to be married than non-college-graduates
  2. The association of college education with childbearing is more mixed than its almost-uniformly-positive association with marriage
  3. College education is far from uniform; differences between some majors are larger than the average difference between college graduates and non-graduates

Empirical Papers for Undergraduate Statistics Students

Once undergraduates have learned the basics of interpreting regression results, we would like to introduce them to the world of economics research papers. Reading these papers will help reinforce the statistical concepts, and we also want them to gain access to the insights in the literature.

Many empirical papers in economics are too long or too difficult to assign to undergraduates, especially if the course is focused more on analytics than economics specifically. Here I provide materials and instructions for teaching two published econ articles to undergraduates. Assume the students have learned the basics of interpreting a regression model (perhaps from a course textbook) but have had few opportunities to apply these skills or engage with scientific literature.

“The Effects of Attendance on Student Learning in Principles of Economics” is only 4 pages long! Students do not need to read past page 7 of “My Reference Point, Not Yours” to answer the reading guide questions. So, these readings can be assigned outside of class, but I did some of the reading during our class period.

Handing out printed copies of at least one of the papers and my guided questions can make a good classroom activity. If students do not have experience reading tables of regression results, it can be useful to do it together in person.

The questions in the reading guide help students to identify the main variables and hypotheses. Then, students are asked to pull specific results from the tables in the papers. You can customize this list of questions by deleting lines if you do not want to discuss issues like non-linear effects or the null hypothesis.

I provide links below. First is the reading guide with about 30 short-answer questions about the two articles.

  1. Link to download the reading guide that goes with both papers, starting with the shorter one.

  2. This is a web link to download the Effects of Attendance paper. (4 pages long and the topic is relatable to undergraduates)

  3. Two web sources for “My Reference Point, Not Yours” (15 pages in total in the JEBO manuscript, but students do not need to read past page 7 for this exercise, and they can skip the Literature Review section)

JEBO link: https://www.sciencedirect.com/science/article/abs/pii/S0167268120300299

SSRN working paper link: https://papers.ssrn.com/sol3/papers.cfm?abstract_id=3434182

Mises’s Interventionism, A Recap

I suspect that Mises may have felt somewhat restless after writing Socialism. He had taken a very good stab at describing the socialist economy and its inadequacy for the promotion of human flourishing. By 1940 fascism had arisen in both Italy and Germany, which Mises considered the clear antagonists of World War II. Further, the communist Soviets were allied with Germany at the time of writing Interventionism.

A communist-fascist alliance may seem strange to ideologues, but it appeared quite natural to Mises that the two distasteful versions of socialism should find cooperation convenient to achieve their own ends. In America, the revelations of German atrocities had yet to arrive and there were many sympathizers with both Russia and Germany. In Britain, union leaders were promoting the idea of socialism as a reward to the public who would be bearing the costs of the war.

Mises thought that the dysfunction of socialism was adequate to describe its ultimate failure as an economic system. However, socialist tendencies were pervasive enough in the liberal market economies, among both ideologues and demagogues, to make the transition to socialism a very real threat. After all, while socialism may not be a stable regime in a dynamic world, certain features within specific market economies may nonetheless tend toward it. What is the cause of such tendencies?


New Data: State Regulatory Procedures

Released this April, but I just heard about it today. Researchers did the painstaking work of going through all 50 states to determine which steps must be taken in each state before new regulations can take effect. For instance, it turns out half of states require economic analysis for new regulations, and half don’t. The paper is here: https://www.mercatus.org/publications/regulation/50-state-review-regulatory-procedures

New Double Auction Paper

This weekend I am at the Economic Science Association meeting.

Most of the economists in this group use experiments as part of their empirical research. In this post I will highlight some recently published work that is in the tradition of Vernon Smith, who influenced all of us so much.

Martinelli, C., Wang, J. & Zheng, W. Competition with indivisibilities and few traders. Experimental Economics (2022). https://doi.org/10.1007/s10683-022-09772-9

Abstract: We study minimal conditions for competitive behavior with few agents. We adapt a price-quantity strategic market game to the indivisible commodity environment commonly used in double auction experiments, and show that all Nash equilibrium outcomes with active trading are competitive if and only if there are at least two buyers and two sellers willing to trade at every competitive price. Unlike previous formulations, this condition can be verified directly by checking the set of competitive equilibria. In laboratory experiments, the condition we provide turns out to be enough to induce competitive results, and the Nash equilibrium appears to be a good approximation for market outcomes. Subjects, although possessing limited information, are able to act as if complete information were available in the market.

This small excerpt from their results shows a market converging toward equilibrium over time, under different treatment conditions. With some opportunities for practice and feedback, agents create surplus value by trading.

Figure 4 plots the average efficiency in each round in the four treatments. Efficiency is defined as the percentage of the maximum social surplus realized. … learning takes longer under the clearing house institution; hence, average efficiency under the clearing house institution presents a stronger upward trend over time. Under the clearing house institution, the average efficiencies start at levels lower than under the double auction institution, and remain statistically lower in the second half of the experiment. Nevertheless, we can observe from Fig. 4 that the upward trend of the efficiencies in clearing house treatments persist over time, and at the end of the experiment, the efficiency levels from the two institutions are close.
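To make the efficiency measure concrete, here is a small Python sketch of how "percentage of the maximum social surplus realized" can be computed in a toy market where each buyer wants one unit and each seller has one unit. The buyer values, seller costs, and trades are hypothetical numbers chosen for illustration; they are not the parameters used by Martinelli, Wang, and Zheng.

```python
# Sketch: market efficiency as realized surplus / maximum possible surplus.
# Assumes unit demand and unit supply; all numbers are hypothetical.


def max_surplus(values, costs):
    """Largest total gains from trade: match highest values with lowest costs."""
    buyers = sorted(values, reverse=True)
    sellers = sorted(costs)
    total = 0
    for value, cost in zip(buyers, sellers):
        if value <= cost:
            break  # no further pair can gain from trading
        total += value - cost
    return total


def realized_surplus(trades):
    """Sum of (buyer value - seller cost) over the trades that occurred.

    The trade price only splits this surplus between the two sides,
    so it does not enter the calculation.
    """
    return sum(value - cost for value, cost in trades)


def efficiency(trades, values, costs):
    """Realized surplus as a percentage of the maximum surplus."""
    best = max_surplus(values, costs)
    if best == 0:
        raise ValueError("no gains from trade are possible in this market")
    return 100.0 * realized_surplus(trades) / best


if __name__ == "__main__":
    values = [10, 8, 6, 4]  # hypothetical buyer valuations
    costs = [3, 5, 7, 9]    # hypothetical seller costs
    # Each trade is (buyer value, seller cost) for the pair that traded.
    trades = [(10, 5), (6, 3)]
    print(f"Efficiency: {efficiency(trades, values, costs):.1f}%")
```

In this toy example the best possible outcome pairs the two highest-value buyers with the two lowest-cost sellers for a surplus of 10. The trades that actually occurred realize a surplus of 8, because a lower-value buyer displaced a higher-value one, so efficiency falls short of 100%.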

But Who Will Build the Roads? 19th Century Edition

In the United States and much of the developed world today, most roads are publicly provided, i.e., they are built and operated by governments. This is not exclusively true, as many private toll roads exist, but the vast majority of roads are owned and operated by governments. Must it be this way?

A recent working paper by Alan Rosevear, Dan Bogart, and Leigh Shaw-Taylor looks at a very important case study: Britain in the 19th century. Britain is important because they were the leading economy in the world at the time, at the forefront of the Industrial Revolution. How were roads built and improved in England and Wales at this time? Here’s what the authors have to say in the abstract:

“non-profit organizations, known as turnpike trusts, built more new roads by attracting private investors and capable surveyors. We also show the Government Mail Road had the highest quality. Nevertheless, most turnpike trust roads were good quality, indicating their practical achievements.”

In the conclusion of the paper, they further add:

“Our analysis demonstrates that turnpike trusts were responsible for building 4,000 miles of new, good quality road in England and Wales, much of it between 1810 and 1838. On a directly comparable basis, the not-for-profit trusts built thirty times the mileage than had been built with direct Government funding during the early 1800s.”

To be clear, this paper is not a completely new discovery. It was already well-known that private companies built roads in Britain, as the authors make clear in their literature review. Similarly, there were many private turnpikes and toll roads in the US in the 19th century, as summarized in an encyclopedia entry by Klein and Majewski.

The Rosevear et al. paper adds new important details. First, they document the extent of private road building and improvements in the 19th century. Second, they show that these roads were generally of good quality, or at least they were of good quality for the time. Prior research had not documented these facts, thus making this a very important advance in our understanding of this time period. But perhaps more importantly, we see the possibility that many more roads today could be privately built and funded with user fees, especially considering that we are much, much wealthier today than 19th-century Britain, we have more extensive and functional capital markets for raising the funds, etc.

An intervention for children to change perceptions of STEM

Here is a new paper related to the topic of women getting into technical fields (see previous post on my paper about programming).

Grosch, Kerstin, Simone Haeckl, and Martin G. Kocher. “Closing the gender STEM gap-A large-scale randomized-controlled trial in elementary schools.” (2022).

These authors were thinking about the same problem at the same time, unbeknownst to me. In their introduction they write, “We currently know surprisingly little about why women still remain underrepresented in STEM fields and which interventions might work to close the gender STEM gap.”

My conclusion from my paper is that, by college age, subjective attitudes toward tech are very important. This leads to the question of whether those subjective attitudes are shaped at younger ages. Grosch et al. have run an experiment to target 3rd-graders with a STEM-themed game. I’ll quote their description:

The treatment web application (treatment app) intends to increase interest in STEM directly by increasing knowledge and awareness about STEM professions and indirectly by addressing the underlying behavioral mechanisms that could interfere with the development of interest in STEM. The treatment app presents both fictitious and real STEM professionals, such as engineers and programmers, on fantasy planets. Accompanied by the professionals, the children playfully learn more about various societal challenges, such as threats from climate change and to public health, and how STEM skills can contribute to combating them. The storyline of the app comprises exercises, videos, and texts. The app also informs children about STEM-related content in general. To address the behavioral mechanisms, the app uses tutorials, exercises, and (non-monetary) rewards that teach children a growth mindset and improve their self-confidence and competitive aptitude. Moreover, the app introduces female STEM role models to overcome stereotypical beliefs. To test the app’s effect, we recruited 39 elementary schools in Vienna (an urban area) and Upper Austria (a predominantly rural area).

This is a preview of their results, although I recommend reading their paper to understand how these measurements were made:

Girls’ STEM confidence increases significantly in the treatment group (difference: 0.047 points or 0.28 standard deviations, p = 0.002, Wald test), and the effect for girls is significantly larger than the effect for boys.

Result 2: Children’s competitiveness is positively associated with children’s interest in STEM. We do not find evidence that stereotypical thinking and a growth mindset is associated with STEM interest.

Lastly, my kids play STEM-themed tablet games. PBS Kids has a great suite of games that are free and educational. Unfortunately, I have not tried to treat one kid while giving the other kid a placebo app, so my ability to do causal inference is limited.