Counting Hallucinations by Web-Enabled LLMs

In 2023, we gathered the data for what became “ChatGPT Hallucinates Nonexistent Citations: Evidence from Economics.” Since then, LLM use has increased. A 2025 survey from Elon University estimates that half of Americans now use LLMs. In the spring of 2025, we used the same prompts, based on the JEL categories, to obtain a comprehensive set of responses from LLMs about topics in economics.

Our new report on the state of citations is available at SSRN: “LLM Hallucination of Citations in Economics Persists with Web-Enabled Models.”

What did we find? Would you expect the models to have improved since 2023? LLMs have gotten better and are passing ever more of what used to be considered difficult tests. (Remember the Turing Test? Anyone?) ChatGPT can pass the bar exam for new lawyers. And yet, if you ask ChatGPT to write a document in the capacity of a lawyer, it will keep hallucinating fake references. Hence, we keep seeing headlines like, “A Utah lawyer was punished for filing a brief with ‘fake precedent’ made up by artificial intelligence.”

What we call GPT-4o WS (Web Search) in the figure below was queried in April 2025. This “web-enabled” language model is enhanced with real-time internet access, allowing it to retrieve up-to-date information rather than relying solely on static training data. This means it can answer questions about current events, verify facts, and provide live data—something traditional models, which are limited to their last training cutoff, cannot do. While standard models generate responses based on patterns learned from past data, web-enabled models can supplement that with fresh, sourced content from the web, improving accuracy for time-sensitive or niche topics.

At least one third of the references provided by GPT-4o WS were not real! Performance has not improved to the point where AI can write our papers with properly attributed ideas. We also found that the web-enabled model would pull from lower-quality sources like Investopedia even when we explicitly stated in the prompt, “include citations from published papers. Provide the citations in a separate list, with author, year in parentheses, and journal for each citation.” Even some of the sources that were not journal articles were cited incorrectly. We provide specific examples in our paper.
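The kind of tally behind a claim like “at least one third were not real” can be sketched in a few lines. This is only an illustrative calculation, not our paper’s actual verification pipeline, and the labels and counts below are hypothetical:

```python
# Illustrative sketch (not the paper's actual pipeline): given manual
# verification labels for a batch of model-provided citations, compute
# the share that could not be matched to any real publication.
# All counts below are hypothetical.

def hallucination_rate(labels):
    """labels: list of 'verified' / 'fabricated' / 'garbled' strings.
    'fabricated' citations count as hallucinated; 'garbled' entries
    (a real source cited with wrong details) are reported separately."""
    n = len(labels)
    fabricated = sum(1 for x in labels if x == "fabricated")
    garbled = sum(1 for x in labels if x == "garbled")
    return fabricated / n, garbled / n

# A hypothetical batch of 10 checked citations
labels = ["verified"] * 6 + ["fabricated"] * 3 + ["garbled"]
fab, gar = hallucination_rate(labels)
print(f"fabricated: {fab:.0%}, miscited: {gar:.0%}")  # → fabricated: 30%, miscited: 10%
```

Separating outright fabrications from merely garbled citations matters, because a web-enabled model can cite a real source and still get the author, year, or journal wrong.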

In closing, consider this quote from an interview with Jack Clark, co-founder of Anthropic:

The best they had was a 60 percent success rate. If I have my baby, and I give her a robot butler that has a 60 percent accuracy rate at holding things, including the baby, I’m not buying the butler.

Illusions of Illusions of Reasoning

Just since Scott’s post on Tuesday of this week, a new response has been launched, titled “The Illusion of the Illusion of the Illusion of Thinking.”

Abstract (emphasis added by me): A recent paper by Shojaee et al. (2025), The Illusion of Thinking, presented evidence of an “accuracy collapse” in Large Reasoning Models (LRMs), suggesting fundamental limitations in their reasoning capabilities when faced with planning puzzles of increasing complexity. A compelling critique by Opus and Lawsen (2025), The Illusion of the Illusion of Thinking, argued these findings are not evidence of reasoning failure but rather artifacts of flawed experimental design, such as token limits and the use of unsolvable problems. This paper provides a tertiary analysis, arguing that while Opus and Lawsen correctly identify critical methodological flaws that invalidate the most severe claims of the original paper, their own counter-evidence and conclusions may oversimplify the nature of model limitations. By shifting the evaluation from sequential execution to algorithmic generation, their work illuminates a different, albeit important, capability. We conclude that the original “collapse” was indeed an illusion created by experimental constraints, but that Shojaee et al.’s underlying observations hint at a more subtle, yet real, challenge for LRMs: a brittleness in sustained, high-fidelity, step-by-step execution. The true illusion is the belief that any single evaluation paradigm can definitively distinguish between reasoning, knowledge retrieval, and pattern execution.

As I am writing a new manuscript about hallucination in web-enabled models, this is close to what I am working on. Conjuring up fake academic references might point to a lack of true reasoning ability.

Do Pro and Dantas believe that LLMs can reason? What they are saying, at least, is that evaluating AI reasoning is difficult. In their words, the whole back-and-forth “highlights a key challenge in evaluation: distinguishing true, generalizable reasoning from sophisticated pattern matching of familiar problems…”

The fact that the first sentence of the paper contains the bigram “true reasoning” is interesting in itself. No one doubts that LLMs are reasoning anymore, at least within their own sandboxes. Hence there have been Champagne jokes going around of this sort:

If you’d like to read a response coming from o3 itself, Tyler pointed me to this:

Waymo and Pictures Online

Memes on Twitter are at least as important today as the political cartoons of the 19th and 20th centuries were. Two memes about the protests in Los Angeles went viral this week.

The first meme reflects how Lord of the Rings is deeply embedded in American culture. Two million views means that most people don’t need the joke explained. (Part of the Lord of the Rings story is that the wise and powerful elves abandon chaotic Middle-earth for the safety of the Grey Havens. Similarly, the Waymo cars quietly drove themselves to safety away from Los Angeles after several had been vandalized and burned.)

I worry that The Lord of the Rings has made us too optimistic. Americans rarely tolerate stories that do not have happy endings. Has that made it hard for us to understand global events, or impaired our ability to accurately predict how most battles will end?

The next viral meme about the Waymos has two tweets to track. The originator of the joke got half a million views. Someone who added AI-generated images to the original text got half a million views as well.

The four-quadrant grid is a common meme format. Starting in the upper left, perhaps the best way to explain this is that a rigid socialist might resent Waymo as a symbol of Big Tech. At the very least, a socialist who wields state control might want to nationalize Waymo if there are going to be autonomous cars.

“Protect the Waymos” voices support for the police and traditional property rules. This is a joke, keep in mind, so it’s not meant to make perfect sense. The Libertarian Right might be more likely to support individuals protecting themselves with their own weapons as opposed to relying on the police state.

Compare the meme world to the news. I feel like this might become quaint soon, so here’s what it looks like to use the Google News search function.

The Los Angeles Times reports:

The autonomous ride-hailing service Waymo has suspended operations in downtown Los Angeles after several of its vehicles were set on fire during protests against immigration raids in the area.

At least five Waymo vehicles were destroyed over the weekend, the company said. Waymo removed its vehicles from downtown but continues to operate in other parts of Los Angeles.

The flaming Waymo image is all over the news and internet. For one thing, people are interested in it because it’s real. This was supposed to be the year of deepfakes, and yet it’s mostly real images and real gaffes that are still making the news.

At this moment in time, most Americans do not yet have Waymo in their cities. Oddly enough, losing an inventory of several cars might be well worth the publicity the company is receiving. Two million viewers of the video of the little driverless cars making an orderly exit from Los Angeles might come away thinking driverless cars are safe and sensible.

Here are more posts from my long-standing interest in cartoons and internet culture.

Kyla Scanlon also felt that the little cars on fire were worth a newsletter post. She wrote:

We scroll past the burning car to see arguments about whether the burning was justified, then scroll past those to see memes about the arguments, then scroll past those to see counter-memes. The cycle feeds itself!

Papers about Economists Using LLMs

1. The most recent (published in 2025) is this piece about doing data analytics that would have been too difficult or costly before. Link and title: Deep Learning for Economists

Considering how much of frontier economics revolves around getting new data, this could be important. On the other hand, people have been doing computer-aided data mining for a while. So it’s more of a progression than a revolution, in my expectation.

2. Using LLMs to actually generate original data and/or test hypotheses like experimenters: Large language models as economic agents: what can we learn from homo silicus? and Automated Social Science: Language Models as Scientist and Subjects

3. Generative AI for Economic Research: Use Cases and Implications for Economists

Korinek has a new supplemental update as current as December 2024: LLMs Learn to Collaborate and Reason: December 2024 Update to “Generative AI for Economic Research: Use Cases and Implications for Economists,” Published in the Journal of Economic Literature 61 (4)

4. For being comprehensive and early: How to Learn and Teach Economics with Large Language Models, Including GPT

5. For giving people proof of a phenomenon that many people had noticed and wanted to discuss: ChatGPT Hallucinates Non-existent Citations: Evidence from Economics

Alert: We will soon have an update for current web-enabled models! It would seem that hallucination rates are going down, but the problem is not going away.

6. This was published back in 2023. “ChatGPT ranked in the 91st percentile for Microeconomics and the 99th percentile for Macroeconomics when compared to students who take the TUCE exam at the end of their principles course.” (note the “compared to”): ChatGPT has Aced the Test of Understanding in College Economics: Now What?

References          

Buchanan, J., Hill, S., & Shapoval, O. (2023). ChatGPT Hallucinates Non-existent Citations: Evidence from Economics. The American Economist, 69(1), 80–87. https://doi.org/10.1177/05694345231218454

Cowen, T., & Tabarrok, A. T. (2023). How to Learn and Teach Economics with Large Language Models, Including GPT. GMU Working Paper in Economics No. 23-18. https://ssrn.com/abstract=4391863 or http://dx.doi.org/10.2139/ssrn.4391863

Dell, M. (2025). Deep Learning for Economists. Journal of Economic Literature, 63(1), 5–58. https://doi.org/10.1257/jel.20241733

Geerling, W., Mateer, G. D., Wooten, J., & Damodaran, N. (2023). ChatGPT has Aced the Test of Understanding in College Economics: Now What? The American Economist, 68(2), 233–245. https://doi.org/10.1177/05694345231169654

Horton, J. J. (2023). Large Language Models as Simulated Economic Agents: What Can We Learn from Homo Silicus? arXiv Preprint arXiv:2301.07543.

Korinek, A. (2023). Generative AI for Economic Research: Use Cases and Implications for Economists. Journal of Economic Literature, 61(4), 1281–1317. https://doi.org/10.1257/jel.20231736

Manning, B. S., Zhu, K., & Horton, J. J. (2024). Automated Social Science: Language Models as Scientist and Subjects (Working Paper No. 32381). National Bureau of Economic Research. https://doi.org/10.3386/w32381

Fast Fashion quotes in Yale’s The Politic

A student reporter reached out to me for quotes about Fast Fashion. The article is now up:

Does the Devil Wear Fast Fashion?: Environmental Costs and Cultural Pushback

by Natalia Armas Perez

Ms. Armas Perez has written a current summary of the fast fashion landscape, including quotes from professors like me and industry executives.

Here’s one quote about me:

Joy Buchanan, an associate professor of Quantitative Analysis and Economics at Samford University, expanded on this idea: “Fast fashion is partly just a sign of a richer world. We can all have more of everything, including customized clothes.” In a world where clothing is cheaper than ever, the desire to reinvent oneself through fashion isn’t just encouraged—it’s expected. “Now that T-shirts are so cheap to make, it’s not surprising that people print them up for a single club event with little thought about the financial or environmental cost. Financially speaking, shirts are almost as disposable as plastic forks,” said Buchanan.

About the publication: The Politic has been Yale’s undergraduate journal of politics and culture since 1947. People end up finding me to comment on this issue through my original piece for Cato from 2023.

EconTalk Extra on Daisy Christodoulou

I wrote an Extra for the How Better Feedback Can Revolutionize Education (with Daisy Christodoulou) episode.

Can Students Get Better Feedback? is the title of my Extra.

Read the whole thing at the link (ungated), but here are two quotes:

For now, the question is still what kind of feedback teachers can give that really benefits students. Daisy Christodoulou, the guest on this episode, offers a sobering critique of how educators tend to give feedback. One of her points is that much of the written feedback teachers give is vague and doesn’t actually help students improve. She shares an example from Dylan Wiliam: a middle school student was told he needed to “make their scientific inquiries more systematic.” When asked what he would do differently next time, the student replied, “I don’t know. If I’d known how to be more systematic, I would have been so the first time.”

Christodoulou also turns to the question many of us are now grappling with: can AI help scale meaningful feedback?

Tariff Tilly (Satire)

Satire news shows are, in my opinion, one of the higher forms of art that my country has produced (and one of our exports). From The Daily Show: “Meet Tariff Tilly, the perfect replacement for the 37 dolls your kid does not need.”

“Tariff Tilly” builds. There is even a comment on interest rates (addressed in my previous post).

In this house, we believe in economists writing about dolls. You can find more at https://economistwritingeveryday.com/?s=barbie or https://economistwritingeveryday.com/?s=dolls

Discuss AI Boom with Joy on May 12

I’m not just doing this to plug my own event. It’s also about the only thing on my mind after spending the week leading and moderating this timely discussion.

If you like to read and discuss with smart people, then you can make a free account in the Liberty Fund Portal. If you listen to this podcast over the weekend, Marc Andreessen on Why AI Will Save the World (2023), you will be up to speed for our asynchronous virtual debate room on Monday, May 12.

Keeping in mind the stark contrast between this and the doomers we discussed in the past week, here is Marc’s argument in a nutshell:

“The reason I’m so optimistic is because we know for a fact–as sort of one of the most subtle conclusions in all of science–we know for a fact that in human affairs, intelligence makes everything better. And, by “everything,” I mean basically every outcome of human welfare and life quality that essentially we can measure.”

When it’s put that way, it’s hard to disagree. Who would want less intelligence?

See more details on all readings and the final Zoom meeting in my previous post.

Another interesting bit by Marc:

“By the way, look: there’s lots of work happening that’s not being published in papers. And so, the other part of what we do is to actually talk to the practitioners.”

Even though it might seem strange to look to podcasts instead of published books and papers for cutting-edge information, it really does seem like the story was told in human voices for the past three years. Dwarkesh was probably the best, but Tyler and Russ deserve credit as well for bringing these conversations out of the closed rooms and into the public domain.

Discuss AI Doom with Joy on May 5

If you like to read and discuss with smart people, then you can make a free account in the Liberty Fund Portal. If you listen to this podcast over the weekend, Eliezer Yudkowsky on the Dangers of AI (2023), you will be up to speed for our asynchronous virtual debate room on Monday, May 5.

Russ Roberts sums up the doomer argument using the following metaphor:

The metaphor is primitive. Zinjanthropus man or some primitive form of pre-Homo sapiens sitting around a campfire and human being shows up and says, ‘Hey, I got a lot of stuff I can teach you.’ ‘Oh, yeah. Come on in,’ and pointing out that it’s probable that we are either destroyed directly by murder or maybe just by out-competing all the previous hominids that came before us, and that in general, you wouldn’t want to invite something smarter than you into the campfire.

What do you think of this metaphor? By incorporating AI agents into society, are we inviting a smarter being to our campfire? Is it likely to eventually kill us out of contempt or neglect? That will be what we are discussing over in the Portal this week.

Is your P(Doom) < 0.05? Great: that means you believe the probability of AI turning us into paperclips is less than 5%. Come one, come all. You can argue against doomers during the May 5-9 Week of Doom, and then you will love Week Two. On May 12-16, we will make the optimistic case for AI!

See more details on all readings and the final Zoom meeting in my previous post.

When Commitment Backfires: The Economics Behind Gang Tattoos and Changing Incentives

In economics, commitment devices are often seen as clever solutions to self-control problems—ways people can tie their future hands to avoid giving in to temptation. A smoker throws away their cigarettes, a dieter pays in advance for healthy meals, a student announces a deadline publicly so they can’t back out. The idea is that by limiting future choices, a person can force themselves to stick with a preferred long-term strategy. But commitment devices also show up in places far removed from personal productivity—and in some cases, they carry bad unintended consequences when the strategic landscape shifts.

Consider the case of gang tattoos, especially those associated with MS-13. For years, highly visible tattoos served as a powerful way to demonstrate loyalty to the group. These tattoos—sometimes covering the face, neck, or arms—weren’t just aesthetic. They signaled that the individual was fully committed to the gang. In economic terms, they functioned as a high-cost, hard-to-fake commitment device. By making oneself easily identifiable as a gang member, a person burned bridges to legitimate employment or life outside the gang. That might seem irrational at first glance, but it was often a rational decision in context. Within certain neighborhoods or prisons, that signal provided protection, status, and trust among peers. The visible commitment reduced the gang’s uncertainty about who was loyal and who might defect.
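The separating logic of a costly signal can be written down directly: the tattoo is worth getting only when the in-gang benefit exceeds the outside options it forecloses, which holds for a committed member and fails for an uncommitted one. The payoff numbers below are purely hypothetical, chosen to illustrate the condition rather than estimate anything:

```python
# A minimal costly-signaling sketch with hypothetical payoffs.
# A visible tattoo pays off only if the benefit of being trusted
# inside the gang exceeds the cost of foreclosed outside options.
# The signal separates types when committed members choose to send
# it and uncommitted members do not.

def sends_signal(inside_benefit, outside_option_lost):
    """True if the costly signal is individually rational to send."""
    return inside_benefit > outside_option_lost

committed = dict(inside_benefit=10, outside_option_lost=4)   # values gang life
uncommitted = dict(inside_benefit=3, outside_option_lost=8)  # values exit options

print(sends_signal(**committed))    # → True: the committed member tattoos
print(sends_signal(**uncommitted))  # → False: the uncommitted member does not
```

The crackdown described below effectively raised `outside_option_lost` after the fact for everyone already inked, which is exactly how a once-rational commitment device backfires when the strategic landscape shifts.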

But the rules of the game changed. In March 2022, El Salvador launched an aggressive crackdown on gangs following a sharp spike in homicides. Under a sweeping “state of exception,” authorities suspended constitutional rights, arrested tens of thousands of people, and expanded prison capacity dramatically. Tattoos quickly became one of the easiest ways for police to identify and detain suspected gang members. News reports describe men being pulled from buses or homes not for current criminal activity, but simply because of the ink on their skin. In many cases, the tattoos were from years earlier—when the wearer had been young and immersed in a world where signaling loyalty felt necessary for survival. Now, those same signals serve as evidence in court or grounds for indefinite detention.

Continue reading