Will LLMs get us the Missing Data for Solving Physics?

Tyler suggested that a “smarter” LLM could not master the unconquered intellectual territory of integrating general relativity and quantum mechanics.

Forget passing Ph.D. level qualifying exams. (j/k James) Are the AIs going to significantly surpass human efforts in generating new knowledge?

What exactly is the barrier to solving the fundamental mysteries of physics? How do we experimentally confirm that all matter breaks down to vibrating strings?

In a podcast episode of Within Reason, Brian Greene says that we can imagine an experiment that would test the proposed unifying String Theory. The Large Hadron Collider is not big enough (17 miles in circumference is too small). We would need a particle accelerator as big as a galaxy.

ChatGPT isn’t going to get us there. However, Brian Greene did suggest that there is a possibility that an advance in mathematics could get us closer to being able to work with the data we have.

Ben Yeoh summarized what he heard from Tyler et al. at a live event about how fast AI will boost the acceleration of our knowledge. They warned that some areas will hit bottlenecks and therefore not advance very fast. Anything that requires clinical trials, for example, isn’t going to proceed at breakneck speed. Yeoh warns that “Protein folding was a rare success,” so we shouldn’t get too excited about acceleration in biotech. If advances in physics require bigger and better physical tools to do more advanced experimental observations, then new AI might not get us far.

However, one of the categories that made Yeoh’s list of where new AI might accelerate progress is “mathematics,” because developing new theories does not face the same kind of physical constraints.

So, we are unlikely to obtain new definitive tests of String Theory, because it is such a capital-intensive field. If AI advances are going to solve this empirical question in my lifetime, the solution will probably come from advances in mathematics that reduce our reliance on new observational data.

Related links:
my article for the Gospel Coalition – We are not “building God,” despite some claims.
my article for EconLog – AI will be constrained by the same problem that David Hume faced. AI can predict what is likely to occur in the future based on what it has observed in the past.

“The big upward trend in Generative AI/LLM tool use in 2025 continues but may be slowing.” Have we reached a plateau, at least temporarily? Have we experienced the big upswing in productivity already, and it’s going to level out now? At least programming will be less painful forever after?

“LLM Hallucination of Citations in Economics Persists with Web-Enabled Models” I realize that, as of today, you can pay for yet-better models than what we tested. But if web-enabled 4o can’t cite Krugman properly, you do wonder if “6o” will be integrating general relativity and quantum mechanics. A slightly longer context window probably isn’t going to do it.

Chesterton Right about the History of Patriotism

Unexpectedly, Chesterton on Patriotism from 2021 is one of my all-time top performing posts due to a slow but steady drip of Google Search hits.

In 1908, G.K. Chesterton published the following line in Orthodoxy,

This, as a fact, is how cities did grow great. Go back to the darkest roots of civilization and you will find them knotted round some sacred stone or encircling some sacred well.

By 1908, Chesterton had likely been exposed to Victorian early anthropological thinkers like Tylor and Frazer. Maybe I shouldn’t be impressed that he’d get it right, but I don’t think of Chesterton as having access to the best and latest evidence for how human civilization evolved.

I was browsing the book Sapiens (2011) this week and came across:

In the conventional picture, pioneers first built a village, and when it prospered, they set up a temple in the middle. But Göbekli Tepe suggests that the temple may have been built first, and that a village later grew up around it. (pg 102)

Today’s post is dedicated to congratulating Chesterton on making a conjecture that lines up with the best of what we now know, supported by archaeological evidence that was only discovered in 1995.

Chesterton wrote,

The only way out of it seems to be for somebody to love Pimlico; to love it with a transcendental tie and without any earthly reason. If there arose a man who loved Pimlico, then Pimlico would rise into ivory towers and golden pinnacles… If men loved Pimlico as mothers love children, arbitrarily, because it is theirs, Pimlico in a year or two might be fairer than Florence.

Also this month I witnessed Americans celebrating the 4th of July. People here love this country “because it is theirs.”

I’ve heard a lot of panicking in the past 10 years about the fate of the nation, and I think we should always be in a partial state of paranoia. But, if love of country is needed in the recipe, we’ve still got it. (you might need an Instagram account to view Mark Zuckerberg wakeboarding in a bald eagle suit)

Hallucination as a User Error

You don’t use a flathead screwdriver to drill a hole in a board. You should know to use a drill.

I appreciate getting feedback on our manuscript, “LLM Hallucination of Citations in Economics Persists with Web-Enabled Models,” via X/Twitter. @_jannalulu wrote: “that paper only tested 4o (which arguably is a bad enough model that i almost never use it).”

Since the scope and frequency of hallucinations came as a surprise to many LLM users, hallucinations have often been used as a ‘gotcha’ to criticize AI optimists. People, myself included, have sounded the alarm that hallucinations could infiltrate articles, emails, and medical diagnoses.

The feedback I got from power users on Twitter this week made me think that there might be a cultural shift in the medium term. (Yes, we are always looking for someone to blame.) Hallucinations will be considered the fault of the human user who should have:

  1. Used a better model (learn your tools)
  2. Written a better prompt (learn how to use your tools)
  3. Avoided assigning the wrong task to LLMs (it’s been known for over 2 years that general LLM models hallucinate citations). What did you expect from “generative” AI? LLMs are telling you what literature ought to exist as opposed to what does exist.

Counting Hallucinations by Web-Enabled LLMs

In 2023, we gathered the data for what became “ChatGPT Hallucinates Nonexistent Citations: Evidence from Economics.” Since then, LLM use has increased. A 2025 survey from Elon University estimates that half of Americans now use LLMs. In the Spring of 2025, we used the same prompts, based on the JEL categories, to obtain a comprehensive set of responses from LLMs about topics in economics.

Our new report on the state of citations is available at SSRN: “LLM Hallucination of Citations in Economics Persists with Web-Enabled Models.”

What did we find? Would you expect the models to have improved since 2023? LLMs have gotten better and are passing ever more of what used to be considered difficult tests. (Remember the Turing Test? Anyone?) ChatGPT can pass the bar exam for new lawyers. And yet, if you ask ChatGPT to write a document in the capacity of a lawyer, it will keep making the mistake of hallucinating fake references. Hence, we keep seeing headlines like, “A Utah lawyer was punished for filing a brief with ‘fake precedent’ made up by artificial intelligence.”

What we call GPT-4o WS (Web Search) in the figure below was queried in April 2025. This “web-enabled” language model is enhanced with real-time internet access, allowing it to retrieve up-to-date information rather than relying solely on static training data. This means it can answer questions about current events, verify facts, and provide live data—something traditional models, which are limited to their last training cutoff, cannot do. While standard models generate responses based on patterns learned from past data, web-enabled models can supplement that with fresh, sourced content from the web, improving accuracy for time-sensitive or niche topics.

At least one third of the references provided by GPT-4o WS were not real! Performance has not significantly improved to the point where AI can write our papers with properly incorporated attribution of ideas. We also found that the web-enabled model would pull from lower quality sources like Investopedia even when we explicitly stated in the prompt, “include citations from published papers. Provide the citations in a separate list, with author, year in parentheses, and journal for each citation.” Even some of the sources that were not journal articles were cited incorrectly. We provide specific examples in our paper.
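To make the “at least one third” figure concrete, here is a minimal sketch, in Python, of how a hallucination rate can be tallied once each citation has been hand-checked against a bibliographic database. The function, field names, and toy entries are hypothetical illustrations, not the paper’s actual dataset or method.

```python
# Hypothetical sketch: computing a hallucination rate from manually
# verified citations. The data below is illustrative only.

def hallucination_rate(citations):
    """Return the share of citations that could not be matched to a real work."""
    if not citations:
        raise ValueError("no citations to score")
    fake = sum(1 for c in citations if not c["exists"])
    return fake / len(citations)

# Toy verification log: each entry records whether a human checker
# found the cited work in a bibliographic database.
checked = [
    {"ref": "Krugman (1991), JPE", "exists": True},
    {"ref": "Smith (2019), AER", "exists": False},
    {"ref": "Dell (2025), JEL", "exists": True},
]

print(f"hallucination rate: {hallucination_rate(checked):.0%}")
```

The same tally can be computed per model (e.g., web-enabled vs. standard) to compare how the rate changes across model versions.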

In closing, consider this quote from an interview with Jack Clark, co-founder of Anthropic:

The best they had was a 60 percent success rate. If I have my baby, and I give her a robot butler that has a 60 percent accuracy rate at holding things, including the baby, I’m not buying the butler.

Illusions of Illusions of Reasoning

Ever since Scott’s post on Tuesday of this week, a new response has been launched titled “The Illusion of the Illusion of the Illusion of Thinking.”

Abstract (emphasis added by me): A recent paper by Shojaee et al. (2025), The Illusion of Thinking, presented evidence of an “accuracy collapse” in Large Reasoning Models (LRMs), suggesting fundamental limitations in their reasoning capabilities when faced with planning puzzles of increasing complexity. A compelling critique by Opus and Lawsen (2025), The Illusion of the Illusion of Thinking, argued these findings are not evidence of reasoning failure but rather artifacts of flawed experimental design, such as token limits and the use of unsolvable problems. This paper provides a tertiary analysis, arguing that while Opus and Lawsen correctly identify critical methodological flaws that invalidate the most severe claims of the original paper, their own counter-evidence and conclusions may oversimplify the nature of model limitations. By shifting the evaluation from sequential execution to algorithmic generation, their work illuminates a different, albeit important, capability. We conclude that the original “collapse” was indeed an illusion created by experimental constraints, but that Shojaee et al.’s underlying observations hint at a more subtle, yet real, challenge for LRMs: a brittleness in sustained, high-fidelity, step-by-step execution. The true illusion is the belief that any single evaluation paradigm can definitively distinguish between reasoning, knowledge retrieval, and pattern execution.

As I am writing a new manuscript about hallucination in web-enabled models, this debate is close to my own work. Conjuring up fake academic references might point to a lack of true reasoning ability.

Do Pro and Dantas believe that LLMs can reason? What they are saying, at least, is that evaluating AI reasoning is difficult. In their words, the whole back-and-forth “highlights a key challenge in evaluation: distinguishing true, generalizable reasoning from sophisticated pattern matching of familiar problems…”

The fact that the first sentence of the paper contains the bigram “true reasoning” is interesting in itself. No one doubts that LLMs are reasoning anymore, at least within their own sandboxes. Hence there have been champagne jokes going around of this sort:

If you’d like to read a response coming from o3 itself, Tyler pointed me to this:

Waymo and Pictures Online

Memes on Twitter are at least as important today as political cartoons of the 19th and 20th centuries. Two memes went viral this week, surrounding the events of protests in Los Angeles.  

The first meme reflects how Lord of the Rings is deeply embedded in American culture. Two million views means that most people don’t need the joke explained. (Part of the Lord of the Rings story is that the wise and powerful elves abandon chaotic Middle Earth for the safety of the Grey Havens. Similarly, the Waymo cars quietly drove themselves to safety away from Los Angeles after several had been vandalized and burned.)

I worry that The Lord of the Rings has made us too optimistic. Americans rarely tolerate stories that do not have happy endings. Has that made it hard for us to understand global events, or impaired our ability to accurately predict how most battles will end?

The next viral meme about the Waymos has two tweets to track. The originator of the joke got half a million views. Someone who added AI-generated images to the original text got half a million views as well.

The four-quadrant grid is a common meme format. Starting in the upper left, perhaps the best way to explain this is that a rigid socialist might resent Waymo as a symbol of Big Tech. At the very least, a socialist who wields state control might want to nationalize Waymo if there are going to be autonomous cars.

“Protect the Waymos” voices support for the police and traditional property rules. This is a joke, keep in mind, so it’s not meant to make perfect sense. The Libertarian Right might be more likely to support individuals protecting themselves with their own weapons as opposed to relying on the police state.

Compare the meme world to the news. I feel like this might become quaint soon, so here’s what it looks like to use the Google News search function.

The Los Angeles Times reports

The autonomous ride-hailing service Waymo has suspended operations in downtown Los Angeles after several of its vehicles were set on fire during protests against immigration raids in the area.

At least five Waymo vehicles were destroyed over the weekend, the company said. Waymo removed its vehicles from downtown but continues to operate in other parts of Los Angeles.

The flaming Waymo image is all over the news and internet. For one thing, people are interested in it because it’s real. This was supposed to be the year of deepfakes, and yet it’s mostly real images and real gaffes that are still making the news.

At this moment in time, most Americans do not yet have Waymo in their cities. Oddly enough, losing an inventory of several cars might be well worth the publicity the company is receiving. Two million viewers of the video of the little driverless cars making an orderly exit from Los Angeles might come away thinking driverless cars are safe and sensible.

Here are more posts from my long-standing interest in cartoons and internet culture.

Kyla Scanlon also felt that the little cars on fire were worth a newsletter post. She wrote:

We scroll past the burning car to see arguments about whether the burning was justified, then scroll past those to see memes about the arguments, then scroll past those to see counter-memes. The cycle feeds itself!

Papers about Economists Using LLMs

1. The most recent (published in 2025) is this piece about doing data analytics that would have been too difficult or costly before. Link and title: Deep Learning for Economists

Considering how much of frontier economics revolves around getting new data, this could be important. On the other hand, people have been doing computer-aided data mining for a while. So it’s more of a progression than a revolution, in my expectation.

2. Using LLMs to actually generate original data and/or test hypotheses like experimenters: Large language models as economic agents: what can we learn from homo silicus? and Automated Social Science: Language Models as Scientist and Subjects

3. Generative AI for Economic Research: Use Cases and Implications for Economists

Korinek has a new supplemental update as current as December 2024: LLMs Learn to Collaborate and Reason: December 2024 Update to “Generative AI for Economic Research: Use Cases and Implications for Economists,” Published in the Journal of Economic Literature 61 (4)

4. For being comprehensive and early: How to Learn and Teach Economics with Large Language Models, Including GPT

5. For giving people proof of a phenomenon that many people had noticed and wanted to discuss: ChatGPT Hallucinates Non-existent Citations: Evidence from Economics

Alert: We will soon have an update for current web-enabled models! It would seem that hallucination rates are going down but the problem is not going away.

6. This was published back in 2023. “ChatGPT ranked in the 91st percentile for Microeconomics and the 99th percentile for Macroeconomics when compared to students who take the TUCE exam at the end of their principles course.” (note the “compared to”): ChatGPT has Aced the Test of Understanding in College Economics: Now What?

References

Buchanan, J., Hill, S., & Shapoval, O. (2023). ChatGPT Hallucinates Non-existent Citations: Evidence from Economics. The American Economist, 69(1), 80–87. https://doi.org/10.1177/05694345231218454

Cowen, Tyler and Tabarrok, Alexander T., How to Learn and Teach Economics with Large Language Models, Including GPT (March 17, 2023). GMU Working Paper in Economics No. 23-18, Available at SSRN: https://ssrn.com/abstract=4391863 or http://dx.doi.org/10.2139/ssrn.4391863

Dell, M. (2025). Deep Learning for Economists. Journal of Economic Literature, 63(1), 5–58. https://doi.org/10.1257/jel.20241733

Geerling, W., Mateer, G. D., Wooten, J., & Damodaran, N. (2023). ChatGPT has Aced the Test of Understanding in College Economics: Now What? The American Economist, 68(2), 233–245. https://doi.org/10.1177/05694345231169654

Horton, J. J. (2023). Large Language Models as Simulated Economic Agents: What Can We Learn from Homo Silicus? arXiv Preprint arXiv:2301.07543.

Korinek, A. (2023). Generative AI for Economic Research: Use Cases and Implications for Economists. Journal of Economic Literature, 61(4), 1281–1317. https://doi.org/10.1257/jel.20231736

Manning, B. S., Zhu, K., & Horton, J. J. (2024). Automated Social Science: Language Models as Scientist and Subjects (Working Paper No. 32381). National Bureau of Economic Research. https://doi.org/10.3386/w32381

Fast Fashion quotes in Yale’s The Politic

A student reporter reached out to me for quotes about Fast Fashion. The article is now up:

Does the Devil Wear Fast Fashion?: Environmental Costs and Cultural Pushback

by Natalia Armas Perez

Ms. Armas Perez has written a current summary of the fast fashion landscape, including quotes from professors like me and industry executives.

Here’s one quote about me:

Joy Buchanan, an associate professor of Quantitative Analysis and Economics at Samford University, expanded on this idea: “Fast fashion is partly just a sign of a richer world. We can all have more of everything, including customized clothes.” In a world where clothing is cheaper than ever, the desire to reinvent oneself through fashion isn’t just encouraged—it’s expected. “Now that T-shirts are so cheap to make, it’s not surprising that people print them up for a single club event with little thought about the financial or environmental cost. Financially speaking, shirts are almost as disposable as plastic forks,” said Buchanan.

About the publication: The Politic has been Yale’s undergraduate journal of politics and culture since 1947. The way people end up finding me to comment on this issue is my original piece for Cato from 2023.

EconTalk Extra on Daisy Christodoulou

I wrote an Extra for the How Better Feedback Can Revolutionize Education (with Daisy Christodoulou) episode.

Can Students Get Better Feedback? is the title of my Extra.

Read the whole thing at the link (ungated), but here are two quotes:

For now, the question is still what kind of feedback teachers can give that really benefits students. Daisy Christodoulou, the guest on this episode, offers a sobering critique of how educators tend to give feedback in education. One of her points is that much of the written feedback teachers give is vague and doesn’t actually help students improve. She shares an example from Dylan Wiliam: a middle school student was told he needed to “make their scientific inquiries more systematic.” When asked what he would do differently next time, the student replied, “I don’t know. If I’d known how to be more systematic, I would have been so the first time.”

Christodoulou also turns to the question many of us are now grappling with: can AI help scale meaningful feedback?

Tariff Tilly (Satire)

Satire news shows are, in my opinion, one of the higher forms of art that my country has produced (and an example of our exports). “Meet Tariff Tilly, the perfect replacement for the 37 dolls your kid does not need” from The Daily Show

“Tariff Tilly” builds. There is even a comment on interest rates (addressed in my previous post).

In this house, we believe in economists writing about dolls. You can find more at https://economistwritingeveryday.com/?s=barbie or https://economistwritingeveryday.com/?s=dolls