Writing Humanity’s Last Exam

When every frontier AI model can pass your tests, how do you figure out which model is best? You write a harder test.

That was the idea behind Humanity’s Last Exam, an effort by Scale AI and the Center for AI Safety to develop a large database of PhD-level questions that the best AI models still get wrong.

The effort has proven popular: the paper summarizing it has already been cited 91 times since its release on March 31st, and the main AI labs have been testing their new models on the exam. xAI announced today that its new Grok 4 model has the highest score yet on the exam, 44.4%.

Current leaderboard on the Humanity’s Last Exam site, not yet showing Grok 4

The process of creating the dataset is a fascinating example of a distributed academic mega-project, a format that is becoming a trend and has also been important in efforts to replicate previous research. The organizers of Humanity's Last Exam let anyone submit a question for their dataset, offering co-authorship to anyone whose question they accepted, and cash prizes to those who had the best questions accepted. In the end they wound up with just over 1,000 coauthors on the paper (including yours truly as one very minor contributor), and gave out $500,000 to contributors of the very best questions (not me), which seemed incredibly generous until Scale AI sold a 49% stake in their company to Meta for $14.8 billion in June.

Source: Figure 4 of the paper

Here’s what I learned in the process of trying to stump the AIs and get questions accepted into this dataset:

  1. The AIs were harder to stump than I expected, because the exam used frontier models rather than the free-tier models I was used to using on my own. If you think AI can't answer your question, try a newer model.
  2. It was common for me to try a question that several models would get wrong but at least one would still get right. For me this was annoying, because questions could only be accepted if every model got them wrong. But of course, if you want a correct answer, this means trying more models is good, even if they are all in the same tier. If you can't tell what a correct answer looks like and your question is important, make sure to try several models and see if they give different answers.
  3. Top models are now quite good at interpreting regression results, even when you try to give them unusually tricky tables.
  4. AI still has weird weaknesses and blind spots; it can outperform PhDs in the relevant field on one question, then do worse than 3rd graders on the next. This exam specifically wanted PhD-level questions, where a typical undergrad not only couldn't answer the question, but probably couldn't even understand what was being asked. But it specifically excluded "simple trick questions", "straightforward calculation/computation questions", and questions "easily answerable by everyday people", even if all the AIs got them wrong. My son had the idea to ask them to calculate hyperfactorials (a sketch of the calculation follows this list); we found some relatively low numbers that stumped all the AI models, but the human judges ruled that our question was too simple to count. On a question I did get accepted, I included an explanation for the human judges of why I thought it wasn't too simple.
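
For readers who haven't met them: the hyperfactorial of n is H(n) = 1^1 · 2^2 · … · n^n, which grows far faster than the ordinary factorial. Here is a minimal Python sketch of the calculation (my own illustration, not the wording of the question we submitted):

```python
# Hyperfactorial: H(n) = 1^1 * 2^2 * ... * n^n.
# Illustrative sketch only; not the question submitted to the exam.

def hyperfactorial(n: int) -> int:
    """Return H(n), the product of k**k for k = 1..n."""
    result = 1
    for k in range(1, n + 1):
        result *= k ** k
    return result

if __name__ == "__main__":
    for n in range(1, 8):
        print(n, hyperfactorial(n))
    # H(5) = 86,400,000 and H(7) already exceeds 3 * 10**18,
    # so even "relatively low numbers" force exact big-integer arithmetic.
```

Even modest inputs force exact big-integer arithmetic, which is presumably where the models slipped; the judges' point was simply that this is a calculation, not a PhD-level question.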

I found this to be a great opportunity to observe the strengths and weaknesses of frontier models, and to get my name on an important paper. While the AI field is being driven primarily by the people with the chops to code frontier models, economists still have a lot we can contribute here, as Joy has shown. Any economist looking for the next way to contribute should check out Anthropic's new Economic Futures Program.

Hallucination as a User Error

You don't use a flat-head screwdriver to drill a hole in a board. You should know to use a drill.

I appreciate getting feedback on our manuscript, “LLM Hallucination of Citations in Economics Persists with Web-Enabled Models,” via X/Twitter. @_jannalulu wrote: “that paper only tested 4o (which arguably is a bad enough model that i almost never use it).”

Since the scope and frequency of hallucinations came as a surprise to many LLM users, they have often been used as a ‘gotcha’ to criticize AI optimists. People, myself included, have sounded the alarm that hallucinations could infiltrate articles, emails, and medical diagnoses.

The feedback I got from power users on Twitter this week made me think that there might be a cultural shift in the medium term. (Yes, we are always looking for someone to blame.) Hallucinations will be considered the fault of the human user who should have:

  1. Used a better model (learn your tools)
  2. Written a better prompt (learn how to use your tools)
  3. Not assigned that task to an LLM in the first place (it's been known for over two years that general-purpose LLMs hallucinate citations). What did you expect from "generative" AI? LLMs are telling you what literature ought to exist, as opposed to what does exist.

My Perfunctory Intern

A couple of years ago, my co-blogger Mike described his productive but novice intern. The helper could summarize expert opinion, but they had no real understanding of their own. To boot, they were fast and tireless. Of course, he was talking about ChatGPT. Joy has also written in multiple places about the errors made by ChatGPT, including fake citations.

I use ChatGPT Pro, which has web access, and my experience is that it is not so tireless. Much like Mike, I have used ChatGPT to help me write Python code. I know the basics of Python and how to read a lot of it. However, the multitude of methods and possible arguments are not nestled firmly in my skull. I'm much faster at reading Python code than at writing it. Therefore, ChatGPT has been amazing… mostly.

I have found that ChatGPT is more like an intern than many suppose.


Counting Hallucinations by Web-Enabled LLMs

In 2023, we gathered the data for what became “ChatGPT Hallucinates Nonexistent Citations: Evidence from Economics.” Since then, LLM use has increased. A 2025 survey from Elon University estimates that half of Americans now use LLMs. In the Spring of 2025, we used the same prompts, based on the JEL categories, to obtain a comprehensive set of responses from LLMs about topics in economics.

Our new report on the state of citations is available at SSRN: "LLM Hallucination of Citations in Economics Persists with Web-Enabled Models."

What did we find? Would you expect the models to have improved since 2023? LLMs have gotten better and are passing ever more of what used to be considered difficult tests. (Remember the Turing Test? Anyone?) ChatGPT can pass the bar exam for new lawyers. And yet, if you ask ChatGPT to write a document in the capacity of a lawyer, it will keep making the mistake of hallucinating fake references. Hence, we keep seeing headlines like, "A Utah lawyer was punished for filing a brief with 'fake precedent' made up by artificial intelligence."

What we call GPT-4o WS (Web Search) in the figure below was queried in April 2025. This “web-enabled” language model is enhanced with real-time internet access, allowing it to retrieve up-to-date information rather than relying solely on static training data. This means it can answer questions about current events, verify facts, and provide live data—something traditional models, which are limited to their last training cutoff, cannot do. While standard models generate responses based on patterns learned from past data, web-enabled models can supplement that with fresh, sourced content from the web, improving accuracy for time-sensitive or niche topics.

At least one third of the references provided by GPT-4o WS were not real! Performance has not significantly improved to the point where AI can write our papers with properly incorporated attribution of ideas. We also found that the web-enabled model would pull from lower quality sources like Investopedia even when we explicitly stated in the prompt, “include citations from published papers. Provide the citations in a separate list, with author, year in parentheses, and journal for each citation.” Even some of the sources that were not journal articles were cited incorrectly. We provide specific examples in our paper.
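
One reason hallucinated citations are so frustrating is that checking whether a reference exists is easy to automate. As a minimal sketch (my own illustration using the public Crossref API, not the verification procedure we used in the paper), one can look a claimed title up and inspect the closest match:

```python
# Minimal citation-existence check against the public Crossref API.
# Illustrative sketch only; not the procedure used in our paper.
import requests

def best_crossref_match(title: str) -> dict:
    """Query Crossref for a title and return the top-scoring record."""
    resp = requests.get(
        "https://api.crossref.org/works",
        params={"query.title": title, "rows": 1},
        timeout=30,
    )
    resp.raise_for_status()
    items = resp.json()["message"]["items"]
    return items[0] if items else {}

if __name__ == "__main__":
    claimed = "ChatGPT Hallucinates Nonexistent Citations: Evidence from Economics"
    match = best_crossref_match(claimed)
    print("Claimed:", claimed)
    print("Top hit:", match.get("title", ["<none>"])[0])
    print("DOI:    ", match.get("DOI", "<none>"))
    # A human (or a fuzzy string match) still has to judge whether the top
    # hit really is the cited work, or merely a similar-sounding one.
```

A check like this catches references that simply do not exist; it does not catch real papers cited for claims they never made, which is a separate problem.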

In closing, consider this quote from an interview with Jack Clark, co-founder of Anthropic:

The best they had was a 60 percent success rate. If I have my baby, and I give her a robot butler that has a 60 percent accuracy rate at holding things, including the baby, I’m not buying the butler.

Excluding “Non-Excludable” Goods

Intro microeconomics classes teach that some goods are “non-excludable”, meaning that people who don’t pay for them can’t be stopped from using them. This can lead to a “tragedy of the commons”, where the good gets overused because people don’t personally bear the cost of using it and don’t care about the costs they impose on others. Overgrazing land and overfishing the seas are classic examples.

Source: Microeconomics, by Michael Parkin

Students sometimes get the impression that “excludability” is an inherent property of a good. But in fact, which goods are excludable is a function of laws, customs, and technologies, and these can change over time. Land might be legally non-excludable (and so over-grazed) when it is held in common, but become excludable when the land is privatized or when barbed wire makes enclosing it cheap. Over time, such changes have turned over-grazing into a relatively minor issue.

Overfishing remains a major problem, but this could be starting to change. Legal and technological changes have allowed for enclosed, private aquaculture on some coasts, which now provides a large and growing share of all fish eaten by humans. Permitting systems put limits on catches in many countries' waters, though the high seas remain a true tragedy of the commons for now.
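
To see why open access produces overfishing, a stylized textbook fishery model is enough: the value of the total catch rises with total effort but at a decreasing rate, and under open access boats keep entering until the average return just covers the cost of fishing, whereas a sole owner (or a binding quota) would stop where the marginal return covers it. A minimal Python sketch with made-up numbers (my illustration, not drawn from any of the sources above):

```python
# Stylized open-access vs. optimal fishing effort (Gordon-Schaefer flavor).
# Parameter values are made up purely for illustration.
a, b = 10.0, 0.5   # total catch value: Y(E) = a*E - b*E**2
c = 2.0            # cost per unit of fishing effort

# Open access: entry until average return equals cost  ->  a - b*E = c
E_open = (a - c) / b
# Social optimum: marginal return equals cost           ->  a - 2*b*E = c
E_opt = (a - c) / (2 * b)

def net_value(E: float) -> float:
    """Total net value of the fishery at aggregate effort E."""
    return a * E - b * E**2 - c * E

print(f"Open-access effort: {E_open:.1f}, net value: {net_value(E_open):.1f}")
print(f"Optimal effort:     {E_opt:.1f}, net value: {net_value(E_opt):.1f}")
# Open access doubles the effort and drives the fishery's net value to zero,
# because each new boat ignores the catch it takes away from everyone else.
```

In this toy example open access dissipates all of the fishery's rent, which is the "tragedy" in miniature; quotas and enclosure work by pushing effort back toward the lower, optimal level.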

While countries have tried to enforce limits on catches in their national waters, monitoring how many fish every boat is taking has been challenging, so illegal overfishing has remained widespread. But technology is in the process of changing this. For instance, ThayerMahan is developing hydrophone arrays that use sound to track boats.

Technologies like hydrophones and satellites, if used well, will increasingly make public waters more “excludable” and reduce “tragedy of the commons” overfishing.

Saving Money by Ordering Car Parts from Amazon or eBay

Here is a personal economic anecdote from this week. A medium-sized dead branch fell from a tall tree and ripped off the driver-side mirror on my old Honda. My local repair shop said it would cost around $600 to replace it. That is a significant percentage of what the old clunker is worth. Ouch.

They kindly noted that most of that cost was the replacement mirror assembly from Honda, which would cost over $400 and take several days to arrive. I asked if I could try to get a mirror from a junkyard to save money. The repair guy said they would be willing to install a part I brought in, but suggested eBay or Amazon instead.

Back 20 years ago, before online commerce was so established, my local repair shop would routinely save us money by getting used parts from some sort of junkyard network.
So, I started looking into that route. First, junkyards are not junkyards anymore; they are "salvage yards." Second, it turns out that removing a side mirror from a Honda is not a simple matter. You have to remove the whole inside plastic door panel to get at the mirror mounting screws, and removing that panel has some complications. Also, I could not find a clear online resource for locating parts at regional salvage yards. It looks like you have to drive to a salvage yard, and perhaps have them search some sort of database to find a comparable vehicle somewhere that might have the part you want.


All this seemed like a lot of hassle, so I went to eBay and found a promising-looking new replacement part there for about $56, including shipping. It would take about a week to get here (probably being direct shipped from China). On Amazon, I found essentially the same part for about $63, which would get here the next day. For the small difference in price, I went the Amazon route, partly for the no-hassle returns if the part turned out to be defective and partly because I get 5% back with my Amazon credit card there (after the 5% back, the effective price gap was only about $4).
I just got the car back from the repair shop with the replacement mirror, and it works fine. The total cost, with labor, was about $230, which is much better than the original $600+ estimate.


I’m not sure how broadly to generalize this experience. Some further observations:

( 1 ) For a really critical car part, I'd have to consider carefully whether the Chinese knock-off would perform appreciably worse than some name-brand part, although I believe many repair shops often use parts that are not strictly original.

( 2 ) Commonly replaced parts like oil and air filters are typically cheaper to buy online than from AutoZone or another local merchant. I like supporting local shops, so sometimes I eat the few extra $$ and shopping time and buy from bricks and mortar.

( 3 ) Some repair shops make significant money on their markup on parts, and so they might not be happy about you bringing in your own parts. They also might decline to warrant the operation of that part. And many big box franchise repair shops may simply refuse to install customer-supplied parts.

( 4 ) For a newish car, still under warranty, the manufacturer warranty might be affected by using non-original parts.

( 5 ) Back to junk/salvage yards: there are some car parts, so-called hard parts, that are expected to last the life of the car. Things like the mounting brackets for engine parts. Typically, no spares of these are manufactured. So, if one of those parts gets dinged up in an accident, your only option may be used parts taken from a junker.

Illusions of Illusions of Reasoning

In just the time since Scott's post on Tuesday of this week, a new response has been launched, titled "The Illusion of the Illusion of the Illusion of Thinking."

Abstract (emphasis added by me): A recent paper by Shojaee et al. (2025), The Illusion of Thinking, presented evidence of an “accuracy collapse” in Large Reasoning Models (LRMs), suggesting fundamental limitations in their reasoning capabilities when faced with planning puzzles of increasing complexity. A compelling critique by Opus and Lawsen (2025), The Illusion of the Illusion of Thinking, argued these findings are not evidence of reasoning failure but rather artifacts of flawed experimental design, such as token limits and the use of unsolvable problems. This paper provides a tertiary analysis, arguing that while Opus and Lawsen correctly identify critical methodological flaws that invalidate the most severe claims of the original paper, their own counter-evidence and conclusions may oversimplify the nature of model limitations. By shifting the evaluation from sequential execution to algorithmic generation, their work illuminates a different, albeit important, capability. We conclude that the original “collapse” was indeed an illusion created by experimental constraints, but that Shojaee et al.’s underlying observations hint at a more subtle, yet real, challenge for LRMs: a brittleness in sustained, high-fidelity, step-by-step execution. The true illusion is the belief that any single evaluation paradigm can definitively distinguish between reasoning, knowledge retrieval, and pattern execution.

As I am writing a new manuscript about hallucination by web-enabled models, this is close to what I am working on. Conjuring up fake academic references might point to a lack of true reasoning ability.

Do Pro and Dantas believe that LLMs can reason? What they are saying, at least, is that evaluating AI reasoning is difficult. In their words, the whole back-and-forth “highlights a key challenge in evaluation: distinguishing true, generalizable reasoning from sophisticated pattern matching of familiar problems…”

The fact that the first sentence of the paper contains the bigram "true reasoning" is interesting in itself. No one doubts that LLMs are reasoning anymore, at least within their own sandboxes. Hence there have been Champagne jokes going around of this sort:

If you’d like to read a response coming from o3 itself, Tyler pointed me to this:

Did Apple's Recent "Illusion of Thinking" Study Expose Fatal Shortcomings in Using LLMs for Artificial General Intelligence?

Researchers at Apple last week published a paper with the provocative title, "The Illusion of Thinking: Understanding the Strengths and Limitations of Reasoning Models via the Lens of Problem Complexity." This paper has generated an uproar in the AI world. Having "The Illusion of Thinking" right there in the title is pretty in-your-face.

Traditional Large Language Model (LLM) artificial intelligence programs like ChatGPT train on massive amounts of human-generated text to be able to mimic human outputs when given prompts. A recent trend (mainly starting in 2024) has been the incorporation of more formal reasoning capabilities into these models. The enhanced models are termed Large Reasoning Models (LRMs). Now some leading LLMs, like OpenAI's GPT, Claude, and the Chinese DeepSeek, exist both in regular LLM form and also as LRM versions.

The authors applied both the regular (LLM) and "thinking" (LRM) versions of Claude 3.7 Sonnet and DeepSeek to a number of mathematical puzzles. OpenAI's o-series models were used to a lesser extent. An advantage of these puzzles is that researchers can, while keeping the basic form of the puzzle, dial in more or less complexity.

They found, among other things, that the LRMs did well up to a certain point, then suffered "complete collapse" as complexity was increased. Also, at low complexities, the plain LLMs actually outperformed the LRMs. And (perhaps the most vivid evidence of lack of actual understanding on the part of these programs), when they were explicitly offered an efficient direct solution algorithm in the prompt, the programs did not take advantage of it, but instead just kept grinding away in their usual fashion.
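
One of the study's puzzles was the Tower of Hanoi, where "complexity" is simply the number of disks, and the efficient direct algorithm is the classic recursion. A minimal Python sketch (my illustration, not the researchers' code) shows how the required number of moves, 2^n − 1, explodes even though the rules never change:

```python
# Classic Tower of Hanoi solver.  Complexity is "dialed in" by n, the
# number of disks; the optimal solution always takes 2**n - 1 moves.
# Illustrative sketch only; not the code used in the Apple study.

def hanoi(n: int, src: str = "A", aux: str = "B", dst: str = "C", moves=None):
    """Append the optimal move sequence for n disks to `moves`."""
    if moves is None:
        moves = []
    if n == 1:
        moves.append((src, dst))
        return moves
    hanoi(n - 1, src, dst, aux, moves)   # park n-1 disks on the spare peg
    moves.append((src, dst))             # move the largest disk
    hanoi(n - 1, aux, src, dst, moves)   # restack the n-1 disks on top
    return moves

if __name__ == "__main__":
    for n in (3, 10, 15):
        print(n, "disks ->", len(hanoi(n)), "moves")
    # 3 disks -> 7 moves, 10 -> 1,023, 15 -> 32,767.
```

That is exactly what makes these puzzles attractive as benchmarks: the instructions stay a paragraph long while the length of flawless execution required grows exponentially.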

As might be expected, AI skeptics were all over the blogosphere, saying, I told you so: LLMs are just massive exercises in pattern matching, and cannot extrapolate outside of their training set. This has massive implications for what we can expect in the near or intermediate future. Among other things, the optimism about AI progress is largely what is fueling the stock market, and also capital investment in this area: companies like Meta and Google are spending ginormous sums trying to develop artificial "general" intelligence, paying for ginormous amounts of compute power, with those dollars flowing to firms like Microsoft and Amazon building out data centers and buying chips from Nvidia. If the AGI emperor has no clothes, all this spending might come to a screeching, crashing halt.

Ars Technica published a fairly balanced account of the controversy, concluding that, “Even elaborate pattern-matching machines can be useful in performing labor-saving tasks for the people that use them… especially for coding and brainstorming and writing.”

Comments on this article included this one:

LLMs do not even know what the task is, all it knows is statistical relationships between words.   I feel like I am going insane. An entire industry’s worth of engineers and scientists are desperate to convince themselves a fancy Markov chain trained on all known human texts is actually thinking through problems and not just rolling the dice on what words it can link together.

And

if we equate combinatorial play and pattern matching with genuinely “generative/general” intelligence, then we’re missing a key fact here. What’s missing from all the LLM hubris and enthusiasm is a reflexive consciousness of the limits of language, of the aspects of experience that exceed its reach and are also, paradoxically, the source of its actual innovations. [This is profound, he means that mere words, even billions of them, cannot capture some key aspects of human experience]

However, the AI bulls have mounted various comebacks to the Apple paper. The most effective I know of so far was published by Alex Lawsen, a researcher at the grantmaking organization Open Philanthropy. Lawsen's rebuttal, titled "The Illusion of the Illusion of Thinking," was summarized by Marcus Mendes. To summarize the summary, Lawsen claimed that the models did not in general "collapse" in some crazy way. Rather, the models in many cases recognized that they would not be able to solve the puzzles given the constraints input by the Apple researchers. Therefore, they (rather intelligently) did not try to waste compute power by grinding away at a necessarily incomplete solution, but just stopped. Lawsen further showed that the way Apple ran the LRM models did not allow them to perform as well as they could. When he made a modest, reasonable change in the operation of the LRMs,

Models like Claude, Gemini, and OpenAI’s o3 had no trouble producing algorithmically correct solutions for 15-disk Hanoi problems, far beyond the complexity where Apple reported zero success.

Lawsen’s conclusion: When you remove artificial output constraints, LRMs seem perfectly capable of reasoning about high-complexity tasks. At least in terms of algorithm generation.

And so, the great debate over the prospects of artificial general intelligence will continue.

The Comeback of Gold as Money

According to Merriam-Webster, "money" is: "something generally accepted as a medium of exchange, a measure of value, or a means of payment." Money, in its various forms, also serves as a store of value. Gold has maintained the store-of-value function all through the past centuries, including our own times; as an investment, gold has done well in the past couple of decades. I plan to write more later on the investment aspect, but here I focus on the use of physical gold as a means of payment or exchange, or as backing for a means of exchange.

Gold, typically in the form of standardized coins, served the means-of-exchange function for thousands of years. Starting in the Renaissance, however, banks started issuing paper certificates which were exchangeable for gold. For daily transactions, the public found it more convenient to handle these bank notes than the gold pieces themselves, and so these notes were used instead of gold as money.

In the late nineteenth and early twentieth centuries, leading paper currencies like the British pound and the U.S. dollar were theoretically backed by gold; one could turn in a dollar and convert it to the precious metal. Most countries dropped the convertibility to gold during the Great Depression of the 1930s, so their currencies became entirely "fiat" money, not tied to any physical commodity. For the U.S. dollar, there was limited convertibility to gold after World War II as part of the Bretton Woods system of international currencies, but even that convertibility ended in 1971. In fact, it was illegal for U.S. citizens to own much in the way of physical gold from FDR's (infamous?) executive order in 1933 until Gerald Ford's repeal of that order at the end of 1974.

So gold has been essentially extinct as active money for nearly a hundred years. The elite technocrats who manage national financial affairs have been only too happy to dance on its grave. Keynes famously denounced the gold standard as a “barbarous relic”, standing in the way of purposeful management of national money matters.

However, gold seems to be making something of a comeback, on several fronts. Most notably, several U.S. states have promoted the use of gold in transactions. Deep-red Utah has led the way.  In 2011, Utah passed the Legal Tender Act, recognizing gold and silver coins issued by the federal government as legal tender within the state. This legislation allows individuals to transact in gold and silver coins without paying state capital gains tax.  The Utah House and Senate passed bills in 2025 to authorize the state treasurer to establish a precious metals-backed electronic payment platform, which would enable state vendors to opt for payments in physical gold and silver. The Utah governor vetoed this bill, though, claiming it was “operationally impractical.” 

Meanwhile, in Texas:

The new legislation, House Bill 1056, aims to give Texans the ability, likely through a mobile app or debit card system, to use gold and silver they hold in the state’s bullion depository to purchase groceries or other standard items.

The bill would also recognize gold and silver as legal tender in Texas, with the caveat that the state’s recognition must also align with currency laws laid out in the U.S. Constitution.

“In short, this bill makes gold and silver functional money in Texas,” Rep. Mark Dorazio (R-San Antonio), the main driving force behind the effort, said during one 2024 presentation. “It has to be functional, it has to be practical and it has to be usable.”

Arkansas and Florida have also passed laws allowing the use of gold and silver as legal tender. A potential problem is that under current IRS law, gold and silver are generally classified as collectibles and subject to potential capital gains taxes when transactions occur. Texas legislator Dorazio has argued that liability would go away if the metals are classified as functional money, although he’s also acknowledged the tax issue “might end up being decided by the courts.”

But as Europeans found back in the day, carrying around actual clinking gold coins for purchasing and making change is much more of a hassle than paper transactions. And so, various convenient payment or exchange methods, backed by physical gold, have recently arisen.

Since it is relatively easy and lucrative to spawn a new cryptocurrency (which is why there are thousands of them), it is not surprising that there are now several coins supposedly backed by bullion. These include Paxos Gold (PAXG) and Tether Gold (XAUT). The gold of Paxos is stored in the worldwide vaults of Brink's and is regularly audited by a credible third party. Tether's gold supposedly resides somewhere in Switzerland. The firm itself is incorporated in the British Virgin Islands. Tether in general does not conduct regular audits; its official statements dance around that fact. These crypto coins, like bullion itself or various funds like GLD that hold gold, are in practice probably mainly an investment vehicle (store of value), rather than an active medium of exchange.

However, getting down to the consumer level of payment convenience, we now have a gold-backed credit card (Glint) and debit card (VeraCash Mastercard). Both of these hold their gold in Swiss vaults. The funds you place with these companies have gold allocated to them, so these are a (seemingly cost-effective) means to own gold. If you get nervous, you can actually (subject to various rules) redeem your funds for actual shiny yellow metal.

Papers about Economists Using LLMs

  1. The most recent (published in 2025) is this piece about doing data analytics that would have been too difficult or costly before. Link and title: Deep Learning for Economists

Considering how much of frontier economics revolves around getting new data, this could be important. On the other hand, people have been doing computer-aided data mining for a while. So it’s more of a progression than a revolution, in my expectation.

2. Using LLMs to actually generate original data and/or test hypotheses like experimenters: Large language models as economic agents: what can we learn from homo silicus? and Automated Social Science: Language Models as Scientist and Subjects

3. Generative AI for Economic Research: Use Cases and Implications for Economists

Korinek has a new supplemental update, current as of December 2024: LLMs Learn to Collaborate and Reason: December 2024 Update to "Generative AI for Economic Research: Use Cases and Implications for Economists," Published in the Journal of Economic Literature 61 (4)

4. For being comprehensive and early: How to Learn and Teach Economics with Large Language Models, Including GPT

5. For giving people proof of a phenomenon that many people had noticed and wanted to discuss: ChatGPT Hallucinates Non-existent Citations: Evidence from Economics

Alert: We will soon have an update for current web-enabled models! It would seem that hallucination rates are going down but the problem is not going away.

6. This was published back in 2023. “ChatGPT ranked in the 91st percentile for Microeconomics and the 99th percentile for Macroeconomics when compared to students who take the TUCE exam at the end of their principles course.” (note the “compared to”): ChatGPT has Aced the Test of Understanding in College Economics: Now What?

References          

Buchanan, J., Hill, S., & Shapoval, O. (2023). ChatGPT Hallucinates Non-existent Citations: Evidence from Economics. The American Economist, 69(1), 80-87. https://doi.org/10.1177/05694345231218454

Cowen, T., & Tabarrok, A. T. (2023). How to Learn and Teach Economics with Large Language Models, Including GPT. GMU Working Paper in Economics No. 23-18. https://ssrn.com/abstract=4391863 or http://dx.doi.org/10.2139/ssrn.4391863

Dell, M. (2025). Deep Learning for Economists. Journal of Economic Literature, 63(1), 5–58. https://doi.org/10.1257/jel.20241733

Geerling, W., Mateer, G. D., Wooten, J., & Damodaran, N. (2023). ChatGPT has Aced the Test of Understanding in College Economics: Now What? The American Economist, 68(2), 233-245. https://doi.org/10.1177/05694345231169654

Horton, J. J. (2023). Large Language Models as Simulated Economic Agents: What Can We Learn from Homo Silicus? arXiv Preprint arXiv:2301.07543.

Korinek, A. (2023). Generative AI for Economic Research: Use Cases and Implications for Economists. Journal of Economic Literature, 61(4), 1281–1317. https://doi.org/10.1257/jel.20231736

Manning, B. S., Zhu, K., & Horton, J. J. (2024). Automated Social Science: Language Models as Scientist and Subjects (Working Paper No. 32381). National Bureau of Economic Research. https://doi.org/10.3386/w32381