Writing Humanity’s Last Exam

When every frontier AI model can pass your tests, how do you figure out which model is best? You write a harder test.

That was the idea behind Humanity’s Last Exam, an effort by Scale AI and the Center for AI Safety to develop a large database of PhD-level questions that the best AI models still get wrong.

The effort has proven popular: the paper summarizing it has already been cited 91 times since its release on March 31st, and the main AI labs have been testing their new models on the exam. xAI announced today that its new Grok 4 model has the highest score yet on the exam, 44.4%.

Current leaderboard on the Humanity’s Last Exam site, not yet showing Grok 4

The process of creating the dataset is a fascinating example of a distributed academic mega-project, a growing trend that has also been important in efforts to replicate previous research. The organizers of Humanity’s Last Exam let anyone submit a question for their dataset, offering co-authorship to anyone whose question they accepted, and cash prizes to those who had the best questions accepted. In the end they wound up with just over 1000 co-authors on the paper (including yours truly as one very minor contributor), and gave out $500,000 to contributors of the very best questions (not me), which seemed incredibly generous until Scale AI sold a 49% stake in their company to Meta for $14.8 billion in June.

Source: Figure 4 of the paper

Here’s what I learned in the process of trying to stump the AIs and get questions accepted into this dataset:

  1. The AIs were harder to stump than I expected because the exam tested questions against frontier models rather than the free-tier models I was used to using on my own. If you think AI can’t answer your question, try a newer model.
  2. It was common for me to try a question that several models would get wrong, but at least one would still get right. For me this was annoying because questions could only be accepted if every model got them wrong. But of course if you want to get a correct answer, this means trying more models is good, even if they are all in the same tier. If you can’t tell what a correct answer looks like and your question is important, make sure to try several models and see if they give different answers.
  3. Top models are now quite good at interpreting regression results, even when you try to give them unusually tricky tables.
  4. AI still has weird weaknesses and blind spots; it can outperform PhDs in the relevant field on one question, then do worse than 3rd graders on the next. This exam specifically wanted PhD-level questions, where a typical undergrad not only couldn’t answer the question, but probably couldn’t even understand what was being asked. But it specifically excluded “simple trick questions”, “straightforward calculation/computation questions”, and questions “easily answerable by everyday people”, even if all the AIs got them wrong. My son had the idea to ask them to calculate hyperfactorials; we found some relatively low numbers that stumped all the AI models, but the human judges ruled that our question was too simple to count. On a question I did get accepted, I included an explanation for the human judges of why I thought it wasn’t too simple.
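For readers unfamiliar with the hyperfactorial mentioned above, here is a minimal Python sketch of the standard definition, H(n) = 1^1 × 2^2 × … × n^n. The function name is my own, and the post does not say which inputs were actually used to stump the models, so the values printed here are only illustrative.

```python
from math import prod


def hyperfactorial(n: int) -> int:
    """Return H(n) = 1^1 * 2^2 * ... * n^n, with H(0) = 1 (empty product)."""
    if n < 0:
        raise ValueError("n must be non-negative")
    # Python integers are arbitrary precision, so the result is exact.
    return prod(k ** k for k in range(1, n + 1))


# The values grow very quickly even for small n.
for n in range(1, 6):
    print(n, hyperfactorial(n))
```

The definition is a one-liner for a computer with exact integer arithmetic, which is consistent with the judges treating it as a straightforward computation question.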

I found this to be a great opportunity to observe the strengths and weaknesses of frontier models, and to get my name on an important paper. While the AI field is being driven primarily by the people with the chops to code frontier models, economists still have a lot we can contribute here, as Joy has shown. Any economist looking for the next way to contribute here should check out Anthropic’s new Economic Futures Program.

23 MSAs Produce Half of US GDP

The 23 blue-shaded MSAs in this map produce half of US GDP:

You might be tempted to think this map, like so many maps, is just a map of US population. It kind of is, but not completely. These 23 MSAs have 133 million people (as of the 2020 Census), or about 40% of the US population. That’s a lot, but it’s much less than half, which is the share of GDP they account for. In other words, these MSAs also tend to have above-average per capita income.
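The arithmetic behind that last step can be sketched in a few lines of Python. The 133 million and one-half figures come from the text; the total US population of about 331.4 million is the 2020 Census count, and the variable names are my own. Treat the output as a rough back-of-the-envelope check on GDP per capita, not a precise estimate.

```python
# Shares taken from the text above.
gdp_share = 0.50            # the 23 MSAs produce half of US GDP
msa_population = 133e6      # people living in the 23 MSAs (2020 Census)

# 2020 Census resident population of the US, roughly 331.4 million.
us_population = 331.4e6

pop_share = msa_population / us_population

# GDP per capita relative to the national average, inside and outside the 23 MSAs.
relative_inside = gdp_share / pop_share
relative_outside = (1 - gdp_share) / (1 - pop_share)

print(f"population share of the 23 MSAs: {pop_share:.1%}")
print(f"GDP per capita inside, relative to US average: {relative_inside:.2f}x")
print(f"GDP per capita outside, relative to US average: {relative_outside:.2f}x")
print(f"inside vs. outside: {relative_inside / relative_outside:.2f}x")
```

Under these inputs, the 23 MSAs come out roughly a quarter above the national average in GDP per capita, and roughly one and a half times the rest of the country.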

The three largest MSAs by population (NY, LA, Chicago) are also the three largest by GDP. But after the first three there are some interesting discrepancies. The San Francisco MSA is the 4th largest by GDP, but only the 12th largest by population — San Fran has a population similar to the Phoenix MSA, but almost double the GDP. San Francisco MSA has a very high GDP per capita (the third highest).

The San Jose MSA is also among these 23 largest MSAs for GDP, and also sticks out — it is the 13th largest by total GDP, but only the 36th largest by population. San Jose has a population similar to Cleveland and Nashville, but well over double the GDP of these two MSAs individually. In fact, there are 12 MSAs larger in population than San Jose, but that aren’t among these 23 MSAs that produce half of US GDP: places like St. Louis, Orlando, San Antonio, Pittsburgh, and Columbus. Silicon Valley really pulls up San Jose: it has the 2nd largest GDP per capita among MSAs, only beaten by much smaller Midland, Texas and its oil income.

Here is the full list of those 23 MSAs:

  1. New York-Newark-Jersey City, NY-NJ-PA
  2. Los Angeles-Long Beach-Anaheim, CA
  3. Chicago-Naperville-Elgin, IL-IN-WI
  4. San Francisco-Oakland-Berkeley, CA
  5. Dallas-Fort Worth-Arlington, TX
  6. Washington-Arlington-Alexandria, DC-VA-MD-WV
  7. Houston-The Woodlands-Sugar Land, TX
  8. Boston-Cambridge-Newton, MA-NH
  9. Seattle-Tacoma-Bellevue, WA
  10. Atlanta-Sandy Springs-Alpharetta, GA
  11. Philadelphia-Camden-Wilmington, PA-NJ-DE-MD
  12. Miami-Fort Lauderdale-Pompano Beach, FL
  13. San Jose-Sunnyvale-Santa Clara, CA
  14. Phoenix-Mesa-Chandler, AZ
  15. Minneapolis-St. Paul-Bloomington, MN-WI
  16. Detroit-Warren-Dearborn, MI
  17. San Diego-Chula Vista-Carlsbad, CA
  18. Denver-Aurora-Lakewood, CO
  19. Baltimore-Columbia-Towson, MD
  20. Austin-Round Rock-Georgetown, TX
  21. Charlotte-Concord-Gastonia, NC-SC
  22. Riverside-San Bernardino-Ontario, CA
  23. Tampa-St. Petersburg-Clearwater, FL

Economic Impact of Agricultural Worker Deportations Leads to Administration Policy Reversals

Here is a chart of the evolution of U.S. farm workforce between 1991 and 2022:

Source: USDA

A bit over 40% of current U.S. farm workers are illegal immigrants. In some regions and sectors, the percentage is much higher. The work is often uncomfortable and dangerous, and far from the cool urban centers. This is work that very few U.S.-born workers would consider doing, unless the pay was very high, so it would be difficult to replace the immigrant labor on farms in the near term. I don’t know how much the need for manpower would fall if cheap illegal workers were not available and farms instead turned to automation to maintain productivity.

It apparently didn’t occur to some members of the administration that deporting a lot of these workers (and frightening the rest into hiding) would have a crippling effect on American agriculture. Sure enough, there have recently been reports in some areas of workers not showing up and crops going unharvested.

It is difficult for me as a non-expert to determine how severe and widespread the problems actually are so far. Anti-Trump sources naturally emphasize the genuine problems that do exist and predict apocalyptic melt-down, whereas other sources are more measured. I suspect that the largest agribusinesses have kept better abreast of the law, while smaller operations have cut legal corners and may have that catch up to them. For instance, a small meat packer in Omaha reported operating at only 30% capacity after ICE raids, whereas the CEO of giant Tyson Foods claimed that “every one who works at Tyson Foods is authorized to do so,” and that the company “is in complete compliance” with all the immigration regulations.

With at least some of these wholly predictable problems from mass deportations now becoming reality, the administration is undergoing internal debates and policy adjustments in response. On June 12, President Trump very candidly acknowledged the issue, writing on Truth Social, “Our great Farmers and people in the hotel and leisure business have been stating that our very aggressive policy on immigration is taking very good, long-time workers away from them, with those jobs being almost impossible to replace…. We must protect our Farmers, but get the CRIMINALS OUT OF THE USA. Changes are coming!” 

The next day, ICE official Tatum King wrote regional leaders to halt investigations of the agricultural industry, along with hotels and restaurants. That directive was apparently walked back a few days later, under pressure from outraged conservative supporters and from Deputy White House Chief of Staff Stephen Miller. Miller, an immigration hard-liner, wants to double the ICE deportation quota, up to 3,000 per day.

This issue could go in various ways from here. Hard-liners on the left and on the right have a way of pushing their agendas to unpalatable extremes. It can be argued that the Democrats could easily have won in 2024 had their policies been more moderate. Similarly, if immigration hard-liners get their way now, I predict that the result will be their worst nightmare: a public revulsion against enforcing immigration laws in general. If farmers and restaurateurs start going bust, and food shortages and price spikes appear in the supermarket, public support for the administration and its project of deporting illegal immigrants will reverse in a big way. Some right-wing pundits would not be bothered by an electoral debacle, since their style is to stay constantly outraged, and (as the liberal news outlets currently demonstrate), it is easier to project non-stop outrage when your party is out of power.

An optimist, however, might see in this controversy an opening for some sort of long-term, rational solution to the farm worker issue. Agriculture Secretary Brooke Rollins has proposed expansion of the H-2A visa program, which allows for temporary agricultural worker residency to fill labor shortages. This is somewhat similar to the European guest worker programs, though with significant differences. H-2A requires the farmer to provide housing and take legal responsibility for his or her workers. H-2B visas allow for temporary non-agricultural workers, without as much employer responsibility. A bill was introduced into Congress with bipartisan support to modernize the H-2A program, so that legislative effort may have legs. Maybe there can be a (gasp!) compromise.

President Trump last week came out strongly in favor of this sort of solution, with a surprisingly positive take on the (illegal) workers who have worked diligently on a farm for years. By “put you in charge” he seems to refer to the responsibilities that H-2A employers undertake for their employees, and perhaps extending that to H-2B employers. He acknowledges that the far-right will not be happy, but hopes “they’ll understand.” From Newsweek:

“We’re working on legislation right now where – farmers, look, they know better. They work with them for years. You had cases where…people have worked for a farm, on a farm for 14, 15 years and they get thrown out pretty viciously and we can’t do it. We gotta work with the farmers, and people that have hotels and leisure properties too,” he said at the Iowa State Fairgrounds in Des Moines on Thursday.

“We’re gonna work with them and we’re gonna work very strong and smart, and we’re gonna put you in charge. We’re gonna make you responsible and I think that that’s going to make a lot of people happy. Now, serious radical right people, who I also happen to like a lot, they may not be quite as happy but they’ll understand. Won’t they? Do you think so?”

We shall see.

It’s not AGI if it has a dial you can adjust to produce your preferred falsehoods

It’s not AGI, it’s barely even regular AI, when an LLM is this heavily directed. This appears to be very real over on Twitter. What’s most telling is the thinness of the prompts that yield very specific responses that, suffice it to say, Grok would not have provided even 3 months ago.

Musk has adjusted Grok’s algorithm so it’s now a neo-Nazi. Pretty cool that almost every progressive commentator, elected official and organization still uses Musk’s X algorithm to communicate with the public! Good job guys.

Max Berger (@maxberger.bsky.social) 2025-07-06T17:39:08.885Z

I’ll simply say this: no one has declined more in my estimation in my entire life than Elon Musk. I thought he was an engineering genius not even 5 years ago, perhaps awkward in some ways, but earnest. Now he is (or is working very diligently to project an identity of) a white supremacist desperate to play off of traditional racist and antisemitic fears to maintain his own status and influence. His ambition and resources have been combined with a monstrous agenda, and the world is much worse for it. It’s tragic in every way.

With regard to AI, there needs to be more discussion of the market for AIs, plural. I think a lot of people are operating off the assumption that AI will be like Google or VHS. A natural monopoly; one AI to rule them all and bind them. I’m not so sure. I think there is a very real chance that AIs will find niches. That different algorithms will create different families of bespoke AIs. It feels like the world is already siloed into echo chambers of entertainment- and identity-based news feeds. If AI allows us each to get bespoke answers, serving our own personal confirmation biases, to each and every question, is that better or worse? In a counterintuitive way, it could actually be better. You can’t get communities and cults of one. It might be better for the world if the news became something you couldn’t create effective propaganda out of.

Hallucination as a User Error

You don’t use a flat head screwdriver to drill a hole in a board. You should know to use a drill.

I appreciate getting feedback on our manuscript, “LLM Hallucination of Citations in Economics Persists with Web-Enabled Models,” via X/Twitter. @_jannalulu wrote: “that paper only tested 4o (which arguably is a bad enough model that i almost never use it).”

Since the scope and frequency of hallucinations came as a surprise to many LLM users, they have often been used as a ‘gotcha’ to criticize AI optimists. People, myself included, have sounded the alarm that hallucinations could infiltrate articles, emails, and medical diagnoses.

The feedback I got from power users on Twitter this week made me think that there might be a cultural shift in the medium term. (Yes, we are always looking for someone to blame.) Hallucinations will be considered the fault of the human user who should have:

  1. Used a better model (learn your tools)
  2. Written a better prompt (learn how to use your tools)
  3. Assigned the task to the right tool rather than an LLM (it’s been known for over 2 years that general-purpose LLMs hallucinate citations). What did you expect from “generative” AI? LLMs are telling you what literature ought to exist as opposed to what does exist.

My Perfunctory Intern

A couple years ago, my co-blogger Mike described his productive but novice intern. The helper could summarize expert opinion, but they had no real understanding of their own. To boot, they were fast and tireless. Of course, he was talking about ChatGPT. Joy has also written in multiple places about the errors made by ChatGPT, including fake citations.

I use ChatGPT Pro, which has Web access, and my experience is that it is not so tireless. Much like Mike, I have used ChatGPT to help me write Python code. I know the basics of Python, and how to read a lot of it. However, the multitude of methods and possible arguments are not nestled firmly in my skull. I’m much faster at reading Python code than at writing it. Therefore, ChatGPT has been amazing… Mostly.

I have found that ChatGPT is more like an intern than many suppose:

Continue reading

The Ugly Gray Rhino Gathers Speed

A black swan is a crisis that comes out of nowhere. A gray rhino, by contrast, is a problem we have known about for a long time, but can’t or won’t stop, that will at some point crash into a full-blown crisis.

The US national debt is a classic gray rhino. The problem has slowly been getting worse for 25 years, but the crisis still seems far enough off that almost no one wants to incur real costs today to solve the problem. During the 2007-2009 financial crisis and the 2020-2021 Covid pandemic we had good reasons to run deficits. But we’ve ignored the Keynesian solution of paying back the deficits incurred in bad times with surpluses in good times.

We are currently in reasonably good economic times, but about to pass a mega-spending bill that blows the deficit up from its already-too-high levels. At a time when we should be running a surplus, we are instead running a deficit around 6% of GDP:

Source: Congressional Budget Office

Our ‘primary deficit’ is lower, a more manageable 3% of GDP. But if interest rates go higher, either for structural reasons or because of a loss of confidence in the US government’s willingness to pay its debts, the total deficit could spiral higher rapidly. The CBO optimistically assumes that the interest rate on 10-year Treasuries will fall below 4% in the 2030s, from 4.3% today:

Source: Congressional Budget Office

But their scoring of H.R. 1 (“One Big Beautiful Bill Act”) shows it adding $3 trillion to the debt over the next 10 years, increasing the deficit by ~1% of GDP per year.
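To make the interest-rate sensitivity and the “~1% of GDP per year” figure concrete, here is a rough Python sketch. The 3% primary deficit and the $3 trillion over 10 years come from the text. The debt-to-GDP ratio of about 100% and the nominal GDP of about $30 trillion are my own round-number assumptions, not figures from the CBO, and the model is static: it ignores the fact that the debt itself grows each year a deficit is run.

```python
def total_deficit_share(primary_deficit, debt_to_gdp, avg_interest_rate):
    """Total deficit / GDP = primary deficit / GDP + (debt / GDP) * average interest rate."""
    return primary_deficit + debt_to_gdp * avg_interest_rate


PRIMARY_DEFICIT = 0.03  # from the text: about 3% of GDP
DEBT_TO_GDP = 1.00      # ASSUMPTION: debt held by the public roughly equals GDP

# How the total deficit moves with the average interest rate paid on the debt.
for rate in (0.03, 0.04, 0.05, 0.06):
    share = total_deficit_share(PRIMARY_DEFICIT, DEBT_TO_GDP, rate)
    print(f"average rate {rate:.0%} -> total deficit {share:.1%} of GDP")

# Sanity check on the bill's cost as a share of GDP.
BILL_COST = 3e12        # from the text: $3 trillion over the scoring window
YEARS = 10              # from the text
ASSUMED_GDP = 30e12     # ASSUMPTION: nominal US GDP of roughly $30 trillion

bill_share = BILL_COST / YEARS / ASSUMED_GDP
print(f"bill adds about {bill_share:.1%} of GDP per year")
```

With these assumptions, an average rate of 3% reproduces the roughly 6% total deficit mentioned earlier, and each additional percentage point of interest adds about a point of GDP to the deficit.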

I already suspected this gray rhino would eventually cause a crisis, but this bill and the milieu that produced it turn that into a near guarantee: nothing stops the deficit train until we hit a full-blown crisis. That crisis is no longer just a long-term issue for your kids and grandkids to worry about; you will see it in 7 years or so. Unfortunately, that is still far enough away that current politicians have no incentive to take costly steps to avoid it. In fact, deficits will probably make the economy stronger for a year or two before they start making things worse, which is convenient for all the Congresspeople up for election in less than 2 years.

Here are the ways I see this playing out, from most to least likely:

  1. By around 2032, either the slowly aging population or a sudden spike in interest rates forces the government to touch at least one of the third rails of American politics: cut Social Security, cut Medicare, or substantially raise taxes on the middle class (explicitly or through inflation).
  2. We get bailed out again by God’s Special Providence for fools, drunks, and the United States of America. AI brings productivity miracles bigger than those of computers and the internet, letting GDP grow faster than our debts.
  3. We default on the national debt (but this is a risky option because we will still want to run big deficits, and lenders will only lend if they expect to get paid back).
  4. We do all the smart policy reforms that economists recommend in time to head off the crisis and stop the rhino. Medical spending falls without important services being cut thanks to supply-side reforms or cheap miracle drugs (GLP-1s going off patent?).

I’m hoping of course for numbers 2 and 4, but after this bill I’m expecting the rhino.

Renting an Electric Vehicle in Norway

I recently spent a week in Norway with my family. Highly recommended overall. While we were mostly able to get around the country by train, we needed to rent a car to get to a small, remote village where my great-grandfather came from, and where I still have relatives. Prior to the drilling of several massive car tunnels in the 1980s and 1990s, Fjaerland was only accessible by boat.

And if you are renting a car in Norway today, it’s highly likely you will be renting an electric car (unless you specifically ask for a gas-powered car, as the older German couple in front of me at the rental counter did). The vast majority of new cars sold in Norway (over 90%) are electric, and since most rentals are new cars, that’s what they have.

Norway has made the biggest push in the world through public policy to encourage EV adoption, both for buying cars and for building up a charging infrastructure. In this post I will primarily focus on the consumer experience of renting an EV, though the public policy surrounding it is worth a discussion too.

So how was the experience?

Continue reading

Central Banks Are Buying Gold; Should You?

Anyone who reads financial headlines knows that gold prices have soared in the past year. Why?

Gold has historically been a relatively stable store of value, and that role seems to be returning after decades of relative neglect. Official numbers show sharply increased buying by the world’s central banks, led by China, Poland, and Azerbaijan in early 2025. Russia, India and Turkey have also been major buyers. There is widespread conviction that actual gold purchases are appreciably higher than the officially reported numbers, with some buyers under-reporting to side-step President Trump’s threatened extra tariffs on nations seen as de-dollarizing.

I think the most proximate cause for the sharp run-up in gold prices in the past twelve months has been the profligate U.S. federal budget deficit, under both administrations. This is convincing key world actors that the dollar will become increasingly devalued over time, no matter which party is in power. Thus, it is prudent to get out of dollars and dollar-denominated assets like U.S. T-bonds.

Trump’s erratic and offensive policies and statements in 2025 have added to the desire to diversify away from U.S. assets. This is in addition to the alarm in non-Western countries over the impoundment of Russian dollar-related assets in connection with the ongoing Russian invasion of Ukraine. Also, there is something of a self-fulfilling momentum aspect to any asset: the more it goes up, the more it is expected to go up.

This informative chart of central bank gold net purchasing is courtesy of Weekend Investing:

Interestingly, central banks were net sellers in the 1990s and early 2000s; it was an era of robust economic growth, gold prices were stagnant or declining, and it seemed pointless to hold shiny metal bars when one could invest in financial assets with higher rates of return. The Global Financial Crisis of 2008-2009 apparently sobered up the world as to the fragility of financial assets, making solid metal bars look pretty good. Then, as noted, the Western reaction to the Russian attack on Ukraine spurred central bank gold buying, as this blog predicted back in March 2022.

Private investors are also buying gold, for similar reasons as the central banks. Gold offers portfolio diversification as a clear alternative to all paper assets. In theory it should offer something of an inflation hedge, but its price does not always track with inflation or interest rates.

Here is how gold (using GLD fund as a proxy) has fared versus stocks (S&P 500 index) and intermediate-term U.S. T-bonds (IEF fund) in the past year:

Gold is up by 40%, compared to 12.6% for stocks. That is huge outperformance. This was driven largely by the fact that gold rose strongly in the Feb-April timeframe, while stocks were collapsing.

Below we zoom out to look at the past ten years, and include the intermediate-term T-bond fund IEF:

Gold prices more than doubled from 2008 to 2011, then suffered a long, painful decline over the next two years. Prices were then fairly stagnant for the mid-2010s, rose significantly 2019-2020, then stagnated again until taking off in 2023. Stocks have been much more erratic. Most of the time stock returns were above gold, but the 2020 and 2024 plunges brought stocks down to rough parity with gold. Since about 2019, T-bonds have been pathetic; pity the poor investor who has been (according to traditional advice) 40% invested in investment-grade bonds.

How to invest in gold? Hard-core gold bugs want the actual coins (no one can afford a full bullion bar) to rub between their fingers and keep in their own physical custody. You can buy coins from online dealers or local dealers. Coins are available from the U.S. Mint, but reportedly their mark-ups are often higher than on the secondary market.

An easier route for most folks is to buy into a gold-backed stock fund. The biggest is GLD, which has over $100 billion in assets. There has long been an undercurrent of suspicion among gold bugs that GLD’s gold is not reliably audited or that it is loaned out; they refer derisively to GLD as “paper gold” or gold derivatives.  The fund itself claims that it never lends out its gold, and that its bars are held in the vaults of the custodian banks JPMorgan Chase Bank, N.A. and HSBC Bank plc, and are independently audited. The suspicious crowd favors funds like Sprott Physical Gold Trust, PHYS. PHYS is claimed to have a stronger legal claim on its physical gold than GLD. However, PHYS is a closed-end fund, which means it does not have a continuous creation process like GLD, an open-end ETF. This can lead to discrepancies between the fund’s share price and the value of its gold holdings. It does seem like PHYS loses about 1% per year relative to GLD.
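The price-versus-holdings discrepancy for a closed-end fund is usually expressed as a premium or discount to net asset value (NAV). Here is a minimal Python sketch of that calculation, plus how a 1% annual lag would compound. The share price and NAV below are made-up numbers for illustration, not actual PHYS quotes, and the function names are my own.

```python
def premium_to_nav(share_price: float, nav_per_share: float) -> float:
    """Premium (positive) or discount (negative) of the market price relative to NAV."""
    return share_price / nav_per_share - 1.0


def cumulative_lag(annual_lag: float, years: int) -> float:
    """Total shortfall after compounding a fixed annual lag for a number of years."""
    return 1.0 - (1.0 - annual_lag) ** years


# HYPOTHETICAL quotes: a fund trading at $25.00 per share while holding $25.50 of gold per share.
discount = premium_to_nav(25.00, 25.50)
print(f"premium/discount to NAV: {discount:+.2%}")

# The roughly 1% per year lag mentioned above, compounded over a decade.
print(f"cumulative lag over 10 years: {cumulative_lag(0.01, 10):.1%}")
```

A negative result means the shares trade below the value of the gold behind them. An open-end ETF's creation and redemption process tends to keep that gap small, while a closed-end fund has no such mechanism.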

Disclaimer: Nothing here should be taken as advice to buy or sell any security.