Did Apple’s Recent “Illusion of Thinking” Study Expose Fatal Shortcomings in Using LLMs for Artificial General Intelligence?

Researchers at Apple last week published a paper with the provocative title, “The Illusion of Thinking: Understanding the Strengths and Limitations of Reasoning Models via the Lens of Problem Complexity.” This paper has generated uproar in the AI world. Having “The Illusion of Thinking” right there in the title is pretty in-your-face.

Traditional Large Language Model (LLM) artificial intelligence programs like ChatGPT train on massive amounts of human-generated text to be able to mimic human outputs when given prompts. A recent trend (mainly starting in 2024) has been the incorporation of more formal reasoning capabilities into these models. The enhanced models are termed Large Reasoning Models (LRMs). Now some leading LLMs like OpenAI’s GPT, Claude, and the Chinese DeepSeek exist both in regular LLM form and also as LRM versions.

The authors applied both the regular (LLM) and “thinking” LRM versions of Claude 3.7 Sonnet and DeepSeek to a number of mathematical-type puzzles. OpenAI’s o-series models were used to a lesser extent. An advantage of these puzzles is that researchers can, while keeping the basic form of the puzzle, dial in more or less complexity.

They found, among other things, that the LRMs did well up to a certain point, then suffered “complete collapse” as complexity was increased. Also, at low complexities, LLMs actually outperform LRMs. And (perhaps the most vivid evidence of lack of actual understanding on the part of these programs), when they were explicitly offered an efficient direct solution algorithm in the prompt, the programs did not take advantage of it, but instead just kept grinding away in their usual fashion.

As might be expected, AI skeptics were all over the blogosphere, saying, I told you so, LLMs are just massive exercises in pattern matching, and cannot extrapolate outside of their training set. This has massive implications for what we can expect in the near or intermediate future. Among other things, the optimism about AI progress is largely what is fueling the stock market, and also capital investment in this area: Companies like Meta and Google are spending ginormous sums trying to develop artificial “general” intelligence, paying for ginormous amounts of compute power, with those dollars flowing to firms like Microsoft and Amazon building out data centers and buying chips from Nvidia. If the AGI emperor has no clothes, all this spending might come to a screeching crashing halt.

Ars Technica published a fairly balanced account of the controversy, concluding that, “Even elaborate pattern-matching machines can be useful in performing labor-saving tasks for the people that use them… especially for coding and brainstorming and writing.”

Comments on this article included one like:

LLMs do not even know what the task is, all it knows is statistical relationships between words.   I feel like I am going insane. An entire industry’s worth of engineers and scientists are desperate to convince themselves a fancy Markov chain trained on all known human texts is actually thinking through problems and not just rolling the dice on what words it can link together.

And

if we equate combinatorial play and pattern matching with genuinely “generative/general” intelligence, then we’re missing a key fact here. What’s missing from all the LLM hubris and enthusiasm is a reflexive consciousness of the limits of language, of the aspects of experience that exceed its reach and are also, paradoxically, the source of its actual innovations. [This is profound, he means that mere words, even billions of them, cannot capture some key aspects of human experience]

However, the AI bulls have mounted various come-backs to the Apple paper. The most effective I know of so far was published by Alex Lawsen, a researcher at Open Philanthropy. Lawsen’s rebuttal, titled “The Illusion of the Illusion of Thinking,” was summarized by Marcus Mendes. To summarize the summary, Lawsen claimed that the models did not in general “collapse” in some crazy way. Rather, the models in many cases recognized that they would not be able to solve the puzzles given the constraints input by the Apple researchers. Therefore, they (rather intelligently) did not try to waste compute power by grinding away to a necessarily incomplete solution, but just stopped. Lawsen further showed that the ways Apple ran the LRM models did not allow them to perform as well as they could. When he made a modest, reasonable change in the operation of the LRMs,

Models like Claude, Gemini, and OpenAI’s o3 had no trouble producing algorithmically correct solutions for 15-disk Hanoi problems, far beyond the complexity where Apple reported zero success.

Lawsen’s conclusion: When you remove artificial output constraints, LRMs seem perfectly capable of reasoning about high-complexity tasks. At least in terms of algorithm generation.
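To make the "algorithm generation" point concrete, here is a minimal Python sketch of the standard recursive Tower of Hanoi procedure. It is not the prompt or code used in either paper; the function names `hanoi_moves` and `is_valid_solution` are my own, purely for illustration. The procedure itself is only a few lines long, while the full list of moves for n disks has 2^n - 1 entries, so writing out every move for 15 disks is a very different task from describing how to generate them.

```python
def hanoi_moves(n, source="A", target="C", spare="B"):
    """Yield (from_peg, to_peg) moves that solve an n-disk Tower of Hanoi."""
    if n == 0:
        return
    # Move the top n-1 disks out of the way, onto the spare peg.
    yield from hanoi_moves(n - 1, source, spare, target)
    # Move the largest remaining disk to the target peg.
    yield (source, target)
    # Move the n-1 disks from the spare peg on top of it.
    yield from hanoi_moves(n - 1, spare, target, source)


def is_valid_solution(n, moves):
    """Replay the moves; fail if a larger disk ever lands on a smaller one."""
    pegs = {"A": list(range(n, 0, -1)), "B": [], "C": []}
    for src, dst in moves:
        if not pegs[src]:
            return False
        disk = pegs[src].pop()
        if pegs[dst] and pegs[dst][-1] < disk:
            return False
        pegs[dst].append(disk)
    # Solved when every disk sits on peg C, largest at the bottom.
    return pegs["C"] == list(range(n, 0, -1))


if __name__ == "__main__":
    for n in (3, 10, 15):
        moves = list(hanoi_moves(n))
        print(n, len(moves), is_valid_solution(n, moves))
```

For 15 disks the generator produces 32,767 moves, which gives a sense of how quickly an exhaustive move-by-move answer outgrows a fixed output budget even though the underlying rule stays the same.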

And so, the great debate over the prospects of artificial general intelligence will continue.

Kayfabe in the political marketplace

Kayfabe: the tacit agreement to behave as if something is real, sincere, or genuine when it is not

The term comes from wrestling, which is fitting because for years I’ve been stealing Dana Gould’s line “Politics is just professional wrestling in suits.” There are limits to my casual theft, though. I will not pretend that I am the first to observe that politics has deeply internalized the kayfabe code of vehemently declaring beliefs or expectations in no way actually held while simultaneously understanding that your rivals are doing the same.

I am curious whether you can undermine your own political agenda or influence by going too far, by committing too much to a fictional worldview. Much like Serpico, if you go too deep you can lose track of the real you. Grandstanding in front of the cameras is one thing, but if you want to trade in the political marketplace, you need to be able to credibly do so in good faith. I can’t help but wonder if part of the inability of Congress to assert its constitutional power in the face of Executive overreach is a political market failure. Has the information and signal quality between representatives been so eroded that prices and, in turn, exchanges can’t emerge?

On a more optimistic note, however, I expect that at some point a political party will find advantage in having better norms of credible signaling, if only because they will have an easier time solving their own collective action problem. The question remains, though, at what point will the political advantage of superior collective action dominate the electoral advantage of earnestly lying to voters?

Waymo and Pictures Online

Memes on Twitter are at least as important today as political cartoons of the 19th and 20th centuries. Two memes went viral this week, surrounding the events of protests in Los Angeles.  

The first meme reflects how Lord of the Rings is deeply embedded in American culture. Two million views means that most people don’t need the joke explained. (Part of the Lord of the Rings story is that the wise and powerful elves abandon chaotic Middle-earth, sailing from the Grey Havens to safety. Similarly, the Waymo cars quietly drove themselves to safety away from Los Angeles after several had been vandalized and burned.)

I worry that The Lord of the Rings has made us too optimistic. Americans rarely tolerate stories that do not have happy endings. Has that made it hard for us to understand global events, or impaired our ability to accurately predict how most battles will end?

The next viral meme about the Waymos has two tweets to track. The originator of the joke got half a million views. Someone who added AI-generated images to the original text got half a million views as well.

The four quadrants form is a common meme format. Starting in the upper left, perhaps the best way to explain this is that a rigid socialist might resent Waymo as a symbol of Big Tech. At the very least, a socialist who wields state control might want to nationalize Waymo if there are going to be autonomous cars.

“Protect the Waymos” voices support for the police and traditional property rules. This is a joke, keep in mind, so it’s not meant to make perfect sense. The Libertarian Right might be more likely to support individuals protecting themselves with their own weapons as opposed to relying on the police state.

Compare the meme world to the news. I feel like this might become quaint soon, so here’s what it looks like to use the Google News search function.

The Los Angeles Times reports

The autonomous ride-hailing service Waymo has suspended operations in downtown Los Angeles after several of its vehicles were set on fire during protests against immigration raids in the area.

At least five Waymo vehicles were destroyed over the weekend, the company said. Waymo removed its vehicles from downtown but continues to operate in other parts of Los Angeles.

The flaming Waymo image is all over the news and internet. For one thing, people are interested in it because it’s real. This was supposed to be the year of deepfakes, and yet it’s mostly real images and real gaffes that are still making the news.

At this moment in time, most Americans do not yet have Waymo in their cities. Oddly enough, losing an inventory of several cars might be well worth the publicity the company is receiving. Two million viewers of the video of the little driverless cars making an orderly exit from Los Angeles might come away thinking driverless cars are safe and sensible.

Here are more posts from my long-standing interest in cartoons and internet culture.

Kyla Scanlon also felt like the little cars on fire were worth a newsletter post. She wrote

We scroll past the burning car to see arguments about whether the burning was justified, then scroll past those to see memes about the arguments, then scroll past those to see counter-memes. The cycle feeds itself!

Retiring for $100 per Month

Everybody follows a different path. Sometimes that path includes a late start on saving for retirement. Say that you have $0 in your retirement account right now. Is it too late? What can you get as a result of contributing $100 per month? Maybe more than you think.

Let’s start with an annuity equation that tells us our balance at retirement with some assumptions baked in. Let’s assume that we have zero dollars saved and contribute $100 per month. What rate of return do we earn? The S&P earns an average of 10% per year, which may not keep happening. We can conservatively assume 7.5%, but there are other concerns. Taxes and inflation will both eat away at that. Let’s subtract 2.5% for inflation with the Fisher approximation, leaving a real rate of return of 5%. We’ll chop off 20% due to taxes*. Below is the annuity equation that tells us the balance at retirement, depending on how many years from now you retire.

Assuming that you retire at 65 years of age, the graph below describes your balance at retirement depending on the age at which you started saving $100 per month. Of course, it’s not the balance that most people are worried about. Rather, we care about the implied monthly retirement check. The graph describes that on the right axis too, assuming that constant real payments will be made forever as perpetuity payments. We can see that getting started early matters a lot. But starting at age 40 still gets you real monthly retirement payments that are just shy of $200. That’s not too shabby.
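The annuity equation itself did not survive in this excerpt, so here is a small Python sketch of one way to reproduce the numbers. It rests on assumptions of mine, not a confirmed reading of the original formula: a 5% real annual return compounded monthly, contributions made at the end of each month, and the 20% tax applied to the perpetuity payment rather than to the rate of return. The function names are hypothetical.

```python
def balance_at_retirement(monthly_contribution, annual_real_rate, years):
    """Future value of an ordinary annuity with monthly compounding."""
    i = annual_real_rate / 12      # monthly real rate (assumption: simple division)
    n = years * 12                 # number of monthly contributions
    if i == 0:
        return monthly_contribution * n
    return monthly_contribution * ((1 + i) ** n - 1) / i


def monthly_perpetuity_payment(balance, annual_real_rate, tax_rate):
    """After-tax monthly payment if the balance is never drawn down."""
    # Assumption: the 20% tax is taken out of each payment.
    return balance * (annual_real_rate / 12) * (1 - tax_rate)


if __name__ == "__main__":
    for start_age in (25, 30, 40, 50):
        years = 65 - start_age
        balance = balance_at_retirement(100, 0.05, years)
        payment = monthly_perpetuity_payment(balance, 0.05, 0.20)
        print(f"start at {start_age}: balance ${balance:,.0f}, "
              f"monthly payment ${payment:,.2f}")
```

Under these particular assumptions, starting at age 40 works out to a monthly payment a little under $200, in line with the figure quoted above. Other readings of the assumptions (for example, taxing the return instead of the payment) would give somewhat different numbers.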

Of course, nobody receives all of the perpetuity payments.


The End of Easy Student Loans

The Senate Health, Education, Labor and Pensions Committee is proposing to cut off student loans for programs whose graduates earn less than the median high school graduate. The House proposed a risk-sharing model where colleges would partly pay back the federal government when their students fail to pay back loans themselves. Both the House and Senate propose to cap how much students can borrow for graduate loans. Both would reduce federal spending on higher ed by about $30-$35 billion per year, cutting the size of the $700 billion higher ed sector by 4-5%. I expected that something like this would happen eventually, especially after the student loan forgiveness proposals of 2022:

While we aren’t getting real reform now, I do think forgiveness makes it more likely that we’ll see reform in the next few years. What could that look like?

The Department of Education should raise its standards and stop offering loans to programs with high default rates or bad student outcomes. This should include not just fly-by-night colleges, but sketchy masters degree programs at prestigious schools.

Colleges should also share responsibility when they consistently saddle students with debt but don’t actually improve students’ prospects enough to be able to pay it back. Economists have put a lot of thought into how to do this in a manner that doesn’t penalize colleges simply for trying to teach less-prepared students.

I’d bet that some reform along these lines happens in the 2020s, just like the bank bailouts of 2008 led to the Dodd-Frank reform of 2010 to try to prevent future bailouts. The big question is, will this be a pragmatic bipartisan reform to curb the worst offenders, or a Republican effort to substantially reduce the amount of money flowing to a higher ed sector they increasingly dislike?

Of course, there is a lot riding on the details. How exactly do you calculate the income of graduates of a program compared to high school grads? The Senate proposal explains their approach starting on page 58. They want to compare the median income of working students 4 years after leaving their program (whether they graduated or dropped out, but exempting those in grad school) to the median income of those with only a high school diploma who are age 25-34, working, and not in school.

Nationally I calculate that this would make for a floor of $31,000. That is, the median student who is 4 years out from your program and is working should be earning at least $31k. In practice the bill would implement a different number for each state. This seems like a low bar in general, though you could certainly quibble with it. For instance, those 4 years out from a program may be closer to age 25 than age 34, but income typically rises with age during those years. If you compare them to 26-year-old high school grads, the national bar would be just $28k.
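As a rough illustration of how such a test could be applied to a single program, here is a minimal Python sketch. The record layout, the function names, and the sample data are all hypothetical, and treating a program that exactly meets the floor as passing is my assumption; the bill's actual data definitions are in the Senate text.

```python
from statistics import median


def program_median_earnings(records):
    """Median earnings of former students who are working and not in grad school."""
    eligible = [
        r["earnings"]
        for r in records
        if r["working"] and not r["in_grad_school"]
    ]
    return median(eligible)


def passes_earnings_floor(records, floor):
    """True if the program's median meets the high school benchmark."""
    return program_median_earnings(records) >= floor


# Made-up former students, 4 years after leaving a program.
sample_program = [
    {"earnings": 24000, "working": True, "in_grad_school": False},
    {"earnings": 29000, "working": True, "in_grad_school": False},
    {"earnings": 35000, "working": True, "in_grad_school": False},
    {"earnings": 0, "working": False, "in_grad_school": False},     # excluded: not working
    {"earnings": 18000, "working": True, "in_grad_school": True},   # excluded: in grad school
]

if __name__ == "__main__":
    for floor in (31000, 28000):
        print(floor, passes_earnings_floor(sample_program, floor))
```

With this made-up program, the same group of students fails a $31k floor but clears a $28k one, which is why the choice of comparison group matters.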

What sorts of programs have graduates making less than $31k per year?


The Growth in Wealth is Not Primarily Driven by Rising Home Prices

As I have discussed in many previous blog posts, young people today have a lot more wealth than past generations at the same point in their life. But we also know that housing prices have increased dramatically in recent years, and that for most families their home is their largest source of wealth.

Does this imply that the increase in wealth young Americans have seen is primarily driven by increased housing prices? If so, this would paint a less optimistic picture of the wealth of young people today, since the value of your home is something that you usually can’t easily convert into other consumption.

If we look at the past 5 years (2019Q4 to 2024Q4), the total wealth of US households under the age of 40 increased by $5 trillion, in nominal terms. That’s not adjusted for inflation, but we don’t need to do so because we can look at how much each asset class increased in nominal terms as well. The total value of assets for households under age 40 increased by $5.86 trillion.

Here’s how the various classes of assets have increased since 2019Q4:


The Comeback of Gold as Money

According to Merriam-Webster, “money” is: “something generally accepted as a medium of exchange, a measure of value, or a means of payment.” Money, in its various forms, also serves as a store of value. Gold has maintained the store of value function all through the past centuries, including our own times; as an investment, gold has done well in the past couple of decades. I plan to write more later on the investment aspect, but here I focus on the use of physical gold as a means of payment or exchange, or as backing a means of exchange.

Gold, typically in the form of standardized coins, served as a means of exchange for thousands of years. Starting in the Renaissance, however, banks started issuing paper certificates which were exchangeable for gold. For daily transactions, the public found it more convenient to handle these bank notes than the gold pieces themselves, and so these notes were used instead of gold as money.

In the late nineteenth and early twentieth centuries, leading paper currencies like the British pound and the U.S. dollar were theoretically backed by gold; one could turn in a dollar and convert it to the precious metal. Most countries dropped the convertibility to gold during the Great Depression of the 1930s, so their currencies became entirely “fiat” money, not tied to any physical commodity. For the U.S. dollar, there was limited convertibility to gold after World War II as part of the Bretton Woods system of international currencies, but even that convertibility ended in 1971. In fact, it was illegal for U.S. citizens to own much in the way of physical gold from FDR’s (infamous?) executive order in 1933 until the ban was lifted under Gerald Ford at the end of 1974.

So gold has been essentially extinct as active money for nearly a hundred years. The elite technocrats who manage national financial affairs have been only too happy to dance on its grave. Keynes famously denounced the gold standard as a “barbarous relic”, standing in the way of purposeful management of national money matters.

However, gold seems to be making something of a comeback, on several fronts. Most notably, several U.S. states have promoted the use of gold in transactions. Deep-red Utah has led the way.  In 2011, Utah passed the Legal Tender Act, recognizing gold and silver coins issued by the federal government as legal tender within the state. This legislation allows individuals to transact in gold and silver coins without paying state capital gains tax.  The Utah House and Senate passed bills in 2025 to authorize the state treasurer to establish a precious metals-backed electronic payment platform, which would enable state vendors to opt for payments in physical gold and silver. The Utah governor vetoed this bill, though, claiming it was “operationally impractical.” 

Meanwhile, in Texas:

The new legislation, House Bill 1056, aims to give Texans the ability, likely through a mobile app or debit card system, to use gold and silver they hold in the state’s bullion depository to purchase groceries or other standard items.

The bill would also recognize gold and silver as legal tender in Texas, with the caveat that the state’s recognition must also align with currency laws laid out in the U.S. Constitution.

“In short, this bill makes gold and silver functional money in Texas,” Rep. Mark Dorazio (R-San Antonio), the main driving force behind the effort, said during one 2024 presentation. “It has to be functional, it has to be practical and it has to be usable.”

Arkansas and Florida have also passed laws allowing the use of gold and silver as legal tender. A potential problem is that under current IRS law, gold and silver are generally classified as collectibles and subject to potential capital gains taxes when transactions occur. Texas legislator Dorazio has argued that liability would go away if the metals are classified as functional money, although he’s also acknowledged the tax issue “might end up being decided by the courts.”

But as Europeans found back in the day, carrying around actual clinking gold coins for purchasing and making change is much more of a hassle than paper transactions. And so, various convenient payment or exchange methods, backed by physical gold, have recently arisen.

Since it is relatively easy and lucrative to spawn a new cryptocurrency (which is why there are thousands of them), it is not surprising that there are now several coins supposedly backed by bullion. These include Paxos Gold (PAXG) and Tether Gold (XAUT). The gold of Paxos is stored in the worldwide vaults of Brinks, and is regularly audited by a credible third party. Tether gold supposedly resides somewhere in Switzerland. The firm itself is incorporated in the British Virgin Islands. Tether in general does not conduct regular audits; its official statements dance around that fact. These crypto coins, like bullion itself or various funds like GLD that hold gold, are in practice probably mainly an investment vehicle (store of value), rather than an active medium of exchange.

However, getting down to the consumer level of payment convenience, we now have a gold-backed credit card (Glint) and debit card (VeraCash Mastercard). Both of these hold their gold in Swiss vaults. The funds you place with these companies have gold allocated to them, so these are a (seemingly cost-effective) means to own gold. If you get nervous, you can actually (subject to various rules) redeem your funds for actual shiny yellow metal.

United States Vs Cruikshank (1876); ICE vs Los Angeles (ongoing)

“Cruikshank played a crucial role in terminating Reconstruction and launching the one-party, segregationist regime of “Jim Crow” that prevailed in the South until the 1960s. The circuit court opinion of Justice Joseph Bradley unleashed the second and decisive phase of Reconstruction-era terrorism…” – Pope, James Gray. “Snubbed Landmark: Why United States v. Cruikshank (1876) Belongs at the Heart of the American Constitutional Canon.” Harv. CR-CLL Rev. 49 (2014): 385.

The Civil War was over, but the seceding states remained in open conflict with the federal government. Southern states, particularly those with majority Black populations, were desperate to terminate institutional reconstruction and purge the federal agents tasked with ensuring Black voting rights. The levers of state government were still in White hands, but that control was becoming tenuous. It is not wholly outlandish to suggest that Jim Crow as we know it may never have come to be if the US Supreme Court had not handed down a now infamous decision that effectively left Black men and women to fend for themselves. Freed from slavery only 13 years earlier, they now had to contend with state and local governments intent on maintaining the status quo of White supremacy in every way possible. It would be nearly a hundred years before the Voting Rights Act of 1965 would begin to restore the franchise to Black individuals.

California is in full conflict with the federal government as we speak. Federal agents under the moniker of ICE are attempting to detain and subsequently deport individuals they deem to be of questionable legal residence. There have been multiple examples of individuals with fully legal claims to residence in the form of green cards, student visas, or full-blown birthright citizenship who have been taken into custody by ICE and CBP agents (masked, armed, and in full military fatigues). Absent familial notification or any form of due process, there was always the question of whether a state authority would ever treat these takings of individuals as extralegal kidnappings.

Am I using inflammatory language? I’m not sure that I am. ICE and CBP officials have made strong declarations that they believe themselves to be unbeholden to court decisions, due process, or the Constitution. State and local law enforcement in California have made it clear that they will not aid ICE in any way, shape, or form, save for preventing violence in the streets, as protesters have arrived in sufficient numbers that ICE agents were effectively herded into narrow spaces and prevented from exiting with the individuals they had detained.

Just in case it is not patently obvious how I feel on the matter, the protesters are on the right side of history. The federal government is overreaching in a more gratuitous and unconstitutional manner than at any moment in the previous 40 years. This is, in terms of our federalist structure, the inverse of Jim Crow and Cruikshank. State governments are in position to defend the liberties and rights of their residents against the extralegal encroachment of federal agents. If anything, I find myself grateful that such a standoff is occurring in California, a state with the scale and resources to stand against the federal government. I know the Trump administration is threatening to “cut off” California from federal money, but that’s a strange tactic. California net loses between $71 and $83 billion per year in federal spending minus taxes paid by residents. California is the 4th largest economy in the world. California is a mess, their housing market is atrocious, they manage their forests and wildfire prevention quite poorly, but it is nonetheless the single most economically important state in the US by a cavernous margin. California can say “no” to the federal government. They may find themselves with national guard troops on their streets. They can then ask for them to be removed. They can ask ICE and CBP to leave.

This is a significant test of our federalist republic. Cruikshank served as the political fulcrum of its time by denying the federal government’s obligation to intervene and in doing so handed the power to deny basic constitutional rights to state and local governments, and the country has in many ways never wholly recovered. As we speak the federal government is taking action on behalf of the current presidential administration to deny basic constitutional rights. How a state’s ability to protect those rights against the federal government on behalf of its residents plays out may be the political fulcrum of our next 50 years.

Papers about Economists Using LLMs

  1. The most recent (published in 2025) is this piece about doing data analytics that would have been too difficult or costly before. Link and title: Deep Learning for Economists

Considering how much of frontier economics revolves around getting new data, this could be important. On the other hand, people have been doing computer-aided data mining for a while. So it’s more of a progression than a revolution, in my expectation.

2. Using LLMs to actually generate original data and/or test hypotheses like experimenters: Large language models as economic agents: what can we learn from homo silicus? and Automated Social Science: Language Models as Scientist and Subjects

3. Generative AI for Economic Research: Use Cases and Implications for Economists

Korinek has a new supplemental update as current as December 2024: LLMs Learn to Collaborate and Reason: December 2024 Update to “Generative AI for Economic Research: Use Cases and Implications for Economists,” Published in the Journal of Economic Literature 61 (4)

4. For being comprehensive and early: How to Learn and Teach Economics with Large Language Models, Including GPT

5. For giving people proof of a phenomenon that many people had noticed and wanted to discuss: ChatGPT Hallucinates Non-existent Citations: Evidence from Economics

Alert: We will soon have an update for current web-enabled models! It would seem that hallucination rates are going down but the problem is not going away.

6. This was published back in 2023. “ChatGPT ranked in the 91st percentile for Microeconomics and the 99th percentile for Macroeconomics when compared to students who take the TUCE exam at the end of their principles course.” (note the “compared to”): ChatGPT has Aced the Test of Understanding in College Economics: Now What?

References          

Buchanan, J., Hill, S., & Shapoval, O. (2023). ChatGPT Hallucinates Non-existent Citations: Evidence from Economics. The American Economist, 69(1), 80-87. https://doi.org/10.1177/05694345231218454 (Original work published 2024)

Cowen, Tyler and Tabarrok, Alexander T., How to Learn and Teach Economics with Large Language Models, Including GPT (March 17, 2023). GMU Working Paper in Economics No. 23-18, Available at SSRN: https://ssrn.com/abstract=4391863 or http://dx.doi.org/10.2139/ssrn.4391863

Dell, M. (2025). Deep Learning for Economists. Journal of Economic Literature, 63(1), 5–58. https://doi.org/10.1257/jel.20241733

Geerling, W., Mateer, G. D., Wooten, J., & Damodaran, N. (2023). ChatGPT has Aced the Test of Understanding in College Economics: Now What? The American Economist, 68(2), 233-245. https://doi.org/10.1177/05694345231169654 (Original work published 2023)

Horton, J. J. (2023). Large Language Models as Simulated Economic Agents: What Can We Learn from Homo Silicus? arXiv Preprint arXiv:2301.07543.

Korinek, A. (2023). Generative AI for Economic Research: Use Cases and Implications for Economists. Journal of Economic Literature, 61(4), 1281–1317. https://doi.org/10.1257/jel.20231736

Manning, B. S., Zhu, K., & Horton, J. J. (2024). Automated Social Science: Language Models as Scientist and Subjects (Working Paper No. 32381). National Bureau of Economic Research. https://doi.org/10.3386/w32381

Per Capita Consumption: 1990 Vs 2024

This is an update to a previous post that I did on per-capita real consumption in 1990 vs 2021. As of 2021, we still weren’t sure after the pandemic what was transitory vs structural, and it was unclear whether incomes would keep up with inflation. We now have three more years of data through 2024. News flash: We’re even richer.

I like to use the BEA real quantity indices. Those track what is actually consumed in volumes rather than by deflating total spending by price indices. Dividing by population, we can calculate the real quantities of goods and services that people actually consumed per capita.
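For anyone who wants to replicate the per-capita step, here is a minimal Python sketch of the arithmetic. The function name and the inputs in the example are hypothetical placeholders, not BEA figures; the point is only that the index is divided by population in each year before taking the percent change.

```python
def per_capita_growth(qty_index_start, qty_index_end, pop_start, pop_end):
    """Percent change in real quantity consumed per person between two years."""
    per_capita_start = qty_index_start / pop_start
    per_capita_end = qty_index_end / pop_end
    return (per_capita_end / per_capita_start - 1) * 100


if __name__ == "__main__":
    # Illustrative inputs only: a quantity index that triples while
    # population grows from 250 million to 340 million.
    growth = per_capita_growth(50.0, 150.0, 250e6, 340e6)
    print(f"per-capita real consumption change: {growth:.1f}%")
```

Because population also grows over the period, the per-capita increase is always smaller than the increase in the raw quantity index.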

Even after the pandemic policies have settled down, we are still SO MUCH RICHER – and even richer than we were with all of the pandemic-related stimulus. The worst consumption category since the pandemic has been food and beverage for off-premise consumption, and that is *up* 4.6% since 2020, increasing 31% since 1990. So, while I understand that people can’t enjoy the low prices of yesteryear, we are still better off in that category than pre-pandemic. In the other categories, everything is awesome.

Since 1990, our consumption of communication services has risen 332%, our houses are 254% better furnished, and we have 118% greater quality-adjusted clothing consumption. All of this is already adjusted for inflation and is per-capita. Since the pandemic, these numbers are still up by 20.4%, 9.8%, and 31.1% respectively. People didn’t like the post-pandemic inflation. I get that. But these improvements in average consumption are mind boggling.
