Meta AI Chief Yann LeCun Notes Limits of Large Language Models and Path Towards Artificial General Intelligence

We noted last week Meta’s successful efforts to hire away the best of the best AI scientists from other companies, by offering them insane (like $300 million) pay packages. Here we summarize and excerpt an excellent article in Newsweek by Gabriel Snyder, who interviewed Meta’s chief AI scientist, Yann LeCun. LeCun discusses some inherent limitations of today’s Large Language Models (LLMs) like ChatGPT. Their limitations stem from the fact that they are based mainly on language; it turns out that human language itself is a very constrained dataset. Language is readily manipulated by LLMs, but language alone captures only a small subset of important human thinking:

Returning to the topic of the limitations of LLMs, LeCun explains, “An LLM produces one token after another. It goes through a fixed amount of computation to produce a token, and that’s clearly System 1—it’s reactive, right? There’s no reasoning,” a reference to Daniel Kahneman’s influential framework that distinguishes between the human brain’s fast, intuitive method of thinking (System 1) and the method of slower, more deliberative reasoning (System 2).

The limitations of this approach become clear when you consider what is known as Moravec’s paradox—the observation by computer scientist and roboticist Hans Moravec in the late 1980s that it is comparatively easier to teach AI systems higher-order skills like playing chess or passing standardized tests than seemingly basic human capabilities like perception and movement. The reason, Moravec proposed, is that the skills derived from how a human body navigates the world are the product of billions of years of evolution and are so highly developed that humans perform them automatically, while neocortical-based reasoning skills came much later and require much more conscious cognitive effort to master. However, the reverse is true of machines. Simply put, we design machines to assist us in areas where we lack ability, such as physical strength or calculation.

The strange paradox of LLMs is that they have mastered the higher-order skills of language without learning any of the foundational human abilities. “We have these language systems that can pass the bar exam, can solve equations, compute integrals, but where is our domestic robot?” LeCun asks. “Where is a robot that’s as good as a cat in the physical world? We don’t think the tasks that a cat can accomplish are smart, but in fact, they are.”

This gap exists because language, for all its complexity, operates in a relatively constrained domain compared to the messy, continuous real world. “Language, it turns out, is relatively simple because it has strong statistical properties,” LeCun says. It is a low-dimensional, discrete space that is “basically a serialized version of our thoughts.”

[Bolded emphases added]

Broad human thinking involves hierarchical models of reality, which get constantly refined by experience:

And, most strikingly, LeCun points out that humans are capable of processing vastly more data than even our most data-hungry advanced AI systems. “A big LLM of today is trained on roughly 10 to the 14th power bytes of training data. It would take any of us 400,000 years to read our way through it.” That sounds like a lot, but then he points out that humans are able to take in vastly larger amounts of visual data.

Consider a 4-year-old who has been awake for 16,000 hours, LeCun suggests. “The bandwidth of the optic nerve is about one megabyte per second, give or take. Multiply that by 16,000 hours, and that’s about 10 to the 14th power in four years instead of 400,000.” This gives rise to a critical inference: “That clearly tells you we’re never going to get to human-level intelligence by just training on text. It’s never going to happen,” LeCun concludes…
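The arithmetic behind that claim is easy to verify. Here is a back-of-the-envelope check in Python, using LeCun’s own round numbers (one megabyte per second, 16,000 waking hours); nothing beyond his quoted figures is assumed:

```python
# Back-of-the-envelope check of LeCun's estimate (all inputs are his round numbers)
bytes_per_second = 1e6                 # optic nerve bandwidth, ~1 MB/s
hours_awake = 16_000                   # waking hours of a 4-year-old
visual_bytes = bytes_per_second * hours_awake * 3600
print(f"{visual_bytes:.2e} bytes")     # ~5.8e13, i.e., on the order of 10^14 bytes
```

So in four years a child’s visual system takes in roughly as many bytes as the text corpus that would take a reader 400,000 years to get through.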

This ability to apply existing knowledge to novel situations represents a profound gap between today’s AI systems and human cognition. “A 17-year-old can learn to drive a car in about 20 hours of practice, even less, largely without causing any accidents,” LeCun muses. “And we have millions of hours of training data of people driving cars, but we still don’t have self-driving cars. So that means we’re missing something really, really big.”

Like Brooks, who emphasizes the importance of embodiment and interaction with the physical world, LeCun sees intelligence as deeply connected to our ability to model and predict physical reality—something current language models simply cannot do. This perspective resonates with David Eagleman’s description of how the brain constantly runs simulations based on its “world model,” comparing predictions against sensory input. 

For LeCun, the difference lies in our mental models—internal representations of how the world works that allow us to predict consequences and plan actions accordingly. Humans develop these models through observation and interaction with the physical world from infancy. A baby learns that unsupported objects fall (gravity) after about nine months; they gradually come to understand that objects continue to exist even when out of sight (object permanence). He observes that these models are arranged hierarchically, ranging from very low-level predictions about immediate physical interactions to high-level conceptual understandings that enable long-term planning.

[Emphases added]

(Side comment: As an amateur reader of modern philosophy, I cannot help noting that these observations about the importance of recognizing there is a real external world and adjusting one’s models to match that reality call into question the epistemological claim that “we each create our own reality”.)

Given all this, developing the next generation of artificial intelligence must, like human intelligence, embed layers of working models of the world:

So, rather than continuing down the path of scaling up language models, LeCun is pioneering an alternative approach of Joint Embedding Predictive Architecture (JEPA) that aims to create representations of the physical world based on visual input. “The idea that you can train a system to understand how the world works by training it to predict what’s going to happen in a video is a very old one,” LeCun notes. “I’ve been working on this in some form for at least 20 years.”

The fundamental insight behind JEPA is that prediction shouldn’t happen in the space of raw sensory inputs but rather in an abstract representational space. When humans predict what will happen next, we don’t mentally generate pixel-perfect images of the future—we think in terms of objects, their properties and how they might interact.

This approach differs fundamentally from how language models operate. Instead of probabilistically predicting the next token in a sequence, these systems learn to represent the world at multiple levels of abstraction and to predict how their representations will evolve under different conditions.
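To make the contrast concrete, here is a minimal sketch of the joint-embedding idea in PyTorch. It is my own toy illustration of prediction in representation space, not Meta’s actual JEPA code; all layer sizes, names, and the simple MSE loss are invented for the example:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Toy "JEPA-style" setup: predict the *embedding* of a future view, not its raw pixels.
class Encoder(nn.Module):
    def __init__(self, in_dim=784, emb_dim=64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(in_dim, 256), nn.ReLU(), nn.Linear(256, emb_dim))
    def forward(self, x):
        return self.net(x)

class Predictor(nn.Module):
    def __init__(self, emb_dim=64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(emb_dim, 128), nn.ReLU(), nn.Linear(128, emb_dim))
    def forward(self, z):
        return self.net(z)

context_enc = Encoder()          # sees the "context" view (e.g., the current frame)
target_enc = Encoder()           # sees the "target" view (e.g., the next frame)
target_enc.load_state_dict(context_enc.state_dict())
for p in target_enc.parameters():
    p.requires_grad = False      # target encoder is updated by moving average, not by gradients

predictor = Predictor()
opt = torch.optim.Adam(list(context_enc.parameters()) + list(predictor.parameters()), lr=1e-3)

def train_step(context_view, target_view, ema=0.99):
    z_context = context_enc(context_view)
    with torch.no_grad():
        z_target = target_enc(target_view)              # abstract representation of what comes next
    loss = F.mse_loss(predictor(z_context), z_target)   # predict in embedding space, not pixel space
    opt.zero_grad(); loss.backward(); opt.step()
    with torch.no_grad():
        # slowly move the target encoder toward the context encoder (exponential moving average)
        for p_t, p_c in zip(target_enc.parameters(), context_enc.parameters()):
            p_t.mul_(ema).add_(p_c, alpha=1 - ema)
    return loss.item()

# Fake data standing in for two views of a video clip
context_view = torch.randn(32, 784)
target_view = torch.randn(32, 784)
print(train_step(context_view, target_view))
```

The point of the sketch is the training target: the network is graded on predicting an abstract embedding of the next view rather than on reconstructing every pixel, which is what distinguishes this family of models from next-token (or next-pixel) prediction.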

And so, LeCun is strikingly pessimistic on the outlook for breakthroughs in current LLMs like ChatGPT. He believes LLMs will be largely obsolete within five years, except for narrower purposes, and so he tells upcoming AI scientists to not even bother with them:

His belief is so strong that, at a conference last year, he advised young developers, “Don’t work on LLMs. [These models are] in the hands of large companies, there’s nothing you can bring to the table. You should work on next-gen AI systems that lift the limitations of LLMs.”

This approach seems to be at variance with that of other firms, which continue to pour tens of billions of dollars into LLMs. Meta, however, seems focused on next-generation AI, and CEO Mark Zuckerberg is putting his money where his mouth is.

Will LLMs get us the Missing Data for Solving Physics?

Tyler suggested that a “smarter” LLM could not master the unconquered intellectual territory of integrating general relativity and quantum mechanics.

Forget passing Ph.D. level qualifying exams. (j/k James) Are the AIs going to significantly surpass human efforts in generating new knowledge?

What exactly is the barrier to solving the fundamental mysteries of physics? How do we experimentally confirm that all matter breaks down to vibrating strings?

In a podcast episode of Within Reason, Brian Greene says that we can imagine an experiment that would test the proposed unifying String Theory. The Large Hadron Collider is not big enough (17 miles in circumference is too small). We would need a particle accelerator as big as a galaxy.

ChatGPT isn’t going to get us there. However, Brian Greene did suggest that there is a possibility that an advance in mathematics could get us closer to being able to work with the data we have.

Ben Yeoh summarized what he heard from Tyler et al. at a live event on how much AI will accelerate the growth of our knowledge. They warned that some areas will hit bottlenecks and therefore not advance very fast. Anything that requires clinical trials, for example, isn’t going to proceed at breakneck speed. Ben warns that “Protein folding was a rare success,” so we shouldn’t get too excited about acceleration in biotech. If advances in physics require bigger and better physical tools to do more advanced experimental observations, then new AI might not get us far.

However, one of the categories that made Yeoh’s list of where new AI might accelerate progress is “mathematics,” because developing new theories does not face the same kind of physical constraints.

So, to the extent that testing String Theory is a capital-intensive endeavor, we are unlikely to obtain new definitive tests of it. If AI is to settle this empirical question in my lifetime, the most plausible scenario is that advances in mathematics reduce our reliance on new observational data.

Related links:
my article for the Gospel Coalition – We are not “building God,” despite some claims.
my article for EconLog – AI will be constrained by the same problem that David Hume faced. AI can predict what is likely to occur in the future based on what it has observed in the past.

“The big upward trend in Generative AI/LLM tool use in 2025 continues but may be slowing.” Have we reached a plateau, at least temporarily? Have we already experienced the big upswing in productivity, and it’s going to level out now? At least programming will be less painful forever after?

“LLM Hallucination of Citations in Economics Persists with Web-Enabled Models” I realize that, as of today, you can pay for yet-better models than what we tested. But if web-enabled 4o can’t cite Krugman properly, you do wonder if “6o” will be integrating general relativity and quantum mechanics. A slightly longer context window probably isn’t going to do it.

Writing Humanity’s Last Exam

When every frontier AI model can pass your tests, how do you figure out which model is best? You write a harder test.

That was the idea behind Humanity’s Last Exam, an effort by Scale AI and the Center for AI Safety to develop a large database of PhD-level questions that the best AI models still get wrong.

The effort has proven popular; the paper summarizing it has already been cited 91 times since its release on March 31st, and the main AI labs have been testing their new models on the exam. xAI announced today that its new Grok 4 model has the highest score yet on the exam, 44.4%.

Current leaderboard on the Humanity’s Last Exam site, not yet showing Grok 4

The process of creating the dataset is a fascinating example of a distributed academic mega-project, something that is becoming a trend that has also been important in efforts to replicate previous research. The organizers of Humanity’s Last Exam let anyone submit a question for their dataset, offering co-authorship to anyone whose question they accepted, and cash prizes to those who had the best questions accepted. In the end they wound up with just over 1000 coauthors on the paper (including yours truly as one very minor contributor), and gave out $500,000 to contributors of the very best questions (not me), which seemed incredibly generous until Scale AI sold a 49% stake in their company to Meta for $14.8 billion in June.

Source: Figure 4 of the paper

Here’s what I learned in the process of trying to stump the AIs and get questions accepted into this dataset:

  1. The AIs were harder to stump than I expected, because the exam used frontier models rather than the free-tier models I was used to on my own. If you think AI can’t answer your question, try a newer model.
  2. It was common for me to try a question that several models would get wrong, but at least one would still get right. For me this was annoying, because questions could only be accepted if every model got them wrong. But of course if you want a correct answer, this means trying more models is good, even if they are all in the same tier. If you can’t tell what a correct answer looks like and your question is important, make sure to try several models and see if they give different answers.
  3. Top models are now quite good at interpreting regression results, even when you try to give them unusually tricky tables.
  4. AI still has weird weaknesses and blind spots; it can outperform PhDs in the relevant field on one question, then do worse than 3rd graders on the next. This exam specifically wanted PhD-level questions, where a typical undergrad not only couldn’t answer the question, but probably couldn’t even understand what was being asked. But it specifically excluded “simple trick questions”, “straightforward calculation/computation questions”, and questions “easily answerable by everyday people”, even if all the AIs got them wrong. My son had the idea to ask them to calculate hyperfactorials (a quick sketch of the calculation follows this list); we found some relatively low numbers that stumped all the AI models, but the human judges ruled that our question was too simple to count. On a question I did get accepted, I included an explanation for the human judges of why I thought it wasn’t too simple.
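For reference, the hyperfactorial mentioned in item 4 is H(n) = 1^1 * 2^2 * ... * n^n. A few lines of Python show how quickly it blows up; this is my own illustration of the calculation, not one of the exam questions:

```python
def hyperfactorial(n: int) -> int:
    """H(n) = 1^1 * 2^2 * ... * n^n, computed exactly with Python integers."""
    result = 1
    for k in range(1, n + 1):
        result *= k ** k
    return result

for n in (3, 5, 8):
    print(n, hyperfactorial(n))
# H(3) = 108, H(5) = 86,400,000, and H(8) already runs to 26 digits.
```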

I found this to be a great opportunity to observe the strengths and weaknesses of frontier models, and to get my name on an important paper. While the AI field is being driven primarily by the people with the chops to code frontier models, economists still have a lot we can contribute here, as Joy has shown. Any economist looking for the next way to contribute here should check out Anthropic’s new Economic Futures Program.

Hallucination as a User Error

You don’t use a flat head screwdriver to drill a hole in a board. You should know to use a drill.

I appreciate getting feedback on our manuscript, “LLM Hallucination of Citations in Economics Persists with Web-Enabled Models,” via X/Twitter. @_jannalulu wrote: “that paper only tested 4o (which arguably is a bad enough model that i almost never use it).”

Since the scope and frequency of hallucinations came as a surprise to many LLM users, they have often been used as a ‘gotcha’ to criticize AI optimists. People, myself included, have sounded the alarm that hallucinations could infiltrate articles, emails, and medical diagnoses.

The feedback I got from power users on Twitter this week made me think that there might be a cultural shift in the medium term. (Yes, we are always looking for someone to blame.) Hallucinations will be considered the fault of the human user who should have:

  1. Used a better model (learn your tools)
  2. Written a better prompt (learn how to use your tools)
  3. Not assigned the task to an LLM in the first place (it’s been known for over two years that general LLM models hallucinate citations). What did you expect from “generative” AI? LLMs are telling you what literature ought to exist as opposed to what does exist.

My Perfunctory Intern

A couple years ago, my Co-blogger Mike described his productive, but novice intern. The helper could summarize expert opinion, but they had no real understanding of their own. To boot, they were fast and tireless. Of course, he was talking about ChatGPT. Joy has also written in multiple places about the errors made by ChatGPT, including fake citations.

I use ChatGPT Pro, which has Web access, and my experience is that it is not so tireless. Much like Mike, I have used ChatGPT to help me write Python code. I know the basics of Python, and how to read a lot of it. However, the multitude of methods and possible arguments are not nestled firmly in my skull. I’m much faster at reading Python code than at writing it. Therefore, ChatGPT has been amazing… Mostly.

I have found that ChatGPT is more like an intern than many suppose.


Counting Hallucinations by Web-Enabled LLMs

In 2023, we gathered the data for what became “ChatGPT Hallucinates Nonexistent Citations: Evidence from Economics.” Since then, LLM use has increased. A 2025 survey from Elon University estimates that half of Americans now use LLMs. In the Spring of 2025, we used the same prompts, based on the JEL categories, to obtain a comprehensive set of responses from LLMs about topics in economics.

Our new report on the state of citations is available at SSRN: “LLM Hallucination of Citations in Economics Persists with Web-Enabled Models.”

What did we find? Would you expect the models to have improved since 2023? LLMs have gotten better and are passing ever more of what used to be considered difficult tests. (Remember the Turing Test? Anyone?) ChatGPT can pass the bar exam for new lawyers. And yet, if you ask ChatGPT to write a document in the capacity of a lawyer, it will keep making the mistake of hallucinating fake references. Hence, we keep seeing headlines like, “A Utah lawyer was punished for filing a brief with ‘fake precedent’ made up by artificial intelligence.”

What we call GPT-4o WS (Web Search) in the figure below was queried in April 2025. This “web-enabled” language model is enhanced with real-time internet access, allowing it to retrieve up-to-date information rather than relying solely on static training data. This means it can answer questions about current events, verify facts, and provide live data—something traditional models, which are limited to their last training cutoff, cannot do. While standard models generate responses based on patterns learned from past data, web-enabled models can supplement that with fresh, sourced content from the web, improving accuracy for time-sensitive or niche topics.

At least one third of the references provided by GPT-4o WS were not real! Performance has not significantly improved to the point where AI can write our papers with properly incorporated attribution of ideas. We also found that the web-enabled model would pull from lower quality sources like Investopedia even when we explicitly stated in the prompt, “include citations from published papers. Provide the citations in a separate list, with author, year in parentheses, and journal for each citation.” Even some of the sources that were not journal articles were cited incorrectly. We provide specific examples in our paper.
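For readers who want to spot-check a reference themselves, one simple approach is to search the citation string against a bibliographic database and see whether anything plausible comes back. The sketch below uses the public Crossref REST API; it is my own illustration, not the verification procedure used in our paper:

```python
import requests

def crossref_lookup(citation: str, rows: int = 3):
    """Search the public Crossref API for works matching a free-form citation string."""
    resp = requests.get(
        "https://api.crossref.org/works",
        params={"query.bibliographic": citation, "rows": rows},
        timeout=30,
    )
    resp.raise_for_status()
    return [
        {
            "title": (item.get("title") or ["(no title)"])[0],
            "journal": (item.get("container-title") or [""])[0],
            "doi": item.get("DOI"),
        }
        for item in resp.json()["message"]["items"]
    ]

# A reference that really exists should come back with a matching title and DOI;
# a hallucinated one typically returns only loosely related works.
for match in crossref_lookup("Krugman (1991), Increasing Returns and Economic Geography, Journal of Political Economy"):
    print(match)
```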

In closing, consider this quote from an interview with Jack Clark, co-founder of Anthropic:

The best they had was a 60 percent success rate. If I have my baby, and I give her a robot butler that has a 60 percent accuracy rate at holding things, including the baby, I’m not buying the butler.

Excluding “Non-Excludable” Goods

Intro microeconomics classes teach that some goods are “non-excludable”, meaning that people who don’t pay for them can’t be stopped from using them. This can lead to a “tragedy of the commons”, where the good gets overused because users don’t personally bear the full cost of using it and don’t care about the costs they impose on others. Overgrazing land and overfishing the seas are classic examples.

Source: Microeconomics, by Michael Parkin
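To see why “not bearing the full cost” produces overuse, consider a stylized open-access harvest model. This is a textbook-style illustration of my own (the functional form and parameters are made up, not taken from Parkin): total catch value is a*E - b*E^2 in total effort E, each boat earns the average catch per unit of effort, and effort costs c per unit. Boats keep entering until the average product equals the cost of effort, while the efficient level stops where the marginal product equals the cost:

```python
# Stylized open-access fishery (illustrative parameters, not calibrated to anything)
a, b, c = 10.0, 0.5, 2.0           # catch value = a*E - b*E**2, cost per unit of effort = c

E_open_access = (a - c) / b         # entry until average product (a - b*E) equals c
E_efficient = (a - c) / (2 * b)     # planner sets marginal product (a - 2*b*E) equal to c

def net_value(E):
    return a * E - b * E**2 - c * E     # total catch value minus total effort cost

print(E_open_access, net_value(E_open_access))   # effort 16.0, net value 0: rents fully dissipated
print(E_efficient, net_value(E_efficient))       # effort 8.0, net value 32: the "excludable" outcome
```

In the open-access equilibrium, effort is exactly twice the efficient level and all the resource rents are competed away; that is the overuse the textbook diagram is getting at.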

Students sometimes get the impression that “excludability” is an inherent property of a good. But in fact, which goods are excludable is a function of laws, customs, and technologies, and these can change over time. Land might be legally non-excludable (and so over-grazed) when it is held in common, but become excludable when the land is privatized or when barbed wire makes enclosing it cheap. Over time, such changes have turned over-grazing into a relatively minor issue.

Overfishing remains a major problem, but this could be starting to change. Legal and technological changes have allowed for enclosed, private aquaculture on some coasts, which provides a large and growing share of all fish eaten by humans. Permitting systems put limits on catches in many countries’ waters, though the high seas remain a true tragedy of the commons for now.

While countries have tried to enforce limits on catches in their national waters, monitoring how many fish every boat is taking has been challenging, so illegal overfishing has remained widespread. But technology is in the process of changing this. For instance, ThayerMahan is developing hydrophone arrays that use sound to track boats.

Technologies like hydrophones and satellites, if used well, will increasingly make public waters more “excludable” and reduce “tragedy of the commons” overfishing.

Saving Money by Ordering Car Parts from Amazon or eBay

Here is a personal economics anecdote from this week. A medium-sized dead branch fell from a tall tree and ripped off the driver-side mirror on my old Honda. My local repair shop said it would cost around $600 to replace it. That is a significant percentage of what the old clunker is worth. Ouch.

They kindly noted that most of that cost was for ordering a replacement mirror assembly from Honda, which would cost over $400 and take several days to arrive. I asked if I could try to get a mirror from a junkyard, to save money. The repair guy said they would be willing to install a part I brought in, but suggested eBay or Amazon instead.

Back 20 years ago, before online commerce was so established, my local repair shop would routinely save us money by getting used parts from some sort of junkyard network.
So, I started looking into that route. First, junkyards are not junkyards anymore; they are “salvage yards.” Second, it turns out that removing a side mirror from a Honda is not a simple matter. You have to remove the whole inside plastic door panel to get at the mirror mounting screws, and removing that panel has some complications. Also, I could not find a clear online resource for locating parts at regional salvage yards. It looks like you have to drive to a salvage yard, and perhaps have them search some sort of database to find a comparable vehicle somewhere that might have the part you want.


All this seemed like a lot of hassle, so I went to eBay, and found a promising-looking new replacement part there for about $56, including shipping. It would take about a week to get here (probably being direct-shipped from China). On Amazon, I found essentially the same part for about $63, and it would get here the next day. For the small difference in price, I went the Amazon route, partly for the no-hassle returns if the part turned out to be defective and partly because I get 5% back on my Amazon credit card there.
I just got the car back from the repair shop with the replacement mirror, and it works fine. The total cost, with labor, was about $230, which is much better than the original $600+ estimate.


I’m not sure how broadly to generalize this experience. Some further observations:

( 1 ) For a really critical car part, I’d have to consider carefully whether the Chinese knock-off would perform appreciably worse than some name-brand part, although I believe many repair shops often use parts that are not strictly original parts.

( 2 ) Commonly replaced parts like oil and air filters are typically cheaper to buy on-line than from your local Auto Zone or other local merchant. I like supporting local shops, so sometimes I eat the few extra $$ and shopping time, and buy from bricks and mortar.

( 3 ) Some repair shops make significant money on their markup on parts, and so they might not be happy about you bringing in your own parts. They also might decline to warrant the operation of that part. And many big box franchise repair shops may simply refuse to install customer-supplied parts.

( 4 ) For a newish car, still under warranty, the manufacturer warranty might be affected by using non-original parts.

( 5 ) Back to junk/salvage yards: there are some car parts, so-called hard parts, that are expected to last the life of the car. Things like the mounting brackets for engine parts. Typically, no spares of these are manufactured. So, if one of those parts gets dinged up in an accident, your only option may be used parts taken from a junker.

Illusions of Illusions of Reasoning

Just since Scott’s post on Tuesday of this week, a new response has been launched, titled “The Illusion of the Illusion of the Illusion of Thinking.”

Abstract (emphasis added by me): A recent paper by Shojaee et al. (2025), The Illusion of Thinking, presented evidence of an “accuracy collapse” in Large Reasoning Models (LRMs), suggesting fundamental limitations in their reasoning capabilities when faced with planning puzzles of increasing complexity. A compelling critique by Opus and Lawsen (2025), The Illusion of the Illusion of Thinking, argued these findings are not evidence of reasoning failure but rather artifacts of flawed experimental design, such as token limits and the use of unsolvable problems. This paper provides a tertiary analysis, arguing that while Opus and Lawsen correctly identify critical methodological flaws that invalidate the most severe claims of the original paper, their own counter-evidence and conclusions may oversimplify the nature of model limitations. By shifting the evaluation from sequential execution to algorithmic generation, their work illuminates a different, albeit important, capability. We conclude that the original “collapse” was indeed an illusion created by experimental constraints, but that Shojaee et al.’s underlying observations hint at a more subtle, yet real, challenge for LRMs: a brittleness in sustained, high-fidelity, step-by-step execution. The true illusion is the belief that any single evaluation paradigm can definitively distinguish between reasoning, knowledge retrieval, and pattern execution.

As I am writing a new manuscript about hallucination by web-enabled models, this is close to what I am working on. Conjuring up fake academic references might point to a lack of true reasoning ability.

Do Pro and Dantas believe that LLMs can reason? What they are saying, at least, is that evaluating AI reasoning is difficult. In their words, the whole back-and-forth “highlights a key challenge in evaluation: distinguishing true, generalizable reasoning from sophisticated pattern matching of familiar problems…”

The fact that the first sentence of the paper contains the bigram “true reasoning” is interesting in itself. No one doubts that LLMs are reasoning anymore, at least within their own sandboxes. Hence there have been Champagne jokes going around of this sort:

If you’d like to read a response coming from o3 itself, Tyler pointed me to this:

Did Apple’s Recent “Illusion of Thinking” Study Expose Fatal Shortcomings in Using LLMs for Artificial General Intelligence?

Researchers at Apple last week published a paper with the provocative title, “The Illusion of Thinking: Understanding the Strengths and Limitations of Reasoning Models via the Lens of Problem Complexity.” This paper has generated uproar in the AI world. Having “The Illusion of Thinking” right there in the title is pretty in-your-face.

Traditional Large Language Model (LLM) artificial intelligence programs like ChatGPT train on massive amounts of human-generated text to be able to mimic human outputs when given prompts. A recent trend (mainly starting in 2024) has been the incorporation of more formal reasoning capabilities into these models. The enhanced models are termed Large Reasoning Models (LRMs). Now some leading LLMs like OpenAI’s GPT, Claude, and the Chinese DeepSeek exist both in regular LLM form and also as LRM versions.

The authors applied both the regular (LLM) and “thinking” (LRM) versions of Claude 3.7 Sonnet and DeepSeek to a number of mathematical-type puzzles. OpenAI’s o-series models were used to a lesser extent. An advantage of these puzzles is that researchers can, while keeping the basic form of the puzzle, dial in more or less complexity.

They found, among other things, that the LRMs did well up to a certain point, then suffered “complete collapse” as complexity was increased. Also, at low complexities, LLMs actually outperform LRMs. And (perhaps the most vivid evidence of lack of actual understanding on the part of these programs), when they were explicitly offered an efficient direct solution algorithm in the prompt, the programs did not take advantage of it, but instead just kept grinding away in their usual fashion.
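For one of those puzzles, Tower of Hanoi, the efficient direct solution algorithm fits in a few lines. The sketch below is my own rendering of the standard recursion, not the paper’s exact prompt text; the point of the Apple experiment was that the models often failed to exploit an algorithm like this even when it was handed to them:

```python
def hanoi(n, source="A", target="C", spare="B", moves=None):
    """Standard recursive Tower of Hanoi: returns the full move list for n disks."""
    if moves is None:
        moves = []
    if n == 0:
        return moves
    hanoi(n - 1, source, spare, target, moves)   # move the n-1 smaller disks out of the way
    moves.append((source, target))               # move the largest disk
    hanoi(n - 1, spare, target, source, moves)   # stack the n-1 smaller disks on top of it
    return moves

print(len(hanoi(7)))    # 127 moves (2^n - 1)
print(len(hanoi(15)))   # 32,767 moves for the 15-disk case that comes up in the rebuttal below
```

The exponential move count also shows why output limits matter in these experiments: spelling out every move of a large instance can exhaust a model’s token budget even if it “knows” the algorithm, which is one of the artifacts Lawsen’s rebuttal (discussed below) points to.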

As might be expected, AI skeptics were all over the blogosphere, saying, in effect, I told you so: LLMs are just massive exercises in pattern matching and cannot extrapolate outside of their training set. This has massive implications for what we can expect in the near or intermediate future. Among other things, optimism about AI progress is largely what is fueling the stock market, and also capital investment in this area. Companies like Meta and Google are spending ginormous sums trying to develop artificial “general” intelligence, paying for vast amounts of compute power, with those dollars flowing to firms like Microsoft and Amazon building out data centers and buying chips from Nvidia. If the AGI emperor has no clothes, all this spending might come to a screeching halt.

Ars Technica published a fairly balanced account of the controversy, concluding that, “Even elaborate pattern-matching machines can be useful in performing labor-saving tasks for the people that use them… especially for coding and brainstorming and writing.”

Comments on this article included one like this:

LLMs do not even know what the task is, all it knows is statistical relationships between words.   I feel like I am going insane. An entire industry’s worth of engineers and scientists are desperate to convince themselves a fancy Markov chain trained on all known human texts is actually thinking through problems and not just rolling the dice on what words it can link together.

And

if we equate combinatorial play and pattern matching with genuinely “generative/general” intelligence, then we’re missing a key fact here. What’s missing from all the LLM hubris and enthusiasm is a reflexive consciousness of the limits of language, of the aspects of experience that exceed its reach and are also, paradoxically, the source of its actual innovations. [This is profound, he means that mere words, even billions of them, cannot capture some key aspects of human experience]

However, the AI bulls have mounted various comebacks to the Apple paper. The most effective I know of so far was published by Alex Lawsen, a researcher at Open Philanthropy. Lawsen’s rebuttal, titled “The Illusion of the Illusion of Thinking,” was summarized by Marcus Mendes. To summarize the summary, Lawsen claimed that the models did not in general “collapse” in some crazy way. Rather, the models in many cases recognized that they would not be able to solve the puzzles given the constraints input by the Apple researchers. Therefore, they (rather intelligently) did not try to waste compute power by grinding away at a necessarily incomplete solution, but just stopped. Lawsen further showed that the ways Apple ran the LRM models did not allow them to perform as well as they could. When he made a modest, reasonable change in the operation of the LRMs,

Models like Claude, Gemini, and OpenAI’s o3 had no trouble producing algorithmically correct solutions for 15-disk Hanoi problems, far beyond the complexity where Apple reported zero success.

Lawsen’s conclusion: When you remove artificial output constraints, LRMs seem perfectly capable of reasoning about high-complexity tasks. At least in terms of algorithm generation.

And so, the great debate over the prospects of artificial general intelligence will continue.