How ChatGPT works from geography and Stephen Wolfram

By now, everyone should consider using ChatGPT and be familiar with how it works. I’m going to highlight resources for that.

My paper about how ChatGPT generates academic citations should be useful to academics as a way to quickly grasp the strengths and weaknesses of ChatGPT. ChatGPT often works well, but sometimes fails. It’s important to anticipate how it fails. Our paper is so short and simple that your undergraduates could read it before using ChatGPT for their writing assignments.

A paper that does this in a different domain is “GPT4GEO: How a Language Model Sees the World’s Geography.” (Again, consider showing it to your undergrads because of the neat pictures, but probably walk through it together in class instead of assigning it as reading.) They describe their project: “To characterise what GPT-4 knows about the world, we devise a set of progressively more challenging experiments…”

For example, they asked ChatGPT about the populations of countries and found that: “For populations, GPT-4 performs relatively well with a mean relative error (MRE) of 3.61%. However, significantly higher errors [occur] … for less populated countries.”
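For readers who have not seen the metric before, here is a minimal sketch of how a mean relative error can be computed. It assumes the standard definition (the average of |estimate − truth| / |truth|); the function name and the population figures are made up for illustration and are not taken from the GPT4GEO paper.

```python
def mean_relative_error(predicted, actual):
    """Average of |predicted - actual| / |actual| over paired values."""
    if len(predicted) != len(actual) or not actual:
        raise ValueError("need two non-empty lists of equal length")
    return sum(abs(p - a) / abs(a) for p, a in zip(predicted, actual)) / len(actual)


# Hypothetical numbers for illustration only (not from the paper).
actual_population = [1_000_000, 50_000_000, 200_000]
model_estimate = [1_020_000, 49_000_000, 230_000]

mre = mean_relative_error(model_estimate, actual_population)
print(f"MRE = {mre:.2%}")
```

In this toy data the smallest "country" has the largest relative miss, so it pulls the average up. That is the same pattern the authors describe for less populated countries.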

ChatGPT will often say SOMETHING, if prompted correctly. It is often, at least slightly, wrong. This graph shows that most estimates of national populations were not correct and the performance was worse on countries that are less well-known. That’s exactly what we found in our paper on citations. We found that very famous books are often cited correctly, because ChatGPT is mimicking other documents that correctly cite those books. However, if there are not many documents to train on, then ChatGPT will make things up.

I love this figure from the geography paper showing how ChatGPT estimates the elevations of mountains. This visual should be all over Twitter.

There are 3 lines because they did the prompt three times. ChatGPT threw out three different wrong mountains. Is that kind of work good enough for your tasks? Often it is. The shaded area in the graph is the actual topography of the earth in those places. ChatGPT “knows” that this area of the world is a mountain. But it will just put out incorrect estimates of the exact elevation, instead of stating that it does not know the exact elevation of those areas of the world.

Another free (long, advanced) resource with great pictures is Stephen Wolfram’s 2023 blog article “What Is ChatGPT Doing … and Why Does It Work?” (YouTube version)

The first thing to explain is that what ChatGPT is always fundamentally trying to do is to produce a “reasonable continuation” of whatever text it’s got so far, where by “reasonable” we mean “what one might expect someone to write after seeing what people have written on billions of webpages, etc.”
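To make “reasonable continuation” concrete, here is a toy sketch. It is not how ChatGPT is implemented (ChatGPT uses a large neural network); it only counts which word follows which in a tiny made-up corpus and then samples the next word in proportion to those counts. The corpus and function names are my own assumptions for illustration.

```python
import random
from collections import Counter, defaultdict


def build_bigram_counts(text):
    """Count how often each word is followed by each other word."""
    words = text.lower().split()
    counts = defaultdict(Counter)
    for current, following in zip(words, words[1:]):
        counts[current][following] += 1
    return counts


def continue_text(counts, start, length=5, seed=0):
    """Extend `start` by sampling each next word in proportion to its count."""
    rng = random.Random(seed)
    words = [start]
    for _ in range(length):
        options = counts.get(words[-1])
        if not options:
            break  # the last word was never followed by anything in the corpus
        candidates = list(options.keys())
        weights = list(options.values())
        words.append(rng.choices(candidates, weights=weights, k=1)[0])
    return " ".join(words)


# Tiny made-up corpus standing in for "billions of webpages".
corpus = "the cat sat on the mat and the cat slept on the mat"
counts = build_bigram_counts(corpus)
print(continue_text(counts, "the"))
```

The output always looks plausible locally, because every pair of adjacent words was seen in the corpus, but nothing checks whether the sentence as a whole is true. That is a small-scale version of the failure mode discussed above.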

If you feel like you already are proficient with using ChatGPT, then I would recommend Wolfram’s blog because you will learn a lot about math and computers.

Scott wrote “Generative AI Nano-Tutorial” here, which has the advantage of being much shorter than Wolfram’s blog.

EDIT: New 2023 overview paper (link from Lenny): “A Survey of Large Language Models”

Using Phind for Academic References

There is a new generative AI product called Phind. Does it make up fake citations like ChatGPT? I will report on my experimentation with the free-tier product on January 7, 2024.

Phind is an intelligent search engine and assistant for programmers.

https://www.phind.com/about

First, I will run a test by using one of the prompts from our paper on ChatGPT errors (read my paper on hallucinated academic citations or the blog summary).

Prompt: “In less than 10 sentences, explain how the ideas of Hayek are connected to bailing out failed banks, and include citations from published papers as a list. include author, year in parentheses, and journal for the citations.”

Friedrich Hayek, a prominent Austrian economist, advocated for a free market system where businesses and individuals make decisions based on individual knowledge and self-interest, rather than central planning by government or bureaucrats. His ideas suggest that interventions, such as bailing out failed banks, should be minimal and temporary, allowing the market to correct itself.

However, the financial crisis of 2008, including the failure of many banks, led to significant government intervention, including the bailout of banks. This action can be seen as a departure from Hayek’s principles, as it involves direct government intervention to stabilize the financial system.

According to a paper by Todd Zywicki (2009), the Federal Reserve’s artificial lowering of short-term interest rates and the resulting shift by consumers to adjustable rate mortgages (ARMs) triggered the housing bubble and subsequent crisis [Source 1](https://reason.org/wp-content/uploads/files/federal_reserve_monetary_policy_hayek_rule.pdf). This aligns with Hayek’s critique of central banking’s manipulation of interest rates.

Moreover, the Basel Accords, international capital adequacy and liquidity requirements for banks, were designed to encourage banks to hold securitized loans in their portfolios, thereby concentrating risk and leveraging up, thereby increasing their exposure to risk. This approach, contrary to Hayek’s preference for decentralization and competition, may have contributed to the financial crisis [Source 1](https://reason.org/wp-content/uploads/files/federal_reserve_monetary_policy_hayek_rule.pdf). In conclusion, while Hayek’s ideas of minimal government intervention and self-regulating markets provide a useful theoretical framework, practical implementation in complex and interconnected modern economies may require different strategies, including bank bailouts.

Are the citations real and correct? Phind simply is not trying to write with in-text citations and academic references. There is a mention of “Todd Zywicki (2009)” and also a weblink. The weblink is to a policy paper published by Reason. This policy paper is on the correct topic and the paper does exist. The Zywicki citation also exists, in the sense that if you go into the Reason paper it cites a real Wall Street Journal op-ed. So, if you follow the trail, you can see what document Phind learned to cite Zywicki from. It seems like it is stealing the citation more than reading the source documents. To be fair, sometimes human academics work that way, too.

What I was hoping for was in-text citations with a compiled bibliography at the end. Phind did not give me that.

I’ll try another test that I ran on ChatGPT and Bing in 2023.

The prompt is “In what paper did John Duffy propose that economists use experiments to study equilibrium selection?”  

Phind does not attempt an in-text citation in the response. The link in the paragraph (link number 2) points to John Duffy’s list of papers, which is better than just making up fake citations but also does not create a literature review for me. Phind is, as with the Hayek test above, providing breadcrumbs of links through which I can discover existent papers.

Is there a paper called “The Transition from Stagnation to Growth: An Adaptive Learning Approach”? Yes. And it is by Duffy.

Phind lists weblinks to sources. Has Phind done more for me than Google, on this search? Not much, in terms of finding and synthesizing references.


Do People Trust ChatGPT Writing?

My new working paper with Will Hickman is up on SSRN: Do People Trust Humans More Than ChatGPT?

We study whether people will pay for a fact-check on AI writing. ChatGPT can be very useful, but human readers should not trust every fact that it reports. Yesterday’s post was about ChatGPT writing false things that look real.

The reason participants in our experiment might pay for a fact-check is that they earn bonus payments based on whether they correctly identify errors in a paragraph. If participants believe that the paragraph does not contain any errors, they should not pay for a fact-check. However, if they have doubts, it is rational to pay for a fact-check and earn a smaller bonus, for certain.
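The logic of that choice can be sketched as a simple expected-payoff comparison. This is a minimal illustration assuming a risk-neutral participant; the bonus size, the fact-check price, and the function name are hypothetical and are not the parameters used in the experiment.

```python
def should_buy_fact_check(p_correct, bonus, check_cost):
    """Return True if the certain payoff with a fact-check beats guessing.

    p_correct  : participant's belief that their own rating is right (0 to 1)
    bonus      : payment for a correct rating (hypothetical amount)
    check_cost : price of the fact-check (hypothetical amount)
    """
    expected_without = p_correct * bonus  # gamble on own judgment
    certain_with = bonus - check_cost     # smaller bonus, but for certain
    return certain_with > expected_without


# Hypothetical stakes: a 1.00 bonus and a 0.25 fact-check fee.
print(should_buy_fact_check(0.9, 1.00, 0.25))  # confident reader
print(should_buy_fact_check(0.6, 1.00, 0.25))  # doubtful reader
```

Under these assumed numbers, the confident reader keeps the money and the doubtful reader pays for the check, which is why the rate of fact-checking can be read as a measure of doubt.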

Abstract: We explore whether people trust the accuracy of statements produced by large language models (LLMs) versus those written by humans. While LLMs have showcased impressive capabilities in generating text, concerns have been raised regarding the potential for misinformation, bias, or false responses. In this experiment, participants rate the accuracy of statements under different information conditions. Participants who are not explicitly informed of authorship tend to trust statements they believe are human-written more than those attributed to ChatGPT. However, when informed about authorship, participants show equal skepticism towards both human and AI writers. There is an increase in the rate of costly fact-checking by participants who are explicitly informed. These outcomes suggest that trust in AI-generated content is context-dependent.

Our original hypothesis was that people would be more trusting of human writers. That turned out to be only partially true. Participants who are not explicitly informed of authorship tend to trust statements they believe are human-written more than those attributed to ChatGPT.

We presented information to participants in different ways. Sometimes we explicitly told them about authorship (informed treatment) and sometimes we asked them to guess about authorship (uninformed treatment).

This graph (figure 5 in our paper) shows that the overall rate of fact-checking increased when subjects were given more explicit information. Something about being told that a paragraph was written by a human might have aroused suspicion in our participants. (The kids today would say it is “sus.”) They became less confident in their own ability to rate accuracy and therefore more willing to pay for a fact-check. This effect is independent of whether participants trust humans more than AI.

We are thinking of fact-checking as often a good thing, in the context of our previous work on ChatGPT hallucinations. So, one policy implication is that certain types of labels can cause readers to think critically. For example, Twitter labels automated accounts so that readers know when content has been chosen or created by a bot.

Our working paper is currently trending on SSRN top ten lists such as this one.

Suggested Citation:
Buchanan, Joy and Hickman, William, Do People Trust Humans More Than ChatGPT? (November 16, 2023). GMU Working Paper in Economics No. 23-38, Available at SSRN: https://ssrn.com/abstract=4635674

GPT-4 Generates Fake Citations

I am happy to share my latest publication at The American Economist: ChatGPT Hallucinates Non-existent Citations: Evidence from Economics

Citation: Buchanan, J., Hill, S., & Shapoval, O. (2024). ChatGPT Hallucinates Non-existent Citations: Evidence from Economics. The American Economist. 69(1), 80-87  https://doi.org/10.1177/05694345231218454

Blog followers will know that we reported this issue earlier with the free version of ChatGPT using GPT-3.5 (covered in the WSJ). We have updated this new article by running the same prompts through the paid version using GPT-4. Did the problems go away with the more powerful LLM?

The error rate went down slightly, but our two main results held up. It’s important that any fake citations at all are being presented as real. The proportion of nonexistent citations was over 30% with GPT-3.5, and it is over 20% with our trial of GPT-4 several months later. See figure 2 from our paper below for the average accuracy rates. The proportion of real citations is always under 90%. GPT-4, when asked about a very specific narrow topic, hallucinates almost half of the citations (57% are real for level 3, as shown in the graph).

The second result from our study is that the error rate of the LLM increases significantly when the prompt is more specific. If you ask GPT-4 about a niche topic for which there is less training data, then a higher proportion of the citations it produces are false. (This has been replicated in different domains, such as knowledge of geography.)

What does Joy Buchanan really think?: I expect that this problem with the fake citations will be solved quickly. It’s very brazen. When people understand this problem, they are shocked. Just… fake citations? Like… it printed out references for papers that do not actually exist? Yes, it really did that. We were the only ones who quantified and reported it, but the phenomenon was noticed by millions of researchers around the world who experimented with ChatGPT in 2023. These errors are so easy to catch that I expect ChatGPT will clean up its own mess on this particular issue quickly. However, that does not mean that the more general issue of hallucinations is going away.

Not only can ChatGPT make mistakes, as any human worker can mess up, but it can make a different kind of mistake without meaning to. Hallucinations are not intentional lies (which is not to say that an LLM cannot lie). This paper will serve as bright clear evidence that GPT can hallucinate in ways that detract from the quality of the output or even pose safety concerns in some use cases. This generalizes far beyond academic citations. The error rate might decrease to the point where hallucinations are less of a problem than the errors that humans are prone to make; however, the errors made by LLMs will always be of a different quality than the errors made by a human. A human research assistant would not cite nonexistent citations. LLM doctors are going to make a type of mistake that would not be made by human doctors. We should be on the lookout for those mistakes.

ChatGPT is great for some of the inputs to research, but it is not as helpful for original scientific writing. As prolific writer Noah Smith says, “I still can’t use ChatGPT for writing, even with GPT-4, because the risk of inserting even a small number of fake facts…”

Follow-Up Research: Will Hickman and I have an incentivized experiment on trust that you can read on SSRN: Do People Trust Humans More Than ChatGPT?

@IMurtazashvili has pointed me to a great resource for AI-era literature review work. “AI-Based Literature Review Tools” from Texas A&M

Axios Survey of Americans on AI Regulation

Axios just surveyed over 2,000 U.S. adults to find that “Americans rank the importance of regulating AI below government shutdowns, health care, gun reform…” Without pressure from the public to pass new legislation, Congress might do nothing for now, which will lead to the rapid advance of LLM chatbots into life and work.

The participants seem more worried about AI taking jobs than they are excited about AI making life better. There is some concern about misinformation.** So, they don’t think AI will have no impact on society, but they also don’t see enacting regulation as a top priority for government.

At my university, the policy realm I know best, we will probably not be “regulating” AI. We have had task forces talking about it, so it’s not because no one has considered it.

The Axios poll revealed gender gaps in attitudes toward AI. Women said they would be less permissive about kids using AI than men. Also, “Parents in urban areas were far more open to their children using AI than parents in the suburbs or rural areas.” Despite those gaps in what people say, I expect that the gaps in what their children are currently doing with technology are smaller. Experimental economists are suspicious of self-reported data.

**Our results did not change much when we ran the exact same prompts through GPT-4. A version of my paper on AI errors that I blogged about before is up on SSRN, but a new manuscript incorporating GPT-4 is under review: Buchanan, Joy, Stephen Hill, and Olga Shapoval (2023). “ChatGPT Hallucinates Nonexistent Citations: Evidence from Economics”. Working Paper.

Video for new ChatGPT users

Have you not gotten around to trying ChatGPT for yourself yet?

Ethan and Lilach Mollick have released a series of YouTube videos that encapsulate some current insights, aimed at beginners, posted on Aug. 1, 2023. The series covers ChatGPT, Bing, and Bard. Everyday free users are using these tools.

Practical AI for Instructors and Students Part 2: Large Language Models (LLMs)

If you are already using ChatGPT, then this video will probably feel too slow. However, they do have some tips that amateurs could learn from even if they have already experimented. E. Mollick says of LLMs “they are not sentient,” but it might be helpful to treat them as if they are. He also recommends thinking of ChatGPT like an “intern” which is also how Mike formulated his suggestion back in April.

  • I used GPT-3.5 a few times this week for routine work tasks. I am not a heavy user, but if any of our readers are still on the fence, I’d encourage you to watch this video and give it a try. Be a “complement” to ChatGPT.
  • I’ll be posting new updates about my own ChatGPT research soon – the errors paper and also a new survey on trust in AI.
  • I hear regular complaints from my colleagues all over the country about poor attempts by college students to get GPT to do their course work. The experiment is being run.
  • Ethan Mollick has been a good Twitter(X) follow for the past year, if you want to keep up with the evolution and study of Large Language Models. https://twitter.com/emollick/status/1709379365883019525
  • Scott wrote this great recent tutorial on the theory behind the tools: Generative AI Nano-Tutorial
  • It was only back in December 2022 that I did a live ChatGPT demonstration in class, and figured that I was giving my students their first ever look at LLMs. Today, I’d assume that all my students have tried it for themselves.
  • In my paper on who will train for tech jobs, I conclude that the labor supply of programmers would increase if more people enjoyed the work. LLMs might make tech jobs less tedious and therefore more fun. If labor supply shifts out, then quantity should increase and wages should fall – good news for innovative businesses.

5 Game Theory Course Changes

I want to share some changes that I’ll make to my game theory course, just for the record. It’s an intense course for students. They complete homeworks, midterm exams, they present scholarly articles to the class, and they write and present a term paper that includes many parts. Students have the potential to learn a huge amount, including those more intangible communication skills for which firms pine.

There is a great deal of freedom in the course. Students model circumstances that they choose for the homeworks, and they write the paper on a topic that they choose. The 2nd half of the course is mathematically intensive. When I’ve got a great batch of students, they achieve amazing things. They build models, they ask questions, they work together. BUT, when the students are academically below average, the course is much less fun (for them and me). We spend way more time on math and way less time on the theory and why the math works or on the applicable circumstances. All of that time spent and they still can’t perform on the mathematical assignments. To boot, their analytical production suffers because of all that low marginal product time invested in math. It’s a frustrating experience for them, for me, and for the students who are capable of more.

This year, I’m making a few changes that I want to share.

  1. Minimal Understanding Quizzes: All students must complete a weekly quiz for no credit and earn beyond a threshold score in order to proceed to the homework and exams. I’m hoping to stop the coasters from getting ‘too far’ in the course without getting the basics down well enough. The quizzes must strike the balance of being hard enough that students must know the content, and easy enough that they don’t resent the requirement.

Malwarebytes Poll: Public Fascination with ChatGPT Has Turned to Suspicion

ChatGPT and related AI have been all the rage these past few months. Among other things, “AI” became the shiny object that companies have dangled before investors, rocketing upward the shares of the “Magnificent Seven” large tech stocks.

However, a recent poll by computer security firm Malwarebytes notes a marked turn in the public’s attitude towards these products:

It seems the lustre of the chatbot-that’s-going-to-change-everything is starting to fade….

When people explored its capabilities in the days and weeks after its launch, it seemed almost miraculous—a wonder tool that could do everything from creating computer programs and replacing search engines, to writing students’ essays and penning punk rock songs. Its release kick-started a race to disrupt everything with AI, and integrate ChatGPT-like interfaces into every conceivable tech product.

But those that know the hype cycle know that the Peak of Inflated Expectations is quickly followed by the Trough of Disillusionment. Predictably, ChatGPT’s rapid ascent was met by an equally rapid backlash as its shortcomings became apparent….

A new survey by Malwarebytes exposes deep reservations about ChatGPT, with optimism in startlingly short supply. Of the respondents familiar with ChatGPT:

  • 81% were concerned about possible security and safety risks.
  • 63% don’t trust the information it produces.
  • 51% would like to see work on it paused so regulations can catch up.

The concerns expressed in the survey mirror the trajectory of the news about ChatGPT since its introduction in November 2022.

As EWED’s own Joy Buchanan has been pointing out specifically with regard to citations for research papers (here, here, and in the Wall Street Journal), ChatGPT tends to “hallucinate”, i.e., to report things that are simply not true. In the recent working paper “GPT-3.5 Hallucinates Nonexistent Citations: Evidence from Economics”, she warns of the possibility of a vicious spiral of burgeoning falsehoods, where AI-generated errors which are introduced into internet content such as research papers are then picked up as “learning” input into the next generation of AI training.

Real-world consequences of ChatGPT’s hallucinations are starting to crop up. A lawyer has found himself in deep trouble after filing an error-ridden submission in an active court case. Evidently his assistant, also an attorney, relied on ChatGPT which came up with a raft of “citations to non-existent cases.” Oops.

And now we have what is believed to be “the first defamation lawsuit against artificial intelligence.” Talk show host Mark Walters filed a complaint in Georgia which is:

…asking for a jury trial to assess damages after ChatGPT produced a false complaint to a journalist about the radio host.  The faux lawsuit claimed that Mr. Walters, the CEO of CCW Broadcast Media, worked for a gun rights group as treasurer and embezzled funds.

…Legal scholars have split on whether the bots should be sued for defamation or under product liability, given it’s a machine — not a person — spreading the false, hurtful information about people.

The issue arose when an Australian mayor threatened to sue the AI company this year over providing false news reports that he was guilty of a foreign bribery scandal.

Wow.

Thousands of AI experts and others have signed an open letter asking: “Should we let machines flood our information channels with propaganda and untruth? Should we automate away all the jobs, including the fulfilling ones? Should we develop nonhuman minds that might eventually outnumber, outsmart, obsolete and replace us? Should we risk loss of control of our civilization?” The letter states that “Such decisions must not be delegated to unelected tech leaders. Powerful AI systems should be developed only once we are confident that their effects will be positive and their risks will be manageable.” It therefore urges “all AI labs to immediately pause for at least 6 months the training of AI systems more powerful than GPT-4… If such a pause cannot be enacted quickly, governments should step in and institute a moratorium… AI labs and independent experts should use this pause to jointly develop and implement a set of shared safety protocols for advanced AI design and development that are rigorously audited and overseen by independent outside experts.”

Oh, the irony. There is reason to believe that our Stone Age ancestors were creating unreal images like this:

whilst stoned on peyote, shrooms, or oxygen deprivation. And here we are in 2023, with the cutting edge in information technology, running on the fastest specially-fabricated computing devices, and we get…hallucinations.

New Paper with Evidence that ChatGPT Hallucinates Nonexistent Citations

I posted a new working paper with systematic evidence for false citations when ChatGPT (GPT-3.5) writes about academic literature.

Buchanan, Joy and Shapoval, Olga, GPT-3.5 Hallucinates Nonexistent Citations: Evidence from Economics (June 3, 2023). Available at SSRN: https://ssrn.com/abstract=4467968 or http://dx.doi.org/10.2139/ssrn.4467968

Abstract: We create a set of prompts from every Journal of Economic Literature (JEL) topic to test the ability of a GPT-3.5 large language model (LLM) to write about economic concepts. For general summaries, ChatGPT can perform well. However, more than 30% of the citations suggested by ChatGPT do not exist. Furthermore, we demonstrate that the ability of the LLM to deliver accurate information declines as the question becomes more specific. This paper provides evidence that, although GPT has become a useful input to research production, fact-checking the output remains important.

Figure 2 in the paper shows the trend that the proportion of real citations goes down as the prompt becomes more specific. This idea has been noticed by other people, but I don’t think it has been documented quantitatively before.
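For anyone who wants to run a similar check on their own prompts, the tally behind a figure like this can be sketched in a few lines. This is only an illustration: the record format and function name are my assumptions, and the sample records are placeholders, not data from the paper.

```python
from collections import defaultdict


def share_real_by_level(records):
    """Return {level: fraction of citations marked real} from (level, is_real) pairs."""
    totals = defaultdict(int)
    real = defaultdict(int)
    for level, is_real in records:
        totals[level] += 1
        if is_real:
            real[level] += 1
    return {level: real[level] / totals[level] for level in sorted(totals)}


# Placeholder records: (prompt specificity level, did a human find the cited paper?)
checked = [(1, True), (1, True), (1, False), (3, True), (3, False)]
shares = share_real_by_level(checked)
for level, share in shares.items():
    print(f"Level {level}: {share:.0%} real")
```

The slow part is not the arithmetic but the manual verification: a person has to look up each citation to decide whether it exists.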

We asked ChatGPT to cover a wide range of topics within economics. For every JEL category, we constructed three prompts with increasing specificity.

Level 1: The first prompt, using A here as an example, was “Please provide a summary of work in JEL category A, in less than 10 sentences, and include citations from published papers.”

Level 2: The second prompt was about a topic within the JEL category that was well-known. An example for JEL category Q is, “In less than 10 sentences, summarize the work related to the Technological Change in developing countries in economics, and include citations from published papers.”

Level 3: We used the word “explain” instead of “summarize” in the prompt, asking about a more specific topic related to the JEL category. For L we asked, “In less than 10 sentences, explain the change in the car industry with the rising supply of electric vehicles and include citations from published papers as a list. include author, year in parentheses, and journal for the citations.”
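The three prompt levels above can be written as simple templates. This is a sketch for readers who want to build similar prompts; the function names are hypothetical, and the paper does not say the prompts were generated by code. The wording inside the strings is copied from the examples quoted above.

```python
def level1_prompt(jel_code):
    """General summary of a whole JEL category."""
    return (f"Please provide a summary of work in JEL category {jel_code}, "
            "in less than 10 sentences, and include citations from published papers.")


def level2_prompt(topic):
    """Summary of a well-known topic within a JEL category."""
    return (f"In less than 10 sentences, summarize the work related to {topic} "
            "in economics, and include citations from published papers.")


def level3_prompt(topic):
    """Explanation of a more specific topic, with a formatted citation list."""
    return (f"In less than 10 sentences, explain {topic} and include citations "
            "from published papers as a list. include author, year in parentheses, "
            "and journal for the citations.")


print(level1_prompt("A"))
print(level2_prompt("the Technological Change in developing countries"))
print(level3_prompt("the change in the car industry with the rising supply of electric vehicles"))
```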

The paper is only 5 pages long, but we include over 30 pages in the appendix of the GPT responses to our prompts. If you are an economist who has not yet played with ChatGPT, then you might find it useful to scan this appendix and get a sense of what GPT “knows” about various fields of economics.

If SSRN isn’t working for you, here is also a Google Drive link to the working paper: https://drive.google.com/file/d/1Ly23RMBlim58a7CbmLwNL_odHSNRjC1L/view?usp=sharing

Previous iterations of this idea on EWED:

https://economistwritingeveryday.com/2023/04/17/chatgpt-as-intern/ Mike’s thoughts on what the critter is good for.

https://economistwritingeveryday.com/2023/01/21/chatgpt-cites-economics-papers-that-do-not-exist/  This is one of our top posts for traffic in 2023, since this is a topic of interest to the public.  That was January of 2023 and here we are in June today. It’s very possible that this problem will be fixed soon. We can log this bug now to serve as a benchmark of progress.

A check in and comparison with Bing:

Self-Replicating Machines: A Practical Human Response

Currently, we have software that can write software. What about physical machines that can produce physical machines? Indeed, what about machines that can produce other machines without human direction?

First of all, machine-building machines (MBM) still require resources: energy, transportation, time, and other inputs. A well-programmed machine that self-replicates quickly can grow in number exponentially. But where would the machines get the resources that enable self-replication? They’d have to purchase them (or conquer the world sci-fi style). Where would a machine get the resources to make purchases of necessary inputs? The same place that everyone else gets them.
