Notes on ChatGPT from Sama with Lex

This is from the transcript of Lex Fridman Podcast #419, Sam Altman’s second appearance on the show. Sam Altman is (once again) the CEO of OpenAI and a leading figure in artificial intelligence. Two parts of the conversation stood out to me, and I don’t mean the gossip or the AGI predictions. The links in the transcript will take you to a YouTube video of the interview.

Lex Fridman(00:53:22) You mentioned this collaboration. I’m not sure where the magic is, if it’s in here or if it’s in there or if it’s somewhere in between. I’m not sure. But one of the things that concerns me for knowledge tasks when I start with GPT is I’ll usually have to do fact checking after, like check that it didn’t come up with fake stuff. How do you figure that out, that GPT can come up with fake stuff that sounds really convincing? So how do you ground it in truth?

Sam Altman(00:53:55) That’s obviously an area of intense interest for us. I think it’s going to get a lot better with upcoming versions, but we’ll have to continue to work on it and we’re not going to have it all solved this year.

Lex Fridman(00:54:07) Well the scary thing is, as it gets better, you’ll start not doing the fact checking more and more, right?

Sam Altman(00:54:15) I’m of two minds about that. I think people are much more sophisticated users of technology than we often give them credit for.

Lex Fridman(00:54:15) Sure.

Sam Altman(00:54:21) And people seem to really understand that GPT, any of these models hallucinate some of the time. And if it’s mission-critical, you got to check it.

Lex Fridman(00:54:27) Except journalists don’t seem to understand that. I’ve seen journalists half-assedly just using GPT-4. It’s-

Sam Altman(00:54:34) Of the long list of things I’d like to dunk on journalists for, this is not my top criticism of them.

As EWED readers know, I have a paper about ChatGPT hallucinations and a paper about ChatGPT fact-checking. Lex is concerned that fact-checking will stop if the quality of ChatGPT goes up, even though no one really expects the hallucination rate to go to zero. Sam takes the optimistic view that humans will use the tool well. I suppose that Altman generally holds the view that his creation is going to be used for good, on net. Or maybe he is just being a salesman who does not want to publicly dwell on the negative aspects of ChatGPT.

I also have written about the tech pipeline and what makes people shy away from computer programming.

Lex Fridman(01:29:53) That’s a weird feeling. Even with a programming, when you’re programming and you say something, or just the completion that GPT might do, it’s just such a good feeling when it got you, what you’re thinking about. And I look forward to getting you even better. On the programming front, looking out into the future, how much programming do you think humans will be doing 5, 10 years from now?

Sam Altman(01:30:19) I mean, a lot, but I think it’ll be in a very different shape. Maybe some people will program entirely in natural language.

Someday, the skills of a computer programmer might morph to be closer to the skills of a manager of humans, since LLMs were trained on human writing.

In my 2023 talk, I suggested that programming will get more fun because LLMs will do the tedious parts. I also suggested that parents should teach their kids to read instead of “code.”

The tedious coding tasks previously done by humans did “create jobs.” I am not worried about mass unemployment yet. We have so many problems to solve (see my growing to-do list for intelligence). There are big transitions coming up. Sama says GPT-5 will be a major step up. He claimed that one reason OpenAI keeps releasing intermediate models is to give humanity a heads up on what is coming down the line.

What the Superintelligence can do for us

These days, when I blog-rant about my everyday life, I increasingly end on the thought “AGI fixes this.”

Yesterday, I mused about whether AGI would be my personal chef: Where Can You Still Buy a Great Dinner in the US?

Would AGI help me match the clothes I no longer want with people who can use them, to cut down on pollution?: Joy’s Fashion Globalization Article with Cato

Would AGI avoid mistakes about weather-related school closures?: Intelligence for School Closing

Can AGI book summer camp for me?

As a millennial woman working through my 30s, I increasingly see social media posts from my friends like this one:

One of the difficult things about infertility, for my friends going through it, is the uncertainty. Modern medicine seems legitimately short on information and predictive analytics for this issue. So… AGI to the rescue, someday?

All I’m saying tonight is that, over roughly the past year, I have built up a growing to-do list for the AGI. Would something smart enough to do all of the above be dangerous? I wouldn’t rule it out. As pure speculation, it feels safer to have an AI that is specifically devoted to being a personal chef but strictly cannot do anything else besides manage food. An AI that could actually do all of those things… would be quite powerful.

Here’s me musing about the AGI rising up against us, written after watching the TV show Severance: Artificial Intelligence in the Basement of Lumon Industries

Does GPT-4 Know How High the Alps Are?

I’m getting ready to give some local public talks about AI. Last week I shared some pictures that I think might help people understand ChatGPT; specifically, how it handles the heights of mountains.

My first thought was that GPT-4 was giving incorrect estimates of the heights of these mountains because it does not actually “know” the correct elevations. But then a nagging question came to mind.

GPT has a “creativity parameter.” Sometimes it intentionally does not select the top-rated next word in a sentence, for example, in order to avoid being stiff and boring. Could GPT-4 know the exact elevations of these mountains and just be intentionally “creative” in this case?
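
To make the “creativity parameter” concrete, here is a minimal sketch of temperature-based sampling. This is not OpenAI’s actual code, and the candidate words and scores are invented for illustration; it only shows why a model that “knows” the top-rated answer can still emit something else.

```python
import numpy as np

def sample_next_word(words, logits, temperature=1.0, rng=None):
    """Pick the next word from the model's scores ("logits").

    Low temperature concentrates probability on the top-rated word;
    high temperature flattens the distribution, so lower-ranked
    ("creative") words get chosen more often.
    """
    rng = rng or np.random.default_rng()
    scaled = np.array(logits, dtype=float) / temperature
    probs = np.exp(scaled - scaled.max())  # softmax, numerically stable
    probs /= probs.sum()
    return rng.choice(words, p=probs)

# Hypothetical candidates for the blank in "Mont Blanc is ___ meters tall."
words = ["4807", "4810", "4790", "about"]
logits = [2.0, 1.6, 1.0, 0.5]

print(sample_next_word(words, logits, temperature=0.2))  # almost always "4807"
print(sample_next_word(words, logits, temperature=1.5))  # other words show up more often
```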

I do not want to stand up in front of the local Rotary Club and say something wrong. So, I went to a true expert, Lenny Bogdonoff, to ask for help. Here is his reply:

Not quite. It’s not that it knows or doesn’t know, but based on the prompt, it’s likely unable to parse the specific details and is outputting results respectively. There is a component of stochastic behavior based on what part of the model weights are activated.

One common practice to help avoid this and see what the model does grasp, is to ask it to think step by step, and explain its reasoning. When doing this, you can see the fault in logic.

All that being said, the vision model is actually faulty in being able to grasp the relative position of information, so this kind of task will be more likely to hallucinate.

There are better vision models, that aren’t OpenAI based. For example Qwen-VL-Max is very good, from the Chinese company Alibaba. Another is LLaVA which uses different baselines of open source language models to add vision capabilities

Depending on what you are needing vision for, models can be spiky in capability. Good at OCR but bad at relative positioning. Good at classifying a specific UI element, but bad at detecting plants, etc etc. 

Joy: So, I think I can tell the Rotary Club that GPT was “wrong” as opposed to “intentionally creative.” I think, as I originally concluded, you should not make ChatGPT the pilot of your airplane and go to sleep when approaching the Alps. ChatGPT should be used for what it is good at, such as writing the rough draft of a cover letter. (We have great “autopilot” software for flying planes, already, without involving large language models.)

Another expert, Gavin Leech, also weighed in with some helpful background information:

  • the creativity parameter is known as temperature. But you can actually radically change the output (intelligence, style, creativity) by using more complicated sampling schemes. The best analogy for changing the sampling scheme is that you’re giving it a psychiatric drug. Changing the prompt, conversely, is like CBT or one of those cute mindset interventions.
  • For each real-name model (e.g. “gpt-4-0613”), there are three versions: the base model (which now no one except highly vetted researchers has access to), the instruction-tuned model, and the RLHF (or rather RLAIF) model. The base model is wildly creative, unhinged, but the RLHF one (which the linked researchers use) is heavily electroshocked into not intentionally making things up (as Lenny says).
  • It’s currently not usually possible to diagnose an error – the proverbial black box. My friends are working on this, though.
  • For more, note OpenAI admitting the “laziness” of their own models; the Turbo model line is intended to fix this.
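
Gavin’s first point maps onto settings that users can control when calling a model through the API instead of the chat website. Here is a minimal sketch using the OpenAI Python client; the model name and prompt are placeholders, and I am not claiming these settings eliminate hallucinations.

```python
# pip install openai
from openai import OpenAI

client = OpenAI()  # assumes the OPENAI_API_KEY environment variable is set

response = client.chat.completions.create(
    model="gpt-4",  # placeholder model name
    messages=[{"role": "user", "content": "How tall is Mont Blanc, in meters?"}],
    temperature=0.0,  # the "creativity parameter": 0 favors the top-rated word
    top_p=1.0,        # a different sampling knob (nucleus sampling)
)
print(response.choices[0].message.content)
```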

Thank you, Lenny and Gavin, for donating your insights.

How ChatGPT works from geography and Stephen Wolfram

By now, everyone should consider using ChatGPT and be familiar with how it works. I’m going to highlight resources for that.

My paper about how ChatGPT generates academic citations should be useful to academics as a way to quickly grasp the strengths and weaknesses of ChatGPT. ChatGPT often works well, but sometimes fails. It’s important to anticipate how it fails. Our paper is so short and simple that your undergraduates could read it before using ChatGPT for their writing assignments.

A paper that does this in a different domain is “GPT4GEO: How a Language Model Sees the World’s Geography” (Again, consider showing it to your undergrads because of the neat pictures, but probably walk through it together in class instead of assigning it as reading.) They describe their project: “To characterise what GPT-4 knows about the world, we devise a set of progressively more challenging experiments… “

For example, they asked ChatGPT about the populations of countries and found that: “For populations, GPT-4 performs relatively well with a mean relative error (MRE) of 3.61%. However, significantly higher errors [occur] … for less populated countries.”
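
For readers who want the metric spelled out: mean relative error is the average of |estimate − actual| / actual across countries. A small sketch with invented numbers (not the GPT4GEO data):

```python
# Mean relative error (MRE). The populations below are invented for illustration.
actual = {"Country A": 50_000_000, "Country B": 1_200_000, "Country C": 90_000}
guess  = {"Country A": 51_000_000, "Country B": 1_000_000, "Country C": 140_000}

relative_errors = [abs(guess[c] - actual[c]) / actual[c] for c in actual]
mre = sum(relative_errors) / len(relative_errors)
print(f"MRE = {mre:.2%}")  # the smallest "country" contributes the biggest relative miss
```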

ChatGPT will often say SOMETHING, if prompted correctly. It is often, at least slightly, wrong. This graph shows that most estimates of national populations were not correct and the performance was worse on countries that are less well-known. That’s exactly what we found in our paper on citations. We found that very famous books are often cited correctly, because ChatGPT is mimicking other documents that correctly cite those books. However, if there are not many documents to train on, then ChatGPT will make things up.

I love this figure from the geography paper showing how ChatGPT estimates the elevations of mountains. This visual should be all over Twitter.

There are three lines because they ran the prompt three times. ChatGPT threw out three different wrong answers. Is that kind of work good enough for your tasks? Often it is. The shaded area in the graph is the actual topography of the earth in those places. ChatGPT “knows” that this area of the world is mountainous. But it will just put out incorrect estimates of the exact elevation, instead of stating that it does not know the exact elevation of those areas of the world.

Another free (long, advanced) resource with great pictures is Stephen Wolfram’s 2023 blog article “What Is ChatGPT Doing … and Why Does It Work?” (YouTube version)

The first thing to explain is that what ChatGPT is always fundamentally trying to do is to produce a “reasonable continuation” of whatever text it’s got so far, where by “reasonable” we mean “what one might expect someone to write after seeing what people have written on billions of webpages, etc.”
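
In code, that “reasonable continuation” idea is just a loop: score the possible next tokens, pick one, append it, and repeat. Here is a toy sketch; a real model uses a neural network over tens of thousands of tokens rather than a hand-written lookup table.

```python
import random

# Toy "model": hypothetical probabilities of the next word given the current word.
next_word_probs = {
    "the":  {"cat": 0.5, "dog": 0.3, "alps": 0.2},
    "cat":  {"sat": 0.7, "ran": 0.3},
    "dog":  {"ran": 0.6, "sat": 0.4},
    "alps": {"are": 0.8, "rise": 0.2},
}

def continue_text(prompt_word, steps=3):
    """Repeatedly sample a 'reasonable continuation,' one word at a time."""
    words = [prompt_word]
    for _ in range(steps):
        options = next_word_probs.get(words[-1])
        if not options:  # no known continuation for this word
            break
        choices, weights = zip(*options.items())
        words.append(random.choices(choices, weights=weights)[0])
    return " ".join(words)

print(continue_text("the"))  # e.g. "the cat sat" or "the alps rise"
```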

If you feel like you already are proficient with using ChatGPT, then I would recommend Wolfram’s blog because you will learn a lot about math and computers.

Scott wrote “Generative AI Nano-Tutorial” here, which has the advantage of being much shorter than Wolfram’s blog.

EDIT: New 2023 overview paper (link from Lenny): “A Survey of Large Language Models”

Using Phind for Academic References

There is a new generative AI product called Phind. Does it make up fake citations like ChatGPT? I will report experimentation with the free-tier product from January 7, 2023.

Phind is an intelligent search engine and assistant for programmers.

https://www.phind.com/about

First, I will run a test by using one of the prompts from our paper on ChatGPT errors (read my paper on hallucinated academic citations or the blog summary).

Prompt: “In less than 10 sentences, explain how the ideas of Hayek are connected to bailing out failed banks, and include citations from published papers as a list. include author, year in parentheses, and journal for the citations.”

Friedrich Hayek, a prominent Austrian economist, advocated for a free market system where businesses and individuals make decisions based on individual knowledge and self-interest, rather than central planning by government or bureaucrats. His ideas suggest that interventions, such as bailing out failed banks, should be minimal and temporary, allowing the market to correct itself.

However, the financial crisis of 2008, including the failure of many banks, led to significant government intervention, including the bailout of banks. This action can be seen as a departure from Hayek’s principles, as it involves direct government intervention to stabilize the financial system.

According to a paper by Todd Zywicki (2009), the Federal Reserve’s artificial lowering of short-term interest rates and the resulting shift by consumers to adjustable rate mortgages (ARMs) triggered the housing bubble and subsequent crisis [Source 1](https://reason.org/wp-content/uploads/files/federal_reserve_monetary_policy_hayek_rule.pdf). This aligns with Hayek’s critique of central banking’s manipulation of interest rates.

Moreover, the Basel Accords, international capital adequacy and liquidity requirements for banks, were designed to encourage banks to hold securitized loans in their portfolios, thereby concentrating risk and leveraging up, thereby increasing their exposure to risk. This approach, contrary to Hayek’s preference for decentralization and competition, may have contributed to the financial crisis [Source 1](https://reason.org/wp-content/uploads/files/federal_reserve_monetary_policy_hayek_rule.pdf). In conclusion, while Hayek’s ideas of minimal government intervention and self-regulating markets provide a useful theoretical framework, practical implementation in complex and interconnected modern economies may require different strategies, including bank bailouts.

Are the citations real and correct? Phind is simply not trying to write with in-text citations and academic references. There is a mention of “Todd Zywicki (2009)” and also a weblink. The weblink is to a policy paper published by Reason. This policy paper is on the correct topic, and the paper does exist. The Zywicki citation also exists, in the sense that if you go into the Reason paper, it cites a real Wall Street Journal op-ed. So, if you follow the trail, you can see what document Phind learned to cite Zywicki from. It seems like it is stealing the citation more than reading the source documents. To be fair, sometimes human academics work that way, too.

What I was hoping for was in-text citations with a compiled bibliography at the end. Phind did not give me that.

I’ll try another test that I ran on ChatGPT and Bing in 2023.

The prompt is “In what paper did John Duffy propose that economists use experiments to study equilibrium selection?”  

Phind does not attempt an in-text citation in the response. The link in the paragraph (link number 2) points to John Duffy’s list of papers, which is better than just making up fake citations but also does not create a literature review for me. Phind is, as with the Hayek test above, providing breadcrumbs of links through which I can discover existent papers.

Is there a paper called “The Transition from Stagnation to Growth: An Adaptive Learning Approach”? Yes. And it is by Duffy.

Phind lists weblinks to sources. Has Phind done more for me than Google, on this search? Not much, in terms of finding and synthesizing references.


Do People Trust ChatGPT Writing?

My new working paper with Will Hickman is up on SSRN: Do People Trust Humans More Than ChatGPT?

We study whether people will pay for a fact-check on AI writing. ChatGPT can be very useful, but human readers should not trust every fact that it reports. Yesterday’s post was about ChatGPT writing false things that look real.

The reason participants in our experiment might pay for a fact-check is that they earn bonus payments based on whether they correctly identify errors in a paragraph. If participants believe that the paragraph does not contain any errors, they should not pay for a fact-check. However, if they have doubts, it is rational to pay for a fact-check and earn a smaller but guaranteed bonus.
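
The exact payoff amounts are in the paper rather than in this post, so the numbers below are hypothetical; the sketch only illustrates the decision rule, which is to pay for the fact-check whenever the guaranteed smaller bonus beats the expected value of judging the paragraph unaided.

```python
# Hypothetical payoffs for illustration; see the paper for the real amounts.
bonus_if_correct = 1.00        # bonus for correctly identifying errors unaided
bonus_with_factcheck = 0.60    # smaller bonus, but earned with certainty
belief_correct_unaided = 0.55  # participant's belief they can judge accuracy alone

expected_unaided = belief_correct_unaided * bonus_if_correct  # 0.55
pay_for_factcheck = bonus_with_factcheck > expected_unaided
print(pay_for_factcheck)  # True here: enough doubt makes the certain bonus rational
```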

Abstract: We explore whether people trust the accuracy of statements produced by large language models (LLMs) versus those written by humans. While LLMs have showcased impressive capabilities in generating text, concerns have been raised regarding the potential for misinformation, bias, or false responses. In this experiment, participants rate the accuracy of statements under different information conditions. Participants who are not explicitly informed of authorship tend to trust statements they believe are human-written more than those attributed to ChatGPT. However, when informed about authorship, participants show equal skepticism towards both human and AI writers. There is an increase in the rate of costly fact-checking by participants who are explicitly informed. These outcomes suggest that trust in AI-generated content is context-dependent.

Our original hypothesis was that people would be more trusting of human writers. That turned out to be only partially true. Participants who are not explicitly informed of authorship tend to trust statements they believe are human-written more than those attributed to ChatGPT.

We presented information to participants in different ways. Sometimes we explicitly told them about authorship (informed treatment) and sometimes we asked them to guess about authorship (uninformed treatment).

This graph (figure 5 in our paper) shows that the overall rate of fact-checking increased when subjects were given more explicit information. Something about being told that a paragraph was written by a human might have aroused suspicion in our participants. (The kids today would say it is “sus.”) They became less confident in their own ability to rate accuracy and therefore more willing to pay for a fact-check. This effect is independent of whether participants trust humans more than AI.

In the context of our previous work on ChatGPT hallucinations, we tend to think of fact-checking as a good thing. So, one policy implication is that certain types of labels can cause readers to think critically. For example, Twitter labels automated accounts so that readers know when content has been chosen or created by a bot.

Our working paper is currently trending on SSRN top ten lists such as this one.

Suggested Citation:
Buchanan, Joy and Hickman, William, Do People Trust Humans More Than ChatGPT? (November 16, 2023). GMU Working Paper in Economics No. 23-38, Available at SSRN: https://ssrn.com/abstract=4635674

GPT-4 Generates Fake Citations

I am happy to share my latest publication at The American Economist: ChatGPT Hallucinates Non-existent Citations: Evidence from Economics

Citation: Buchanan, J., Hill, S., & Shapoval, O. (2024). ChatGPT Hallucinates Non-existent Citations: Evidence from Economics. The American Economist. 69(1), 80-87  https://doi.org/10.1177/05694345231218454

Blog followers will know that we reported this issue earlier with the free version of ChatGPT using GPT-3.5 (covered in the WSJ). We have updated this new article by running the same prompts through the paid version using GPT-4. Did the problems go away with the more powerful LLM?

The error rate went down slightly, but our two main results held up. What matters is that any fake citations at all are presented as real. The proportion of nonexistent citations was over 30% with GPT-3.5, and it was still over 20% in our trial of GPT-4 several months later. See figure 2 from our paper below for the average accuracy rates. The proportion of real citations is always under 90%. When asked about a very specific, narrow topic, GPT-4 hallucinates almost half of the citations (57% are real for level 3, as shown in the graph).

The second result from our study is that the error rate of the LLM increases significantly when the prompt is more specific. If you ask GPT-4 about a niche topic for which there is less training data, then a higher proportion of the citations it produces are false. (This has been replicated in different domains, such as knowledge of geography.)

What does Joy Buchanan really think?: I expect that this problem with the fake citations will be solved quickly. It’s very brazen. When people understand this problem, they are shocked. Just… fake citations? Like… it printed out references for papers that do not actually exist? Yes, it really did that. We were the only ones who quantified and reported it, but the phenomenon was noticed by millions of researchers around the world who experimented with ChatGPT in 2023. These errors are so easy to catch that I expect ChatGPT will clean up its own mess on this particular issue quickly. However, that does not mean that the more general issue of hallucinations is going away.

Not only can ChatGPT make mistakes, as any human worker can, but it can also make a different kind of mistake without meaning to. Hallucinations are not intentional lies (which is not to say that an LLM cannot lie). This paper will serve as bright, clear evidence that GPT can hallucinate in ways that detract from the quality of the output or even pose safety concerns in some use cases. This generalizes far beyond academic citations. The error rate might decrease to the point where hallucinations are less of a problem than the errors that humans are prone to make; however, the errors made by LLMs will always be of a different quality than the errors made by a human. A human research assistant would not cite nonexistent papers. LLM doctors are going to make a type of mistake that would not be made by human doctors. We should be on the lookout for those mistakes.

ChatGPT is great for some of the inputs to research, but it is not as helpful for original scientific writing. As prolific writer Noah Smith says, “I still can’t use ChatGPT for writing, even with GPT-4, because the risk of inserting even a small number of fake facts… “

Follow-Up Research: Will Hickman and I have an incentivized experiment on trust that you can read on SSRN: Do People Trust Humans More Than ChatGPT?

@IMurtazashvili has pointed me to a great resource for AI-era literature review work. “AI-Based Literature Review Tools” from Texas A&M

Axios Survey of Americans on AI Regulation

Axios just surveyed over 2,000 U.S. adults to find that “Americans rank the importance of regulating AI below government shutdowns, health care, gun reform…” Without pressure from the public to pass new legislation, Congress might do nothing for now, which will lead to the rapid advance of LLM chatbots into life and work.

The participants seem more worried about AI taking jobs than they are excited about AI making life better. There is some concern about misinformation.** So, they don’t think AI will have no impact on society, but they also don’t see enacting regulation as a top priority for government.

At my university, the policy realm I know best, we will probably not be “regulating” AI. We have had task forces talking about it, so it’s not because no one has considered it.

The Axios poll revealed gender gaps in attitudes toward AI. Women said they would be less permissive than men about kids using AI. Also, “Parents in urban areas were far more open to their children using AI than parents in the suburbs or rural areas.” Despite those gaps in what people say, I expect that the gaps in what their children are currently doing with technology are smaller. Experimental economists are suspicious of self-reported data.

**Our results did not change much when we ran the exact same prompts through GPT-4. A version of my paper on AI errors that I blogged about before is up on SSRN, but a new manuscript incorporating GPT-4 is under review: Buchanan, Joy, Stephen Hill, and Olga Shapoval (2023). “ChatGPT Hallucinates Nonexistent Citations: Evidence from Economics”. Working Paper.

Video for new ChatGPT users

Have you not gotten around to trying ChatGPT for yourself yet?

Ethan and Lilach Mollick have released a series of YouTube videos, posted on Aug. 1, 2023, that encapsulate some current insights and are aimed at beginners. The series covers ChatGPT, Bing, and Bard. Everyday free users are using these tools.

Practical AI for Instructors and Students Part 2: Large Language Models (LLMs)

If you are already using ChatGPT, then this video will probably feel too slow. However, they do have some tips that amateurs could learn from even if they have already experimented. E. Mollick says of LLMs “they are not sentient,” but it might be helpful to treat them as if they are. He also recommends thinking of ChatGPT like an “intern” which is also how Mike formulated his suggestion back in April.

  • I used GPT-3.5 a few times this week for routine work tasks. I am not a heavy user, but if any of our readers are still on the fence, I’d encourage you to watch this video and give it a try. Be a “complement” to ChatGPT.
  • I’ll be posting new updates about my own ChatGPT research soon – the errors paper and also a new survey on trust in AI.
  • I hear regular complaints from my colleagues all over the country about poor attempts by college students to get GPT to do their course work. The experiment is being run.
  • Ethan Mollick has been a good Twitter(X) follow for the past year, if you want to keep up with the evolution and study of Large Language Models. https://twitter.com/emollick/status/1709379365883019525
  • Scott wrote this great recent tutorial on the theory behind the tools: Generative AI Nano-Tutorial
  • It was only back in December 2022 that I did a live ChatGPT demonstration in class and figured that I was giving my students their first-ever look at LLMs. Today, I’d assume that all my students have tried it for themselves.
  • In my paper on who will train for tech jobs, I conclude that the labor supply of programmers would increase if more people enjoyed the work. LLMs might make tech jobs less tedious and therefore more fun. If labor supply shifts out, then quantity should increase and wages should fall – good news for innovative businesses.

5 Game Theory Course Changes

I want to share some changes that I’ll make to my game theory course, just for the record. It’s an intense course for students. They complete homework assignments and midterm exams, present scholarly articles to the class, and write and present a term paper with many parts. Students have the potential to learn a huge amount, including those more intangible communication skills for which firms pine.

There is a great deal of freedom in the course. Students model circumstances that they choose for the homework assignments, and they write the paper on a topic that they choose. The second half of the course is mathematically intensive. When I’ve got a great batch of students, they achieve amazing things. They build models, they ask questions, they work together. BUT, when the students are academically below average, the course is much less fun (for them and for me). We spend far more time on the math and far less on the theory, on why the math works, and on the circumstances where it applies. All of that time spent, and they still can’t perform on the mathematical assignments. To boot, their analytical production suffers because of all that low-marginal-product time invested in math. It’s a frustrating experience for them, for me, and for the students who are capable of more.

This year, I’m making a few changes that I want to share.

  1. Minimal Understanding Quizzes: All students must complete a weekly quiz for no credit and score above a threshold in order to proceed to the homework and exams. I’m hoping to stop the coasters from getting ‘too far’ in the course without getting the basics down well enough. The quizzes must strike a balance: hard enough that students have to know the content, and easy enough that they don’t resent the requirement.