Data Heroes

How often do we hear about “data heroes”? As a data analytics teacher, I find the phrase thrilling. Bloomberg reported on the Data Heroes of Covid this week.

One of the terrible things about Covid-19 from the perspective of March 2020 was how little we knew. The disease could kill people. We knew the 34-year-old whistleblower doctor in China had died of it. We knew the disease had caused significant disruption in China and Italy. There were so many horror scenarios that seemed possible and so little data with which to make rational decisions.

The United States has government agencies tasked with collecting and sharing data on diseases. The CDC did not make a strong showing here (would they argue they need more funding?). I don’t know if “fortunately” is the right word here, but fortunately private citizens rose to the task.

The Covid Tracking Project gathers and releases data on what is actually happening with Covid and health outcomes. They clearly present the known facts and update everything as fast as possible. The scientific community and even the government rely on this data source.

Healthcare workers have correctly been saluted as heroes throughout the pandemic. The data heroes volunteering their time deserve credit, too. Lastly, I’d like to give credit to Tyler Cowen for working so hard to sift through research and deliver relevant data to the public.

Third Quarter Check-In: COVID and GDP

How have countries around the world fared so far in the COVID-19 pandemic? There are many ways to measure this, but two important measures are the number of deaths from the disease and economic growth.

Over the past few weeks, major economies have started releasing data for GDP in the third quarter of 2020, which gives us the opportunity to “check in” on how everyone is doing.

Here is one chart I created to try to visualize these two measures. For GDP, I calculated how much GDP was “lost” in 2020, compared with maintaining the level from the fourth quarter of 2019 (what we might call the pre-COVID times). For COVID deaths, I use officially coded deaths by each country through Nov. 15 (I know that’s not the end of Q3, but I think it’s better than using Sept. 30, as deaths have a fairly long lag from infections).
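For the curious, here is a minimal sketch of how a chart like this could be put together with pandas and matplotlib. The figures in it are placeholders for illustration, not the actual data behind my chart.

```python
# A rough sketch (not the actual data) of how the chart could be built.
import pandas as pd
import matplotlib.pyplot as plt

# Quarterly real GDP indexed so that 2019 Q4 = 100 (placeholder values).
gdp = pd.DataFrame({
    "country": ["Germany", "Sweden", "United States"],
    "q1_2020": [98.0, 99.7, 98.7],
    "q2_2020": [88.3, 91.8, 89.9],
    "q3_2020": [96.6, 96.1, 96.4],
})

# GDP "lost" = average shortfall relative to holding the 2019 Q4 level,
# across the three quarters of 2020 reported so far.
quarters = ["q1_2020", "q2_2020", "q3_2020"]
gdp["gdp_lost_pct"] = (100 - gdp[quarters]).sum(axis=1) / len(quarters)

# Officially coded COVID deaths per million through Nov. 15 (placeholders).
deaths = pd.DataFrame({
    "country": ["Germany", "Sweden", "United States"],
    "deaths_per_million": [150, 610, 750],
})

df = gdp.merge(deaths, on="country")

# Scatter only -- deliberately no fitted line, since this is a visualization,
# not a causal claim.
fig, ax = plt.subplots()
ax.scatter(df["deaths_per_million"], df["gdp_lost_pct"])
for _, row in df.iterrows():
    ax.annotate(row["country"], (row["deaths_per_million"], row["gdp_lost_pct"]))
ax.set_xlabel("Official COVID deaths per million (through Nov. 15)")
ax.set_ylabel("Average GDP shortfall vs. 2019 Q4 (%)")
plt.show()
```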

One major caution: don’t interpret this chart as one variable causing the other. It’s just a way to visualize the data (notice I didn’t try to fit a line). Also, neither measure is perfect. GDP is always imperfect, and may be especially so during these strange times. Officially coded COVID deaths aren’t perfect, though in most countries measures such as excess deaths indicate these probably understate the real death toll.

You can draw your own conclusions from this data, and also bear in mind that right now many countries in Europe and the US are seeing a major surge in deaths. We don’t know how bad it will be.

Here’s what I observe from the data. The countries that have performed the worst are the major European countries, with the very notable exception of Germany. I won’t attribute this all to policy; let’s call it a mix of policy and bad luck. Germany sits in a rough grouping with many Asian developed countries and Scandinavia (with the notable exception of Sweden, more on this later) among the countries that have weathered the crisis the best (relatively low death rates, though GDP performance varies a lot).

And then we have the United States. Oddly, the country we seem to fit closest with is… Sweden. Death rates similar to most of Western Europe, but GDP losses similar to Germany, Japan, Denmark, and even close to South Korea. (My groupings are a bit imperfect. For example, Japan and South Korea have had much lower death tolls than Germany or Denmark, but I think it is still useful.)

To many observers, this may seem strange. Sweden followed a mostly laissez-faire approach, while most US states imposed restrictions on movement and business that mirrored Western Europe. Some in the US have advocated that the US copy the approach of Sweden, even though Sweden seems to be moving away from that approach in their second wave.

Counterfactuals are hard in the social sciences. They are even harder during a public health crisis. It’s really hard to say what would have happened if the US followed the approach of Sweden, or if Sweden followed the approach of Taiwan. So I’m trying hard not to reach any firm conclusions. To me, it seems safe to say that in the US, public policy has been largely bad and ineffective (fairly harsh restrictions that didn’t do much good in the end), yet the US has (so far) fared better than much of Europe.

All of this could change. But let’s be cautious about declaring victory or defeat at this point.

Coda on Sweden Deaths

Are the officially coded COVID deaths in Sweden an accurate count? One thing we can look to is excess deaths, such as those reported by the Human Mortality Database. What we see is that Swedish COVID deaths do almost perfectly match the excess deaths (the excess over historical averages): around 6,000 deaths more than expected.

Some have suggested that the high COVID deaths for Sweden are overstated because Sweden had lower than normal deaths in recent years, particularly 2019. This has become known as the “dry tinder” theory, for example as stated in a working paper by Klein, Book, and Bjornskov (disclosure: Dan Klein was one of my professors in grad school, and is also the editor of the excellent Econ Journal Watch, where I have been published twice).

But even the Klein et al. paper only claims that the “dry tinder” factor can account for 25-50% of the deaths (I have casually looked at the data, and this seems about right to me). Thus, perhaps in the chart above, we can move Sweden down a bit, bringing them closer to the Germany-Asia-Scandinavia group. Still, even with this correction, Sweden has 2.5x the death rate of Denmark (rather than 5x) and 5x the death rate of Finland (rather than 10x, as with officially coded deaths).
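To make the arithmetic concrete, here is a back-of-the-envelope sketch of the adjustment. The per-capita death rates are placeholders chosen only so the unadjusted ratios match the 5x and 10x figures above; they are not precise official statistics.

```python
# Back-of-the-envelope version of the "dry tinder" adjustment.
sweden_rate = 60.0   # officially coded COVID deaths per 100,000 (placeholder)
denmark_rate = 12.0  # placeholder
finland_rate = 6.0   # placeholder

for tinder_share in (0.25, 0.50):
    adjusted = sweden_rate * (1 - tinder_share)
    print(f"dry tinder = {tinder_share:.0%}: "
          f"Sweden/Denmark = {adjusted / denmark_rate:.1f}x, "
          f"Sweden/Finland = {adjusted / finland_rate:.1f}x")
```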

As with all things right now, we should reserve judgement until the pandemic is over (Sweden’s second wave looks like it could be pretty bad). The “dry tinder” factor (a term I personally dislike) is worth considering, as we all try to better understand the data on how countries have performed in this crisis.

Twitter flags versus censorship

Throughout this semester, I have asked some students in my data analytics class to think about how data is relevant to current events. Undergraduate Jack Brittle wrote this article about data and election news.

Sometimes public attention moves on quickly after an election is over. Today, November 15th, voting and messaging are still being debated. It was a month ago, on October 14th, that Twitter locked the account of the New York Post, a right-leaning newspaper.

This was an important development in the debate about whether tech companies have the authority to censor posts written by users.

“Twitter initially said that linking to the Post stories violated the social-media company’s policies against posting material that contains personal information and is obtained via hacking. As the story broke, Twitter began preventing users from tweeting the stories. Twitter locked the Post’s account, saying it would be unlocked only after it deleted earlier tweets that linked to the stories.” (Wall Street Journal)

Twitter suspended a major American newspaper. This move is viewed by some as a direct threat to the freedom of the press. Twitter and other major tech companies came under fire for their ability to manipulate and control media. After major pressure and backlash, Twitter released the account back to the New York Post.

“Twitter on Friday unlocked the New York Post’s Twitter account, ending a stalemate between the social-media company and the newspaper stemming from the latter’s publication of stories it said were based on documents obtained from the laptop of Hunter Biden. ‘We’re baaaaaaack,’ the Post’s Twitter account tweeted on Friday afternoon, just minutes after Twitter said that it was reversing its policies in a way that would allow the Post to be reinstated.” (Wall Street Journal)

Twitter backed down in that instance, but the fundamental question has not been resolved: should Big Tech censor material on their platforms?

First, there is a school of thought that believes Twitter has the right to control the flow of information on its platform. Companies like Twitter are not breaking any laws by doing this. Do they not have the right to support and defend certain social causes? By choosing which opinions and facts users see, Twitter can choose to support different policies. It is not the law but our expectations of media that lead to controversy: many feel Twitter should allow a free flow of information in order to create an open marketplace of ideas.

However, a new difficulty arises because of “fake news”. Now more than ever, media can be manipulated by nefarious users to create certain storylines. According to studies, “fake news” spreads nearly six times faster across digital platforms than real news stories. This leaves Twitter between a rock and a hard place. Do they control information spreading on their site and risk censoring the wrong material, as some consider to be the case with the New York Post article? Or do they take a hands-off approach, allowing all stories to have a place in the arena?

These giant tech firms have unprecedented power. Not only are they gaining massive amounts of data about people and firms, they also have the unique ability to shape their users’ outlook on a variety of ideas and events. These data giants are struggling with how to manage these capabilities and will no doubt continue to update and reform policy.

An example of evolving policies is the treatment of President Trump since November 3rd. Since election day, Donald Trump has tweeted challenges to official vote counts. Trump has not only claimed voter fraud but also claimed he has won states where vote counts favor Joe Biden. Twitter has since developed a flagging system that adds a note to any tweet that Twitter deems misleading. Instead of censoring the president by locking the entire account, there are flags warning about disinformation. This system seems to be an improvement over previous ways Twitter has handled misleading information. It allows users to see all information while also being warned about potentially questionable content. I expect these policies to continue to evolve as tech companies grapple with the difficult task of managing the flow of information.

When will computers accurately predict elections?

Why can computers beat humans at chess but not predict election outcomes with great precision? Experts in 2020 mostly forecast that Biden would win by a large enough margin to avoid the kind of quibbling and recounts we are now seeing. I don’t write this as a criticism of the clever, high-profile Nate Silver, or of any other forecaster. I’m thinking through it as a data scientist.

First, consider a successful application of modern data mining. How did AlphaZero “learn” to play chess? It generated millions of hypothetical games and decided to use the strategies that looked successful ex-post. AlphaZero has excellent data and lots of it.
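If you want a feel for the “generate games, keep what works” idea without AlphaZero’s neural networks and tree search, here is a toy sketch: it rates each possible opening move in tic-tac-toe simply by playing out many random games and counting wins. This is obviously not how AlphaZero actually works, just a miniature of learning from self-generated data.

```python
# Toy illustration only: not AlphaZero's actual method (which combines neural
# networks with tree search), just the core idea of generating many games and
# keeping the moves that look successful after the fact.
import random

WIN_LINES = [(0, 1, 2), (3, 4, 5), (6, 7, 8),   # rows
             (0, 3, 6), (1, 4, 7), (2, 5, 8),   # columns
             (0, 4, 8), (2, 4, 6)]              # diagonals

def winner(board):
    for a, b, c in WIN_LINES:
        if board[a] != " " and board[a] == board[b] == board[c]:
            return board[a]
    return None

def random_playout(board, player):
    """Play random moves to the end; return 'X', 'O', or None for a draw."""
    board = board[:]
    while True:
        w = winner(board)
        if w or " " not in board:
            return w
        move = random.choice([i for i, s in enumerate(board) if s == " "])
        board[move] = player
        player = "O" if player == "X" else "X"

def evaluate_openings(n_games=2000):
    """Estimate X's win rate for each opening square from random self-play."""
    scores = {}
    for move in range(9):
        board = [" "] * 9
        board[move] = "X"
        wins = sum(random_playout(board, "O") == "X" for _ in range(n_games))
        scores[move] = wins / n_games
    return scores

if __name__ == "__main__":
    for move, rate in sorted(evaluate_openings().items(), key=lambda kv: -kv[1]):
        print(f"opening square {move}: X wins {rate:.0%} of random playouts")
    # The center square typically scores best -- the program "learned" that
    # from the simulated games alone, with no expertise coded in.
```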

If we think about actual election outcomes, there aren’t enough observations to expect accurate forecasts. If each presidential election is one observation, then there have only been about 50 since the country’s founding more than two centuries ago. No data scientist would want to work with 50 data points.

You can’t say “in the years when ‘defund the police!’ was associated with Democrats, the GOP presidential candidate gained among married women”. There has only ever been one presidential election when that occurred. Judging by what I have been observing of the DNC post-mortem on Twitter in the past week, that might not happen again. See this tweet for example:

I know very little about political analysis. Based only on what I know about data science, I would imagine that computers will get better at predicting the outcomes of races for the House of Representatives.

House representatives serve 2-year terms. There are over 400 House elections every 2 years.

Think about this over one decade of American history. There are actually more than 400 representatives in the House, but let’s imagine a “Shelter” of Reps with 400 members for ease of calculation.

In one decade, there are usually two presidential elections. That means we get 2 observations to learn from. In the same decade, there would be 400×5 “Shelter” elections. That yields 2,000 observations, which is considered respectable for the application of data mining methods.
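A rough sketch of that arithmetic, plus a toy model fit on entirely made-up district data, is below. The point is only that 2,000 rows are enough to estimate something while 2 rows are not; the feature names and numbers are invented for illustration.

```python
# Sketch of the observation-count arithmetic, plus a toy fit on synthetic
# district-level data. All features and numbers here are made up.
import numpy as np
from sklearn.linear_model import LogisticRegression

presidential_per_decade = 10 // 4          # roughly 2 presidential elections
house_per_decade = 400 * (10 // 2)         # 400 seats x 5 cycles = 2,000 races

print(presidential_per_decade, "presidential observations per decade")
print(house_per_decade, "House ('Shelter') observations per decade")

# Synthetic data: one row per House race, a few invented district features,
# and a binary outcome for whether the incumbent party held the seat.
rng = np.random.default_rng(0)
n = house_per_decade
X = rng.normal(size=(n, 3))                # e.g., partisan lean, turnout shift, fundraising gap
logits = 1.5 * X[:, 0] + 0.5 * X[:, 1] - 0.3 * X[:, 2]
y = rng.random(n) < 1 / (1 + np.exp(-logits))

model = LogisticRegression().fit(X, y)
print("in-sample accuracy on 2,000 synthetic races:", round(model.score(X, y), 2))
# With 2,000 rows a simple model can be estimated; with 2 rows it cannot.
```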

One application of such a forecasting machine would be to determine which slogans are the most likely to lead to success.

Introducing Students to Text Mining

I’m going to teach text mining in the upcoming week. Most of my students have never heard of it. We have spent the semester talking about what to do with structured data, which includes some of the basic concepts from traditional statistics.

I often ask them to think about what computers can do. We talk about why “data analytics” classes are happening in 2020 and did not happen in 1990. Hardware and software innovations have expanded the boundaries of what computers can do for us.

The gritty details of how text mining works can make for a boring lecture, so I’m going to use the following narrative to get intellectually curious students on board. It always helps to start with fighting Nazis. Alan Turing helped defeat the Nazis by using a proto-computer to crack codes. The same brilliant Turing realized that a computer could play chess someday (acknowledgement for me knowing that trivia: Average is Over). Turing didn’t live to see computers beat humans in chess but, in a sense, it didn’t take very long. Only about 50 years later, computers beat humans at chess.

Maybe chess is exactly the kind of thing that is hard for humans and easy for computers. When we discuss basic data mining, I tell students to think about how computers can do simple calculations much faster than humans can. It’s their comparative advantage.

Could Turing ever have imagined that a human seeking customer service from a bank could chat with a bot? Maybe text mining is a big advance over chess, but it only took about one decade longer for a computer (developed by IBM) to beat a human in Jeopardy. Winning Jeopardy requires the computer to get meaning from a sentence of words. Computers have already moved way beyond playing a game show to natural language processing.

How computers make sense of words starts with following simple rules, just as computers do to perform data mining on a spreadsheet of numbers. As I explain those rules to my students this week, I’m hoping that starting off the lecture with fighting Nazis will help them persevere through the algorithms.
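To give a taste of those simple rules, here is a minimal bag-of-words example using scikit-learn’s CountVectorizer: each document becomes a row of word counts, which is exactly the kind of spreadsheet of numbers the earlier data mining methods expect. The three sentences are just toy documents.

```python
# A minimal example of the "simple rules" that start text mining: turn each
# document into word counts (a bag of words), i.e., a spreadsheet of numbers.
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "Turing cracked codes and imagined chess-playing computers.",
    "Computers beat humans at chess about fifty years later.",
    "Text mining turns words into numbers computers can count.",
]

vectorizer = CountVectorizer()           # lowercases and splits on word boundaries
counts = vectorizer.fit_transform(docs)  # rows = documents, columns = words

# Print the resulting document-term matrix.
print(vectorizer.get_feature_names_out())
print(counts.toarray())
```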

Huge Prison Population in the U.S.

During some general reading on finance, I ran across the following two information-rich graphics from Hoya Capital on the U.S. prison population. On the first graph, the blue areas show the absolute numbers, and the green line shows the percent incarceration rate. A rate of 0.5% comes to 500 prisoners per 100,000 population.

This graph shows a huge rise in the state and federal prison population between 1980 and 2000. There seems to be general agreement that much of that increase in the prison population is due to mandatory sentencing laws, which require relatively long sentences. In particular, “three strikes and you’re out” laws may demand a life sentence for three felony convictions, if at least one of them is for a serious violent crime. Another factor was the increased criminalization of drug use (possession), in addition to drug dealing.

The graphic below shows the particular classes of crimes of which inmates of the state and federal prison systems have been convicted. The largest single category is violent crimes, but other types are significant, such as drug and property crimes, and “public order” crimes. Public order crimes include activities such as prostitution, gambling, alcohol, child pornography, and some drug charges. This graphic also includes the large number of people in local jails, most of whom are imprisoned awaiting trial or sentencing.

The total number of people under legal supervision in the U.S., including probation and parole, is over 6 million:

Source: Wikipedia

The U.S. has by far the largest official prison population in the world, and the highest incarceration rate. The following graph from Wikipedia depicts incarceration rates for several countries or regions as of 2009:

Most developed countries have incarceration rates of around 100-200 per 100,000, which is where the U.S. was in about 1970. The relatively high rate for Russia is attributed in large part to strict “zero tolerance” laws on drugs.

Again, the main driver of the high rates in the U.S. is long sentences, imposed by mandates. Wikipedia notes that there are other countries, including some in Europe, which have higher annual admissions to prison per capita than the U.S. However, “The typical mandatory sentence for a first-time drug offense in federal court is five or ten years, compared to other developed countries around the world where a first time offense would warrant at most 6 months in jail… The average burglary sentence in the United States is 16 months, compared to 5 months in Canada and 7 months in England.”

Policy debates on this topic continue. Obviously, we want to protect society from dangerous predators, but the direct and indirect costs to society for this level of incarceration are high. It seems like an area which is ripe for reform of some kind, though I do not claim to have a novel proposal.

Strange overnight switch in 2020 election betting markets

I share this tweet because it provides a good visual of a strange event last night. As results were coming in overnight in the US, there was a sudden huge reversal. For months the markets had predicted a Biden win. Throughout the night there was some wild speculation in which some buyers were willing to bet that Trump would win. Around the time respectable people start waking up in the US, the market flipped again. I hear some people saying on Twitter that they regret not buying during the night. Around 5 a.m. Eastern Time, Trump’s chance of winning went back down under 50%. That is also when new information came in showing that Biden would likely win Wisconsin.

At the time I write this, votes are still being counted. It is expected that the ballots still to be counted will mostly give votes to Biden. The “blue wave” did not materialize in 2020. If Joe Biden wins the presidential election, it will not be with the overwhelming mandate that some expected.

Economists often promote using betting markets to get predictions of the future. There are lots of applications beyond politics. These high profile elections bring attention to betting markets. Maybe people will begin relying more on them in other fields. I think betting on temperature increases and rising sea levels would be interesting and useful.

Gen Z on “The Social Dilemma”

Liza Thornell, an undergraduate student in my data analytics class, writes her reaction to the documentary about social media. My earlier post on the same topic is here.

A new Netflix documentary has recently caught the public’s attention. The Social Dilemma has drawn scrutiny to big tech companies and has also sparked the deletion of social media applications across the nation. The Social Dilemma explores the steps that were taken to create the powerhouse social media platforms, while exposing the negative effects the platforms have on their users.

Social media platforms, such as Facebook, gain the majority of their revenue through advertising. Due to its privacy settings (or lack thereof), Facebook is able to obtain astronomical amounts of data about its users. The data that is collected from social media platforms entices marketers because they can reach more customers for less money. These platforms market specific products based on the data that they collect from their users. According to The Social Dilemma, not only can social media market products to you, it can also mold and shape your opinions, changing your preferences over time. In The Social Dilemma, several former Silicon Valley executives explain why this can be bad.

According to the documentary, social media is harmful due to its design and how it affects mental health. The purpose of these apps is to hold your attention captive for as long as possible. In order to do so, it is important that the app keeps providing fresh new content that is relevant to your preferences. Everything we do on our phones becomes predictive data for what we will do next. The ideal situation for social media platforms is that you will end up down the ‘rabbit hole’ of specifically tailored content because the longer you spend scrolling and clicking through social media, the more data these platforms can collect about you. According to the Social Dilemma, the rabbit hole is a dangerous place to be.

Social media in its initial design was not created to be hurtful. It was made to bring people together. The Social Dilemma states that big tech companies have strayed from that original purpose. Now, social media is about holding its audience captive, molding public opinion, and increasing sales through tailored marketing. The content distracts us from the serious ramifications that can come from spending all of our free time scrolling instead of engaging with the world around us. It makes us crave instant approval at all times, from people and the media.

The Social Dilemma does not condemn social media. This documentary supports social media, just not the way it’s currently working. The biggest takeaway from this documentary is that we have gotten lost in the rabbit hole and have to find our way out to preserve our privacy.

From the standpoint of a college student, I found this documentary to be eye opening. I believe that the average college student struggles with being addicted to social media platforms and our attention spans are shortening as a result. Reliance on social media directly correlates with the decline of in-person interaction and interpersonal communication skills. Companies such as Apple have released features on phones that track screen time in order to enlighten consumers on their usage. The ability to track usage is a tool for managing and limiting social media use. For college students, cutting down on social media consumption can positively impact mental health and productivity.

Election Forecast by 538

I teach a data analytics course and I asked some students to write blogs on data and current events. This blog is by Jake Fischer.

Every four years, the United States seems to turn upside down with the Presidential election. Now, the nation has turned its eyes to predictive analytics to understand the future of our country. As of October 22, the time of writing this post, Joe Biden stands an 87% chance of winning the critical swing state of Florida. This seems like a significant margin, but how did we come to this understanding using data? How reliable is this FiveThirtyEight forecast?

For starters, the 87% chance of winning is based on a simulation run by data analysts across 40,000 different scenarios, each of which varies factors from voter turnout to demographics to the economic forecast of the day. The prediction also factors in the polling averages for each candidate from 8 different polls, each of which is given a grade of reliability and weighted accordingly. Hundreds of factors come into play when predicting an election, yet confidence in many of these numbers is at an all-time low. So, in answer to the second question, the outlook is anything but certain.
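To see where a number like “87%” can come from in principle, here is a heavily simplified sketch: assume a polling margin and an error distribution (both numbers below are hypothetical, not FiveThirtyEight’s actual inputs), simulate 40,000 scenarios, and count how often Biden comes out ahead.

```python
# A hedged sketch of the general idea, not FiveThirtyEight's actual model:
# turn a polling average and an assumed amount of error into a win probability
# by simulating many scenarios. Both input numbers below are hypothetical.
import numpy as np

rng = np.random.default_rng(538)

n_sims = 40_000
polling_margin = 3.4   # hypothetical Biden lead in Florida, percentage points
uncertainty = 3.0      # hypothetical std. dev. for combined polling/turnout error

simulated_margins = rng.normal(polling_margin, uncertainty, size=n_sims)
win_probability = (simulated_margins > 0).mean()

print(f"Biden wins in {win_probability:.0%} of {n_sims:,} simulated scenarios")
```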

This doubtful outlook is because, although Biden wins 87% of the simulated elections, that figure does not reflect the margin he wins by. Looking closely at the data, you see that over half of the outcomes within this 87% are decided by less than 1% of votes. Unfortunately, that does not leave much room for the margin of error required in most data analysis.

This very popular website does not factor in the impact that the website itself has on voters. With millions of people reading this data and seeing that Biden stands an 87% chance of winning, there is a high likelihood that some voters will simply not turn up at the polls. The share of voters who choose not to turn up because of analytics like this could significantly affect the result, and could even swing the outcome in the opposite direction, particularly when the decision is already being made by such a slim margin.

Even though data analysis has turned into a booming industry, with more accurate results than ever before, there are some instances in which predictive analytics can itself influence the outcome of important decisions, such as the presidential election. I say all of this not to cast doubt on analytics, nor on the credibility of the FiveThirtyEight organization, but rather to remind readers of the important factor that is the human condition. At the end of the day it is important to exercise your right to vote no matter what side of the aisle you stand on, and without allowing polling data to influence your decisions. Vote!

How Bayesians Read a Think Piece

How likely is it that an opinion critical of [topic] will get expressed by someone on the internet?

My good friend (call her Anne) texted me this week. Anne sent me a link to a blog that declared some of her preferred works of art (i.e. musicals) to be inferior. She loves art, so to be told that her tastes were not exceptionally good was disappointing.

In my reply I wanted to make sure that Anne wasn’t putting too much weight on this new evidence:

How should we incorporate blogs into our beliefs about reality? (I see the irony – I’m writing a blog right now.)

The non-technical summary: you should be skeptical of what you read online.

The technical summary: the fact that some writer said “H” on the internet, should make you only slightly more confident that “H” is true.

I can’t improve on the Wikipedia presentation of Bayes’ theorem, so I’ll just paste it in:

P(H|E) = ( P(E|H) x P(H) ) / P(E)

Let’s consider the probability that Anne’s favorite musical is bad. We’ll call that hypothesis “H”. What’s the probability of H, given that one person wrote an article stating that the musical is bad?

The evidence, E, is the article.

Instead of just evaluating whether the article is convincing or not, Bayesian inference requires that we consider

  1. Were we confident that H was true BEFORE seeing the article? Was there good data up until this point that convinced us H is true?
  2. If H is true, what’s the probability of this article being written?
  3. What’s the overall probability of this article being written, regardless of whether H is true?

The probability that the musical is bad, given that someone wrote an article saying so, is:

P(H|E) = P(bad|article)

P(bad|article) = ( P(article|bad) x P(bad) )/ P(article)

The numerator on the right side asks whether we are likely to see the article if the musical is bad. If the musical is actually bad, then we are likely to see it condemned in print, so P(article|bad) is large. HOWEVER, if our prior belief is that the musical is not bad, then P(bad) is small and the numerator gets smaller.

Finally, we consider the denominator, P(E) or the probability of seeing an article that is derogatory towards the musical. If that probability is high, then the probability of the musical actually being bad goes down.

Here’s how Anne should think:

P(bad|article) = ( likely that article will be written if bad x prior evidence suggests not bad) / snobby think pieces get written regardless

so

P(bad|article) = (big x small)/ big = small probability that Anne’s favorite musical is actually bad
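For anyone who wants to plug in numbers, here is a toy version of that calculation; the three probabilities are invented purely to illustrate the mechanics.

```python
# A toy numeric version of the update above. All three inputs are made-up
# numbers chosen only to show the mechanics.
p_bad = 0.10              # prior: years of enjoying the musical suggest it isn't bad
p_article_if_bad = 0.90   # a genuinely bad musical would very likely get panned online
p_article = 0.60          # snobby think pieces get written either way

p_bad_given_article = p_article_if_bad * p_bad / p_article
print(f"P(bad | article) = {p_bad_given_article:.2f}")   # 0.15 -- only a small update
```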

You should be just the right amount of skeptical when it comes to internet content. Be Bayesian.