Book Review: Big Data Demystified

Last year, our economics department launched a data analytics minor program. The first class is a simple 2-credit course called Foundations of Data Analytics. Originally, the idea was that liberal arts majors would take it and that this class would be a soft, non-technical intro to terminology and history.

However, it turned out that liberal arts majors didn’t take the class and that the most common feedback was that the class lacked technical challenge. I’m prepping to teach the class, and it will have two components. The first is a Python training component where students simply learn Python. We won’t do super complicated things, but they will use Python extensively in future classes. The second component is still in the vein of the old version of the course.

I’ll have the students read and discuss “Big Data Demystified” by David Stephenson. He spends 12 brief chapters introducing the reader to the importance of modern big data management, analytics, and how it fits into an organization’s key performance indicators. It reads like it’s for business majors, but any type of medium-to-large organization would find it useful.

Stephenson starts with some flashy stories that illustrate the potential of data-driven business strategies. For example, Target Corporation used predictive analytics to advertise baby and pregnancy products to mothers who didn’t even know that they were pregnant yet. He whets the appetite of the reader by noting that the supercomputers that could play Chess or Go relied on fundamentally different technologies.

The first several chapters of the book excite the reader with thoughts of unexploited potentialities. This is what I want to impress upon the students. I want them to know the difference between artificial intelligence (AI) and machine learning (ML). I want them to recognize which tool is better for the challenges that they might face and to see clear applications (and limitations).

AI uses brute force, iterating through possible next steps. There are multiple online tic-tac-toe AIs that keep track records. If a student can play the optimal set of strategies 8 games in a row, then they can get the general idea behind testing a large variety of statistical models and explanatory variables, then choosing the best.
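The brute-force idea can be sketched in a few lines of Python. This is only an illustration, not something taken from the book: it uses minimax search, a standard way to iterate through every possible next step, and the names `winner`, `value`, and `best_move` are made up for the example. The board is assumed to be a 9-character string read left to right, top to bottom, with a space for an empty square.

```python
from functools import lru_cache

# The eight ways to get three in a row on a 3x3 board indexed 0..8.
WIN_LINES = [(0, 1, 2), (3, 4, 5), (6, 7, 8),
             (0, 3, 6), (1, 4, 7), (2, 5, 8),
             (0, 4, 8), (2, 4, 6)]

def winner(board):
    """Return "X" or "O" if that player has three in a row, else None."""
    for a, b, c in WIN_LINES:
        if board[a] != " " and board[a] == board[b] == board[c]:
            return board[a]
    return None

@lru_cache(maxsize=None)
def value(board, player):
    """Outcome for X under perfect play (+1 win, 0 draw, -1 loss), with `player` to move."""
    w = winner(board)
    if w == "X":
        return 1
    if w == "O":
        return -1
    if " " not in board:
        return 0  # board is full with no winner: a draw
    other = "O" if player == "X" else "X"
    # Brute force: try every empty square and evaluate the resulting position.
    outcomes = [value(board[:i] + player + board[i + 1:], other)
                for i, cell in enumerate(board) if cell == " "]
    return max(outcomes) if player == "X" else min(outcomes)

def best_move(board, player):
    """Pick the empty square that leads to the best outcome for `player`."""
    other = "O" if player == "X" else "X"
    moves = [i for i, cell in enumerate(board) if cell == " "]
    pick = max if player == "X" else min
    return pick(moves, key=lambda i: value(board[:i] + player + board[i + 1:], other))

game_value = value(" " * 9, "X")            # value of the empty board
winning_square = best_move("XX OO    ", "X")  # X can complete the top row
```

With perfect play from both sides tic-tac-toe is a draw, so `game_value` comes out to 0. That is why a student who plays optimally can never lose to such a program.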

But ML is responsive to new data, according to what worked best on previous training data. There are multiple YouTubers out there who have used ML to beat Super Mario Brothers. Programmers identify an objective function and the ML program is off to the races. It tries a few things on a level, and then uses the training rounds to perform quite well on new levels that it has never encountered before.

There are a couple of chapters in the middle of the book that didn’t appeal to me. They discuss the question of how big data should inform a firm’s strategy and how data projects should be implemented. These chapters read like they are written for MBAs or for management. They were boring for me. But that’s ok, given that Stephenson is trying to appeal to a broad audience.

The final chapters are great. They describe the limitations of big data endeavors. Big data is not a panacea and projects can fail for a variety of what are very human reasons.

Stephenson emphasizes the importance of transaction costs (though he doesn’t say it that way). Medium-sized companies should outsource to experts who can achieve (or fail) quickly such that big capital investments or labor costs can be avoided. Or, if in-house staff will be hired instead, he discusses the trade-offs between using open source software, getting locked in, and reinventing the wheel. These are a great few chapters that remind the reader that data scientists and analysts are not magicians. They are people who specialize and can waste their time just as well as anyone else.

Overall, I strongly recommend this book. I kinda sorta knew what machine learning and artificial intelligence were prior to reading, but this book provides a very accessible introduction to big data environments, their possible uses, and organizational features that matter for success. Mid- and upper-level managers should read this book so that they can interact with these ideas prudently. Those with a passing interest in programming should read it for greater clarity and to get a better handle on the various sub-fields. Hopefully, my students will read it and feel inspired to be on one side or the other of the manager-data analyst divide with greater confidence, understanding, and a little less hubris.

Everyone’s an Expert: Easy Data Maps in Excel

I love data, I love maps, and I love data visualizations.

While we tend not to remember entire data sets, we often remember some patterns related to rank. Speaking for myself anyway, I usually remember a handful of values that are pertinent to me. If I have a list of data by state, then I might take special note of the relative ranking of Florida (where I live), the populous states, Kentucky (where my parents’ families live), and Virginia (where my wife’s family lives). I might also take special note of the top rank and the bottom rank. See the below table of liquor taxes by State. You can easily find any state that you care about because the states are listed alphabetically.

A ranking is useful. It helps the reader to organize the data in their mind. But rankings are ordinal. It’s cool that Florida has a lower liquor tax than Virginia and Kentucky, but I really care about the actual tax rates. Is the difference big or small? Like, should I be buying my liquor in one of the other states in the southeast instead of Florida? Without knowing the tax rates, I can’t make the economic calculation of whether the extra stop in Georgia is worth the time and hassle. So, the most useful small data sets will have both the ranking and the raw data. Maybe we’re more interested in the rankings, such as in the below table.

But tables take time to consume. A reader might immediately take note of the bottom and top values. And given that the data is not in alphabetical order, they might be able to quickly pick out the state that they’re accustomed to seeing in print. But otherwise, it will be difficult to scan the list for particular values of interest.


In Praise of the FRED Excel Add-in

Sometimes, large entities have enough money to throw at a problem that by sheer magnitude they produce something great (albeit at too high a cost). The iPhone app from FRED is not that thing. But the Excel add-in is something that every macroeconomics professor should consider adding to their toolkit.

Personally, I include links to FRED content in the lecture notes that I provide to students. But FRED makes it easy to do so much more. They now have an add-in that makes accessing data *much* faster. With it, professors can demonstrate transformations in Excel that students can easily replicate. The advantage is that students can learn to access and transform their own data rather than relying on links that I provide them.

The tool is easy enough to find – FRED wants you to use it. After that, the installation is largely automatic.

Once the add-in is installed, you will see the new ribbon option below in Excel. It’s very user friendly.


Russia, The US, and Crude Data

Overall, I’ve been disappointed with the reporting on the US embargo against Russian oil. The AP reported that the US imports 8% of Russia’s crude oil exports. But then they and other outlets list a litany of other figures without any context for relative magnitudes. Let’s shine some more light on the crude oil data.*

First, the 8% figure is correct – or, at least it was correct as of December of 2021. The figure below charts the last 7 years of total Russian crude oil exports, US imports of Russian crude oil, and the share of those exports that US imports compose. That 8% figure is by no means representative of recent history. The average US share in 2015-2018 was 7.8%. But the US share has since risen in level and volatility. Since 2019, US imports have composed an average of 11.9% of all Russian crude oil exports.

As an exogenous shock, the import ban on Russian crude oil might have a substantial impact on Russian exports. However, many of the world’s oil importers were already refusing Russian crude. The US ban may not have a large independent effect on Russian sales and may be a case of Congress endorsing a policy that’s already in place voluntarily.


Reading Literacy Data

The story that I’ve heard is this:

            In the US, we care about education. We believe that all people should receive one, regardless of their family status. Therefore, states provide education directly.

There you have it. We provide education in the US so that everyone gets a fairer shake. We might disagree about the purpose of an education. Maybe it’s for improved job prospects, for a more informed citizenry, or for more unified values and experiences. One socially awkward answer is that state schools are, in part, a childcare service that permits parents to work. Beyond these reasons, school provision and compulsory education should, at the very least, increase literacy. That’s a low bar.

Given the above reasoning, states began to pass compulsory schooling legislation. Massachusetts was first in 1852, followed by DC and Vermont in the 1860s. Thirteen more adopted compulsory education legislation by 1880. By the year 1900, most states had compulsory schooling legislation on the books that was applicable to at least some age groups. See the figure. Thus did the US achieve more equality, or so the story goes.

The reasoning behind the story is sound. Without education of some sort, people will surely have less human capital. The vulnerability of the reasoning is that formal schooling is not the only form of education. A person who doesn’t attend school may help a parent at work or have a private tutor – or simply grow in a milieu of thoughtful exposure. Therefore, requiring that a child attend school may not improve human capital by a degree greater than what the child would have been doing otherwise. That’s an empirical matter.

The figure below illustrates literacy for ‘white’ people between the ages of 20 and 30. Why that interval? At the lower end, we don’t have literacy data for people under the age of 20 in 1850 & 1860. On the higher end, any effects of compulsory schooling will only affect those who were children and subject to the law – older people are immune to compulsory schooling legislation.

The graph illustrates that state literacy rates were rising throughout the period. The main exception is 1870. Maybe the demands of the Civil War caused children to work at home or otherwise and forgo schooling. So the increase from 1870 to 1880 is more of a catch-up to a previous trend than anything else. While it’s true that several states passed compulsory schooling laws in the 1870s, that doesn’t explain the widespread literacy improvements across most states.

After 1860, we can examine the younger people who were subject to the schooling laws. The figure below for people ages 10-20 tells a similar story to the one above.

My biased reading of the data is that initial compulsory schooling laws had at best an ambiguous effect on the overall trend of improving literacy. I’ll delve deeper in future posts.

PS – The literacy data is from IPUMS.

PPS – The compulsory schooling law dates are allegedly from “Department of Education, National Center for Educational Statistics, Digest of Education Statistics, 2004.” But I couldn’t find the original source. Kudos to anyone in the comments who can find it.

Age-Old Dads

Do you know how old you are?

I’m 33. Specifically, I’m 33 years, 29 days old. I don’t know the time of day that I was born, but my mom probably remembers within a couple of hours. My dad did not keep track of my age. Growing up, it was normal for him to take me to a sports registration event and need to ask me for plenty of my details in order to complete the paperwork.

Do you know the age of your children? Is it normal for parents to lose track? Or is it just the dads? …Or just my dad? I have no idea what is typical.

But I do have some decent evidence that, had my dad lived in 1850, he would not have been such an anomaly. Consider exhibit A: A histogram of US ages in 1850. The population was only about 23 million at the time and we have the age for about 19 million of those people. So the graph is relatively representative (IPUMS census data).

Do you notice anything weird about the graph?

That’s the question I asked my Western Economic History class.


Aggregate Demand Regimes

Is inflation correlated with output growth?

Consider the AD-AS model, which is often expressed in growth rates. Economists will often say that the aggregate supply curve is flatter in the short run and vertical in the long run. In other words, aggregate demand policy can have SR output effects, but has only price effects in the LR. Sounds good.

But there is a lot of baggage hiding behind “can have effects”. Often we’ll say that lackadaisical businesses cause a flatter SRAS and that businesses with rational expectations have a vertical one. Also sounds good.

What causes the steepness of the SR supply curve? I’m sure that there are multiple determinants in regard to expectations. Here’s what got me on this topic. David Andolfatto shared the below graph and asked “Does lowflation necessarily mean low growth?”

Good question. My answer includes expectations concerning the monetary policy regime. Specifically, my answer was “It does in a regime of volatile and uncertain nominal income. Surprise AD growth pushes us up the SRAS.” Andolfatto called me out and in the right way, asking “What’s the evidence for this?”

[…crickets…]

I had no evidence. I had the AS-AD model in hand and some logic – but no evidence. My logic is as follows. In a monetary regime that includes a constant rate of AD growth, output and price growth are inversely correlated. If NGDP grows at 5% always, then inflation falls when output growth rises. In other words, AD is exactly what people expect – illustrated as a vertical SRAS curve.
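To make that logic concrete, here is a minimal Python sketch. It leans on the usual growth-rate approximation that NGDP growth is roughly inflation plus real output growth, and every number in it is invented for illustration. The second regime is a stylized assumption, not a result: it supposes that each AD surprise is split evenly between output growth and inflation, which is one simple way to represent an upward-sloping SRAS.

```python
import random

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

rng = random.Random(0)  # fixed seed so the run is repeatable

# Regime 1: NGDP growth is always 5%. Real growth moves around for
# supply-side reasons, so inflation is whatever is left over.
NGDP_GROWTH = 5.0
real_growth_stable = [rng.gauss(2.5, 1.5) for _ in range(40)]
inflation_stable = [NGDP_GROWTH - g for g in real_growth_stable]

# Regime 2 (assumption): AD growth is erratic, and each surprise is split
# evenly between output growth and inflation.
ad_surprises = [rng.gauss(0.0, 3.0) for _ in range(40)]
real_growth_erratic = [2.5 + 0.5 * s for s in ad_surprises]
inflation_erratic = [2.5 + 0.5 * s for s in ad_surprises]

corr_stable = pearson(real_growth_stable, inflation_stable)
corr_erratic = pearson(real_growth_erratic, inflation_erratic)
print(f"stable AD:  corr = {corr_stable:+.2f}")
print(f"erratic AD: corr = {corr_erratic:+.2f}")
```

By construction the first correlation is exactly negative and the second exactly positive. Real data would be far noisier, since supply shocks and demand surprises arrive together.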

However, expectations are different in a regime of erratic AD. Let’s say that the rate of AD growth is unknown, but that the variance is known. If this is the world that you live in, then you make hay when the sun shines. Businesses sell more in periods of higher income. And, because they’re marching up the marginal cost curve, prices also rise. Alternatively, it may be that output growth is inflexible and prices rise as goods are rationed.

Regardless of the truth, the above explanation is just story-telling. I had no evidence. What would the evidence even be? Here’s what I settled on. First, let’s express the AS-AD model in quarterly growth rates. In order to get a handle on monetary regime AD variance, I calculated the standard deviation of the NGDP growth rate by Fed Chair. Presumably, the Fed chair has a decent amount to do with monetary policy and the rear that occupies that chair is an indicator of when a regime begins and ends. I calculated the correlation between the GDP deflator and RGDP growth rates by regime. Below is the scatter plot.
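The by-regime calculation can be sketched as follows. This is not the code behind the scatter plot, just a minimal illustration of the two statistics described above. The row layout, the function name `regime_summary`, the regime labels "A" and "B", and all of the numbers are hypothetical; real inputs would be quarterly growth rates of NGDP, the GDP deflator, and RGDP tagged with the sitting Fed chair.

```python
from collections import defaultdict
from statistics import stdev

def pearson(xs, ys):
    """Pearson correlation; returns None if either series has no variation."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return None
    return cov / (var_x * var_y) ** 0.5

def regime_summary(rows):
    """Map each regime to (std dev of NGDP growth, corr of inflation and RGDP growth).

    Each row is (regime, ngdp_growth, deflator_growth, rgdp_growth).
    """
    by_regime = defaultdict(list)
    for regime, ngdp, deflator, rgdp in rows:
        by_regime[regime].append((ngdp, deflator, rgdp))
    summary = {}
    for regime, obs in by_regime.items():
        ngdp = [o[0] for o in obs]
        deflator = [o[1] for o in obs]
        rgdp = [o[2] for o in obs]
        summary[regime] = (stdev(ngdp), pearson(deflator, rgdp))
    return summary

# Made-up quarterly growth rates for two hypothetical regimes.
rows = [
    ("A", 5.0, 3.0, 2.0), ("A", 5.0, 2.0, 3.0), ("A", 5.0, 1.0, 4.0),
    ("B", 2.0, 1.0, 1.0), ("B", 6.0, 3.0, 3.0), ("B", 10.0, 5.0, 5.0),
]
summary = regime_summary(rows)
for regime, (ngdp_sd, corr) in summary.items():
    print(regime, round(ngdp_sd, 2), round(corr, 2))
```

Each regime then becomes one point on the scatter plot: NGDP growth volatility on one axis and the inflation-output correlation on the other.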

What does it tell us? It tells us that regimes of stable AD growth experience a negative correlation between inflation and output growth. It also tells us that AD growth volatility is associated with a positive correlation between inflation and output growth. So, does lowflation necessarily mean low growth? It does in a regime of volatile and uncertain nominal income.

(Of course this is all casual. It makes sense to me at first blush though. Having said that, the line of best fit also looks like it’s driven by the 2 extremely variable times: McCabe & Powell.)