The US government is great at collecting data, but not so good at sharing it in easy-to-use ways. When people try to access these datasets they either get discouraged and give up, or spend hours getting the data into a usable form. One of the crazy things about this is all the duplicated effort: hundreds of people might end up spending hours cleaning the data in mostly the same way. Ideally the government would just post a better version of the data on its official page. But barring that, researchers and other “data heroes” can provide a huge public service by publicly posting datasets that they have already cleaned up, and some have done so.
I just added a data page to my website that highlights some of these “most improved datasets”:
- The IPUMS versions of the American Community Survey, Current Population Survey, and Medical Expenditure Panel Survey
- The County Business Patterns Database, harmonized by Fabian Eckert, Teresa C. Fort, Peter K. Schott, and Natalie J. Yang
- Code for accessing the Quarterly Census of Employment and Wages by Gabriel Chodorow-Reich
- The merged Statistics of US Business, my own attempt to contribute
I hope to keep adding to this page as I find other good sources of unofficial/improved data, and as I create them (one of my post-tenure goals). See the page for more detail on these datasets, and comment here if you know of existing improved datasets worth adding, or if you know of needlessly terrible datasets you think someone should clean up.
This will be a great resource. From my neck of the woods, the federal government is spending a lot of bureaucracy trying to systematize open data standards from its science agencies, and different groups are adopting their own standards, but the top-down and bottom-up approaches have not yet met in the middle.
I’m impressed with the World Bank’s open data efforts. That’s the gold standard in my mind.