Project Assignment I: Data Sources

It is time to start thinking about your final project. As I said at the beginning of the course, the purpose of the project is to give you a chance to work on a problem of your own choosing.

In this assignment I would like you to explore data sources that you may use for your project. You should turn in a one-page description of one to three sources that you explored and found interesting. For each source describe the data that is available and what sorts of questions could be answered with that data. Note that this project may be done individually, or in teams of 2 or 3. This assignment is not graded. The purpose is to provide early feedback on your ideas for the project.

Business data is valuable and therefore often private. You will need to be creative in using publicly available sources. The data does not have to be generated by a business to be relevant to business. For example, Department of Transportation data on airlines or SEC data on company filings can provide highly relevant insights.

Below is a list of links with possible data sources. This list is only to get you started. Keep in mind that the most interesting analyses come from creative combinations. Don’t plan on downloading just one file: your data will most likely be a combination of several data sets that you will slice, dice, and reshape.

  1. The data editor at BuzzFeed maintains a spreadsheet entitled Data is Plural with links to many cool data sources (though not many are directly business data).

  2. NY State Open Data is a pretty good source of a variety of datasets. From a business perspective, data on the various business licenses could be useful. Every state has its own open data platform, as do some cities (e.g. check out NYC or Boston).

  3. The marketing firm dunnhumby has a few incredibly interesting data sets from the retail industry.

  4. Quandl is a well-done data aggregator that works incredibly well with R. It has lots of financial and macroeconomic data.

  5. Data.gov lists all U.S. government-generated data. For example, this data on hospital charges and this data on consumer complaints to the Consumer Financial Protection Bureau (CFPB) beg to be analyzed.

  6. The Centers for Medicare & Medicaid Services (CMS) has lots of interesting data sets, including these on Medicare charges by providers, who prescribes which drugs, drug pricing, physicians, private Medicare plans, and small group and individual health insurance plans.

  7. Home Mortgage Disclosure Act (HMDA) data provides useful information on the home mortgage market in the U.S.

  8. Air traffic data from the Department of Transportation.

  9. Information on private pension plans from the Department of Labor.

  10. The data science competition platform Kaggle has a number of very interesting data sets, some of which are relevant to business. You can browse the competitions or datasets.

  11. Commodity flow data has fascinating information on the flow of goods across the U.S.

  12. The business data section of the University of California, Irvine (UCI) Machine Learning Repository.

  13. Amazingly comprehensive data on European soccer, including match results and betting odds.
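
As noted earlier, your project data will most likely be a combination of several data sets that you join and reshape. The sketch below illustrates that idea in Python using only the standard library. Everything in it is hypothetical: the two small CSV tables, the carrier codes and names, the column names, and the helper functions `read_rows`, `join_on`, and `pivot` are made up for illustration and do not come from any of the sources listed above.

```python
import csv
import io
from collections import defaultdict

# Hypothetical stand-ins for two downloaded files; all values are made up.
FLIGHTS_CSV = """carrier,month,delayed_flights
XA,1,120
XA,2,95
XB,1,80
XB,2,70
"""

CARRIERS_CSV = """carrier,name
XA,Example Air
XB,Sample Airways
"""


def read_rows(text):
    """Parse CSV text into a list of dicts keyed by the header row."""
    return list(csv.DictReader(io.StringIO(text)))


def join_on(left, right, key):
    """Inner join of two row lists; assumes `key` is unique in `right`."""
    lookup = {row[key]: row for row in right}
    return [{**row, **lookup[row[key]]} for row in left if row[key] in lookup]


def pivot(rows, index, column, value):
    """Reshape long rows into a nested dict: index -> column -> integer value."""
    table = defaultdict(dict)
    for row in rows:
        table[row[index]][row[column]] = int(row[value])
    return dict(table)


flights = read_rows(FLIGHTS_CSV)
carriers = read_rows(CARRIERS_CSV)

# Combine the two sources, then reshape to one entry per carrier name.
merged = join_on(flights, carriers, "carrier")
wide = pivot(merged, index="name", column="month", value="delayed_flights")

for name, by_month in wide.items():
    print(name, by_month)
```

In a real project the two tables would come from different sources, and finding a key they share (a carrier code, a ZIP code, a date) is often where the creative work lies.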