Free datasets to download and practice with in R

This is the fifth post in a series on how to build a Data Science Portfolio. You can find links to the other posts in this series at the bottom of the post.

Luckily, there are online repositories that curate data sets and mostly remove the uninteresting ones. There are a few considerations to keep in mind when looking for a good data set for a data visualization project.

News sites that release their data publicly are a good place to find data sets for data visualization projects. FiveThirtyEight is an incredibly popular interactive news and sports site started by Nate Silver. FiveThirtyEight makes the data sets used in its articles available online on GitHub.

Socrata OpenData is a portal that contains multiple clean data sets that can be explored in the browser or downloaded to visualize. A significant portion of the data is from US government sources, and many of the data sets are outdated. You can explore and download data from OpenData without registering, and you can use its visualization and exploration tools to examine the data in the browser.

Sometimes you just want to work with a large data set. You might use tools like Spark or Hadoop to distribute the processing across multiple nodes. There are a few things to keep in mind when looking for a good data set for a data processing project. Cloud hosting providers like Amazon and Google are a good place to find large public data sets. They have an incentive to host the data sets, because you are likely to analyze the data on their infrastructure and pay them for it.

Amazon makes large data sets available on its Amazon Web Services platform. You can download the data and work with it on your own computer, or analyze the data in the cloud using EC2 and Hadoop via EMR. You can read more about how the program works here. Amazon has a page that lists all of the data sets for you to browse. Google likewise lists all of its data sets on a page.

Wikipedia is a free, online, community-edited encyclopedia. It contains an astonishing breadth of knowledge, with pages on everything from the Ottoman-Habsburg Wars to Leonard Nimoy. Additionally, Wikipedia offers edit history and activity, so you can track how a page on a topic evolves over time, and who contributes to it. You can find the various ways to download the data on the Wikipedia site.

In order to be able to do this, we need to make sure that:

There are a few online repositories of data sets that are specifically for machine learning.

These data sets are typically cleaned up beforehand, so you can test algorithms on them very quickly. Kaggle is a data science community that hosts machine learning competitions. There are a variety of interesting, externally contributed data sets on the site. Kaggle has both live and historical competitions. You can download data for either, but you have to sign up for Kaggle and accept the terms of service for the competition.

You can download data from Kaggle by entering a competition. Each competition has its own associated data set. There are also user-contributed data sets found in the new Kaggle Data sets offering. Although the data sets are user-contributed, and thus have varying levels of documentation and cleanliness, the vast majority are clean and ready for machine learning to be applied.

The UCI Machine Learning Repository is a great first stop when looking for interesting data sets.

Quandl is a repository of economic and financial data. Some of this information is free, but many data sets require purchase. Quandl is useful for building models to predict economic indicators or stock prices.

Sometimes, it can be very satisfying to take a data set spread across multiple files, clean them up, condense them into one, and then do some analysis. In data cleaning projects, sometimes it takes hours of research to figure out what each column in the data set means. These types of data sets are typically found on aggregators of data sets.
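As a rough illustration of the "spread across multiple files, clean up, condense into one" workflow, here is a minimal sketch in Python (the same steps carry over to R). It assumes every file shares the same header row. The function name combine_csv_parts and the tiny sample data are made up for this example and do not come from any of the sources listed in this post.

```python
import csv
import io


def combine_csv_parts(parts):
    """Merge several CSV texts that share a header into one list of clean rows."""
    combined = []
    seen = set()
    for text in parts:
        reader = csv.DictReader(io.StringIO(text))
        for row in reader:
            # Trim stray whitespace; ignore overflow fields (stored under key None).
            cleaned = {
                key.strip(): (value or "").strip()
                for key, value in row.items()
                if key is not None
            }
            # Skip rows that are completely empty after cleaning.
            if not any(cleaned.values()):
                continue
            # Skip exact duplicates that show up in more than one file.
            fingerprint = tuple(sorted(cleaned.items()))
            if fingerprint in seen:
                continue
            seen.add(fingerprint)
            combined.append(cleaned)
    return combined


# Hypothetical sample data standing in for two downloaded files.
part_a = "station,day,reading\nnorth,1,10\nsouth,1,12\n"
part_b = "station,day,reading\n south ,1,12\neast,1,9\n,,\n"

rows = combine_csv_parts([part_a, part_b])
for row in rows:
    print(row)
```

With real downloads you would read each file from disk instead of from a string, and most of the effort usually goes into working out what each column means before deciding which rows count as duplicates.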

These aggregators tend to have data sets from multiple sources, without much curation.

Here are a handful of sources for data to work with. All of the datasets listed here are free for download. If you want more, it's easy enough to do a search.

World Bank Data - Literally hundreds of datasets spanning many decades, sortable by topic or country. This is an outstanding resource.

Gapminder - Hundreds of datasets on world health, economics, population, etc. All of it is viewable online within Google Docs, and downloadable as spreadsheets.

Most of these datasets come from the government.

Kaggle - Kaggle is a site that hosts data mining competitions. Each competition provides a data set that's free for download.

This list has several datasets related to social networking. Lots of fun in here!

Million Song Dataset - This is a collection of audio features and metadata for a million contemporary popular music tracks.

Energy Information Administration - This site offers a number of datasets on energy production, consumption, sources, etc.

Reddit Datasets - This last one isn't a dataset itself, but rather a social news site devoted to datasets. It's updated regularly with news about newly available datasets.


