Blog: Education by the Numbers

Clean, clean, clean before you crunch those big data sets

Anyone interested in how data science might transform education should read "The Dirty Little Secret of Big Data Projects." David Dietrich, an impressive data geek consultant at EMC’s education unit who’s been involved with a big data lab at MIT, wrote that 80% of your time on a data project will be spent on the tedious, unsexy task of cleaning up the data. Often, people are so excited to start crunching their data that they end up with wrong answers because they haven’t cleaned up and prepared their data properly.

If you want to try some data cleaning at home, Dietrich suggests that unsophisticated types (such as myself) should tinker with these two tools: OpenRefine (formerly Google Refine) and Data Wrangler (from Stanford).
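
For the curious, here is a rough idea of what "cleaning" can mean in practice. This is only a minimal sketch in Python, not how either tool works under the hood and not Dietrich's method. The school names, the column layout and the `clean_rows` helper are all made up for illustration. It trims stray spaces, makes capitalization consistent, and drops duplicate rows and rows with no usable score.

```python
import csv
import io

# Hypothetical messy data (invented for this example): stray whitespace,
# inconsistent capitalization, a duplicate row, and missing scores.
RAW = """school,grade,score
 Lincoln High ,9,88
lincoln high,9,88
Roosevelt Middle,7,
Roosevelt Middle,8,n/a
Jefferson Elementary,4, 92
"""


def clean_rows(text):
    """Return cleaned rows, skipping duplicates and rows without a numeric score."""
    seen = set()
    cleaned = []
    for row in csv.DictReader(io.StringIO(text)):
        # Collapse extra spaces and normalize capitalization of the name.
        school = " ".join(row["school"].split()).title()
        grade = row["grade"].strip()
        score = row["score"].strip()
        # Skip rows whose score is blank or not a whole number (e.g. "n/a").
        if not score.isdigit():
            continue
        # Skip rows that repeat one we have already kept.
        key = (school, grade, score)
        if key in seen:
            continue
        seen.add(key)
        cleaned.append({"school": school, "grade": int(grade), "score": int(score)})
    return cleaned


for cleaned_row in clean_rows(RAW):
    print(cleaned_row)
```

Whether to throw out a row with a missing score or keep it and flag it is a judgment call that depends on the question you are asking. Decisions like that are a big part of why the cleaning takes so long.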
