I am in spring cleaning mode this week and plenty of projects around the house need attention. Now that the sun is out, I can see how dirty my windows really are. In addition to physical cleaning, I am also trying to clean up my files and data, and I would encourage you to do the same. Just as January is for resolutions, the arrival of spring is a good reminder to clean.
There is a lot of talk about big data and the potential for new insights through careful analysis. What we don’t talk about enough is the fact that these brilliant insights will not be possible unless we organize and cleanse the data that we have. The biggest problems are missing data, inaccurate data, and redundant data. Until we clean up these problems the results of our analyses will continue to be flawed.
If you work with customer records, medical records, financial records, or other critical data, you should be scrubbing constantly. The rest of us should do a thorough cleaning at least once a year. It really all comes down to trust. Do I trust the results I am getting and do I trust the underlying data? If not, it is time to clean.
Information professionals say “garbage in, garbage out.” This is especially applicable to missing data. For example, a form prompts customers to supply their name, address, city, state, and zip code. If some customers fail to provide their zip code, you can never sort accurately on that field. If you wanted to send out advertising to a select geographic location based on zip code, you could not. For this task, your data is incomplete and therefore useless. Enforcing strict rules on incoming data, such as refusing a form until every required field is filled in, can alleviate this problem.
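As a rough illustration of a strict rule on incoming data, here is a minimal Python sketch that checks a submitted form for blank required fields before the record is stored. The field names and the `missing_fields` and `accept` helpers are assumptions made up for this example, not part of any particular product.

```python
# Hypothetical field names, chosen to match the form described above.
REQUIRED_FIELDS = ("name", "address", "city", "state", "zip_code")


def missing_fields(record):
    """Return the required fields that are absent or blank in a record."""
    missing = []
    for field in REQUIRED_FIELDS:
        value = record.get(field)
        if value is None or not str(value).strip():
            missing.append(field)
    return missing


def accept(record):
    """Reject a record at the door instead of storing it with holes."""
    missing = missing_fields(record)
    if missing:
        raise ValueError("missing required fields: " + ", ".join(missing))
    return record


customer = {
    "name": "Pat Smith",
    "address": "1 Main St",
    "city": "Eugene",
    "state": "OR",
    "zip_code": "  ",  # blank, so the record is incomplete
}
print(missing_fields(customer))
```

Rejecting the record when it arrives is usually cheaper than trying to hunt down the missing zip code later.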
Inaccurate data is even worse than missing data. With missing data, you can see where you have holes even if you cannot sort on that information. With inaccurate data, you could be happily marching down the yellow brick road and not know how bad your results are. You may not even know the extent of the problem. The key to accurate data is to put filters in place that check incoming data for accuracy, valid values, and values in the correct field.
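A filter of this kind might look something like the following Python sketch, assuming US-style addresses. The patterns and the `find_problems` helper are illustrative assumptions. They only check that a value has the right shape, not that the zip code actually exists, so a real filter would likely also compare values against a reference list.

```python
import re

# Assumed formats: 5-digit zip or ZIP+4, and a two-letter state abbreviation.
ZIP_PATTERN = re.compile(r"\d{5}(-\d{4})?")
STATE_PATTERN = re.compile(r"[A-Z]{2}")


def find_problems(record):
    """Return a list of human-readable problems found in one record."""
    problems = []
    zip_code = str(record.get("zip_code", "")).strip()
    state = str(record.get("state", "")).strip().upper()

    if not ZIP_PATTERN.fullmatch(zip_code):
        problems.append("zip_code: not a 5-digit or ZIP+4 value")
    if not STATE_PATTERN.fullmatch(state):
        problems.append("state: not a two-letter abbreviation")
    # A zip-shaped value in the state field suggests swapped fields.
    if ZIP_PATTERN.fullmatch(state):
        problems.append("state: looks like a zip code (wrong field?)")
    return problems


print(find_problems({"state": "OR", "zip_code": "97403"}))
print(find_problems({"state": "97403", "zip_code": "OR"}))
```

The second record has its state and zip code swapped, which is exactly the kind of error that sorts and reports will not reveal on their own.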
Another problem is redundant data. This can come from poor version control or from not replacing old values with newer ones. As an example, think about your personal digital photo storage. How many times have you stored the same photo? If you are anything like me, you have a copy on your phone, your computer, possibly your tablet, and one or two memory cards. The good news is that if a device ever fails, you have plenty of backup sources. The bad news is that you have created redundant data. With the introduction of cloud computing, we should be able to sync everything to the cloud and have one clean, filtered copy of everything. Unfortunately, there seem to be some lingering trust issues with the cloud, but hopefully we can get beyond that.
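For the photo example, one common way to spot exact copies is to hash each file's contents and group files that share a hash. The Python sketch below assumes all the copies have been gathered under one folder. The `file_digest` and `find_duplicates` names are made up for this example, and the approach only catches byte-for-byte copies, not resized or edited versions of the same photo.

```python
import hashlib
from collections import defaultdict
from pathlib import Path


def file_digest(path, chunk_size=1 << 20):
    """Hash a file's contents in chunks so large photos fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_duplicates(root):
    """Return groups of files under root that have identical contents."""
    groups = defaultdict(list)
    for path in Path(root).rglob("*"):
        if path.is_file():
            groups[file_digest(path)].append(path)
    # Keep only the hashes that more than one file shares.
    return [sorted(paths) for paths in groups.values() if len(paths) > 1]
```

Calling `find_duplicates("Pictures")` on a folder of your own (the folder name here is only a placeholder) returns a list of groups, and each group holds the paths of files that are identical. Deciding which copy to keep is still up to you.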
Big data can get out of control quickly without well-thought-out strategies for input, organization, and cleansing. This year, as part of your spring cleaning, identify the areas where you have dirty data and vow to get it under control before it controls you.
Do you have any advice for cleaning big data and keeping it clean? Are there any products that have worked well for you? Cleaning data is harder than cleaning windows but the results can be just as bright.
Kelly Brown is an IT professional and assistant professor of practice for the UO Applied Information Management Master’s Degree Program. He writes about IT and business topics that keep him up at night.