Personal Data Curation – Part 1

Data. Data data data. Data is what I sift through on a daily basis. I process it, I search through it for clues, and I use it to jog my memory. I can’t get away from it. I can’t hate it. Data is, data was, and data will be.

I have a different view of data than most. I have learned that after countless discussions with people. I believe that my personal data is precious and needs to be saved. I have emails going back seven years (I really wish they went back about 16 years, to when I started using email). I have close to 15k pictures on my Flickr account. I’m working out a way to save personal videos now that I have a kid. Why do I have this data? It is history. It may not be important history or something 99% of the world would ever care to see. It is mine.

People have asked, “Why don’t you just trash the stuff you don’t need?” Well, let’s look at work email. I routinely go back and find something from a few years ago that I can reuse or reference. From a practical standpoint, that alone justifies its storage for me. But that’s not really why I save it. I save it for the generations that come after me.

No matter how inane and boring, I would love to have every single letter that my great-grandfather wrote. I think the insight would be fascinating. There would be stories you couldn’t get any other way, simply because he wouldn’t have passed them on or retold them otherwise. I want my great-grandchildren to have that same experience when they go through my stuff. There are some issues, though. How do you make the data practical for yourself in the present, while also making it viable for those in the future? Some people think those are the same thing. They are not.

I’m someone who used to have literally hundreds of folders in my email archive. I would organize things down to the exact minutiae of what they were for. After a while I realized that in 99% of scenarios I would just search through the database of email instead of drilling down into folders. What the folders did do is make email syncing a pain in the butt. They also made finding something quickly a pain in the butt anywhere except Gmail search. I would think, “Well, I may have referenced this thing, so let’s look over here.” This was a bust. I also had multiple email accounts from past jobs saved in different mail files, so I would have to open multiple PST files and hope the message was in this one here and not the one over there. It was painful, to say the least.

So this year I made a master mail file. It literally took weeks to get done. I had to copy things from offline to online, and then back again. I had to track down all my different mail files. I had to de-duplicate multiple times after the merge. I had downloaded my full mail so many times, into different mail files that each held some unique messages, that I locked Gmail up a few times with all the access attempts the software made. In the end, I finally have a single usable mail archive file that should not have any duplicates. It’s only 6.5 GB, but I think I went on a mass run about 3-4 years ago tearing out large emails (I feel really bad about that now).
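For anyone curious what the de-duplication step boils down to, here is a minimal sketch in Python. It is not the software I used. It assumes the messages have already been loaded as standard `email.message.Message` objects (for example from an mbox export), and it assumes that two copies of the same email share a `Message-ID` header. The function names and the fallback of hashing a few headers are my own illustration.

```python
import hashlib
from email import message_from_string
from email.message import Message
from typing import Iterable, Iterator


def dedupe_key(msg: Message) -> str:
    """Return a key that should be identical for copies of the same message."""
    message_id = msg.get("Message-ID")
    if message_id:
        return str(message_id).strip()
    # Fallback (an assumption): hash a few headers when Message-ID is missing.
    parts = [str(msg.get(name, "")) for name in ("From", "Date", "Subject")]
    return hashlib.sha256("\n".join(parts).encode("utf-8")).hexdigest()


def dedupe(messages: Iterable[Message]) -> Iterator[Message]:
    """Yield each message once, keeping the first copy that was seen."""
    seen = set()
    for msg in messages:
        key = dedupe_key(msg)
        if key in seen:
            continue
        seen.add(key)
        yield msg


# Tiny demonstration with made-up messages: two copies of one, plus another.
copies = [
    message_from_string("Message-ID: <1@example.com>\nSubject: hi\n\nbody"),
    message_from_string("Message-ID: <1@example.com>\nSubject: hi\n\nbody"),
    message_from_string("Message-ID: <2@example.com>\nSubject: yo\n\nbody"),
]
unique = list(dedupe(copies))
```

Because the first copy wins, the order in which the mail files are merged decides which copy survives. That is probably fine when the copies really are identical.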

This email archive has almost 100k emails in it, and it has been cleaned. It is extremely clean: with only a few exceptions, every message is either something I wrote, something I received personally from someone, something I generated, or something I would need to refer back to. I cleaned out all the old email lists where I wasn’t part of the conversation. I cleaned out old advertisements. These were not really spam messages, just things that were not targeted at me personally. Removing these messages took about 4 GB or more off the archive (remember, I also did tons of de-duplication). I thought about what was important. Thinking back to my great-grandfather, I realized I did not care about having every single mail flyer he may have looked at.

To keep it clean, I set up some crazy Gmail filters. There are a lot of email senders now that go directly to the trash. There are others that I don’t necessarily want to hit my inbox. If something hits my inbox, it means one of two things: I either need to see this mail or I need to act on it. If I don’t need to do either of those, I decide whether I need to keep receiving mail from the sender. If I don’t need future mail from the sender, I use a filter to send it directly to the trash when it comes in. I do routinely give my trash a quick glance to see if something has been mismarked. Some data, though, like the list of songs I listened to throughout the day that comes in from last.fm, I want to keep, but I don’t want it in my inbox. That type of data gets moved to a “sort” folder. When I get to this data I will act on it if I need to; otherwise, I put it in its proper place.
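The decision the filters make can be sketched as a small function. This is only an illustration of the logic in Python, not how Gmail filters are actually defined. The sender addresses and the names `TRASH_SENDERS`, `SORT_SENDERS`, and `triage` are made up for the example.

```python
from enum import Enum


class Action(Enum):
    INBOX = "inbox"  # I need to see it or act on it
    SORT = "sort"    # worth keeping, but not in the inbox
    TRASH = "trash"  # no future mail needed from this sender


# Hypothetical sender lists; the real ones live in the Gmail filter settings.
TRASH_SENDERS = {"deals@shop.example"}
SORT_SENDERS = {"reports@music.example"}


def triage(sender: str) -> Action:
    """Decide where a message from the given sender should land."""
    address = sender.strip().lower()
    if address in TRASH_SENDERS:
        return Action.TRASH
    if address in SORT_SENDERS:
        return Action.SORT
    # Anything not covered by a rule still reaches the inbox.
    return Action.INBOX
```

Unknown senders default to the inbox on purpose: a new sender gets looked at once before I decide whether a rule is needed.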

I broke myself free of folderitis by using a simple method. I have a master folder for each year that I have mail from. Under each of these folders, I placed a folder for each of the twelve months. These and the sort folder are all that exist outside of the default folders in my email archive. For example, all the email that came in over the weekend has been moved to the 2010 / 10 – Oct. folder. A generalized time frame has been more than enough for me. This process has taken months to get to, but my time now goes into acting on email rather than sorting it.
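The year and month scheme is simple enough to compute from a message’s date. Here is a small Python sketch. The function name `archive_folder` is hypothetical, and the exact label format (a zero-padded month number and a trailing period on every abbreviation) is an assumption extrapolated from the single “2010 / 10 – Oct.” example.

```python
from datetime import date

# English abbreviations are hard-coded so the result does not depend on locale.
MONTH_ABBR = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
              "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]
EN_DASH = "\u2013"  # the dash used in the folder names


def archive_folder(received: date) -> str:
    """Return the year / month folder label for a message received on this date."""
    abbr = MONTH_ABBR[received.month - 1]
    return f"{received.year} / {received.month:02d} {EN_DASH} {abbr}."
```

Putting the number in front of the month name keeps the twelve folders in calendar order when a mail client sorts them alphabetically.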

I’m in the process of working on photos the same way; I’ll detail that information in a follow-up. Since this is an ongoing thing that will take another few months to get through, there might be some different insights into it.