I became aware of a few new DataBC data sets a few weeks ago, consisting of various spreadsheets of government purchasing card expenditures stretching from the fiscal year ending in 2007 to the fiscal year ending in 2010. (The fiscal years end March 31, hence the mad rush to spend, spend spend…)
But it was only a week ago that I decided to take a closer look.
I was reading Stephen Quinn’s Globe and Mail column — the entry about the non-renaming of BC Place — and one of the comments mentioned a local political blog, which I checked out and which led to a link to the most recent update (FYE11) in the data set of purchasing cards spending.
Now, nothing riles an open-data / data-journalist type more* than tabulated information being presented as a PDF, so I attacked it like the proverbial starving Chihuahua on a pork chop.
In the process, I downloaded the source code for xpdf and compiled it myself instead of using the available binary (because I like doing extra work); ripped the PDF to a text file. Two spaces marked the minimum “border” between table columns, so in TextWrangler I did a grep find and replace on any combination of two or more spaces (\s\s+), assuming double-spaces didn’t occur in between words in the same column, and inserted pipes, i.e. | , which is my delimiter of choice. Only after that, did I check back on the DataBC catalogue and lo! The spreadsheet for the most recent fiscal year (2011) was there, all along. (Lesson learned!? Search a little harder before diving in?)
So now what?** The spreadsheets are offered in either csv or xls format. The Microsoft Excel (xls) files might look pretty, but are actually a pain — the purchasing card information is spread across 30 or so tabs labelled for the 30 or so special offices and ministries that did the spending. See a snapshot of the tabbed organization below:
In the csv files, which are flat dumps (no tabs), the office/ministry name is literally spelled out in an additional column.
So, I’m going to take the five spreadsheets for the past five years, make sure they have a “year’ column, dump out the summary info and paste the raw info into one file… and upload that raw paste to BuzzData for sharing.
Oh, and by “paste” I mean of course to use the command line to concatenate the five files together, since each has around 40,000 rows, and my open spreadsheet programs crap out at the “magic number” of rows: 65,536. (Some newer versions of Excel handle more rows). But anyhow, in my Mac Terminal (bash shell) it’s easy:
cat FYE07.csv FYE08.csv FYE09.csv FYE10.csv FYE11.csv >> ALL.csv
And I got a nice 18.6 MB file. In total, there are 254,443 rows, according to the output of:
wc -l ALL.csv
(Oh yeah… let’s add that row of headers in, and make it an even 254,444 rows.)
Next comes the data cleaning, to try and add value to the quarter of a million rows of purchasing card spending. I know I want to resolve or reconciliate similar same-seeming names; add more location information or addresses where possible; and add at least two “editorial” columns — one where I take my best guess for what type of spending is happening (food, booze, hotel stay, parking, etc.) and another column to add in any other “editor’s notes” that come to mind.
I know that this “data editing” will take some time. This is the real work. So, armed with Google Refine, I begin.
Updates will come.
- – - -
* Okay, there are probably a few things that rile these types more, but I can’t think of one immediately.
** So now what? Besides making this data set useful by doing some editing, I need to get answers to some basic reporting questions: What exactly are B.C. government purchasing cards? What at the supposed rules or limits on using them? How many people have cards in each ministry or special office? What other ways is money spent in each department — and can I create a bigger picture of where money is going with that information?