Phone budgeting, geek style
The Apple iPhone 3G was released here a couple of weeks ago. I've held out a long time getting a modern phone and the iPhone ticks many of the boxes I've been waiting for. Though as I've waited this long, I'm happy to watch how the costs settle and pick the best option for me. The selected carriers have been slow to announce pricing, but now that it's out, I see that it really isn't all that attractive.
So in order to justify a new phone purchase, I decided to do a little research into my current mobile phone habits. Knowing how and when I use my current phone places me in a much more informed position when comparing the available options. Fortunately, Virgin Mobile provide a detailed call history to browse. Unfortunately however, with so much data easily available, I got a little carried away with the analysis.
What follows turns out to be a real-world tutorial on using Excel, Python and R to extract meaning from a table of data.
The first job was to grab the data from Virgin's site and paste it into Excel. Excel is one of the best spreadsheet programs I've come across, and I give that accolade as devout Microsoft critic. One of its strengths is its ability to accept foreign data from the clipboard, and it did not disappoint in this instance. This screenshot actually comes from Appleworks, which I had on hand at the time.

With the data safely in spreadsheet form, I created a cumulative cost column (G1 = F1 + G2, fill down) and plotted it. This graph was produced with Numbers, part of Apple's iWork suite, but of course any graphing tool would help pull some meaning from this column of numbers.

Already I'm getting a sense for my usage patterns. I've only spent about $50 in three months, and the increments are relatively linear. But we have available many more dimensions of data, and it would be a shame to waste them! For dealing with multidimensional data like this, a spreedsheet program quickly becomes inefficient. A statistics program is the ideal tool here, and R is my program of choice.
Once the syntax was nailed down, importing was as simple as copying the rows in Excel, and executing the following statement in R.
> f <- read.table("clipboard", sep="\t", header=TRUE)
That command inserts into the variable f, the data frame existing in the clipboard, treating the first row as a row of column headings.
R will make a natural, but incorrect assumption about the phone number column however, assuming that its contents are ordered numerically. This will not do, since the phone numbers are in fact, just labels for people I've called. This is easily fixed.
> f$Number <- as.factor(f$Number)
R creates "levels" for each number and treats all similar numbers as members of that level. Just what we want, except... some observation shows that the same number is occasionally represented with the superfluous national or international calling prefixes. Time to crack out my favourite programming tool for super fast turn around time. Perl, being well suited for text parsing and editing, would have been a fine choice, but I tend to get carried away looking for the best regular expression one liner to pull it all off, instead of just writing some more natural procedural code. Instead, I turned to Python.
Here's a little script to trim the 1161 and 61 from the start of the longer phone numbers.
#!/usr/bin/python
#
# Trim the standard phone prefixes off
#
import fileinput
INPUT_FILE = "phonenums.txt"
fhout = open("phonenumsout.txt", 'w')
for line in fileinput.input(INPUT_FILE):
if line.startswith("1161"):
fhout.write(line[4:])
elif line.startswith("61"):
fhout.write(line[2:])
else:
fhout.write(line)
I pasted the numbers column into a file, ran the script and pasted the results back into the spreadsheet. A bit more thought would have streamlined the process, but of course, there's usually no need to optimise a process that only happens a couple of times.
Now the numbers are nicely grouped into representations of people. But it becomes clear that a phone number is a pretty unnatural representation of a person, and that a name would be better. Since the data has already been converted to a factor, changing the name of the category (a "level" in R) is straightforward. I executed the following in R, one line for each number I could identify:
> levels(f$Number)[levels(f$Number) == 123456789] <- "Bob"
> levels(f$Number)[levels(f$Number) == 987654321] <- "Terry"
> levels(f$Number)[levels(f$Number) == 192837465] <- "Jane"
Aside: It turns out my Nokia phone will reveal the matching name in my contact book for a number if I enter that number, dial, hang up before it rings, and then look through the "Recent calls" list.
Jumping back into R, the output of the summary command shows that the Number column is already proving valuable, but Time and Duration are significantly less valuable. The exact time of day is not interesting from a statistics point of view, and the Duration column is formatted as minute:second, obfuscating the linear numerical duration value we'd really like to know.
Lets start with Duration. We want to numerically describe the data somehow. To do that it's easy enough to pick a second as the base unit. Python will rip through the minute:second format and give us the number of seconds.
#!/usr/bin/python
#
# Convert min:sec format to seconds
#
import fileinput
INPUT_FILE = "minsec.txt"
fhout = open("sec.txt", 'w')
for line in fileinput.input(INPUT_FILE):
(min, sec) = line.split(':')
secs = int(sec) + 60*int(min)
fhout.write("%d\n" % (secs, ))
Next lets deal with Time. As it stands the data just consists of labels. Since Time actually measures something, it would be much more useful as a number. Specifically, it is a measure of the number of hours and minutes since a reference time (midnight for AM and midday for PM). Lets pick one reference (midnight is logical enough) and convert the hours and minutes to just minutes. Python will do a fine job.
#!/usr/bin/python
#
# Convert hh:mm am/pm format to minutes since midnight
#
import fileinput
INPUT_FILE = "hourMinMeridiem.txt"
fhout = open("mins.txt", 'w')
for line in fileinput.input(INPUT_FILE):
(hour, other) = line.split(':')
(min, meridiem) = other.split(' ')
if hour == "12":
mins = int(min)
else:
mins = int(min) + 60*int(hour)
if meridiem.startswith("PM"):
mins += 12*60
fhout.write("%d\n" % (mins, ))
Right, now we're set to actually analyse this data in R. Already, the summary command is revealing interesting things.

Things that stand out immediately are the breakdown of people I've called, and that I've made twice as many SMSs as phone calls. We still haven't pulled anything interesting from the Time or Duration (Seconds) field. We'll get to those in a tic.
Lets start with a cost breakdown. The top of the Numbers column is concerning me. How much have I spent on this person in the last 3 months? R will reveal all:
> costSuspect <- f$Cost[f$Number == "Suspect"]
> sum(costSuspect)
9.8
Hmm, $9.80. Not bad. Well the Cost summary shows the most I spent on a single call was $2.20. I wonder who I'm calling to spend more than $1.50 on a single conversation? Simple:
> f$Number[f$Cost > 1.5]
Good-o, what about the costs associated with SMSs compared to calls?
> sum(f$Cost[f$Type == "V2V Call" | f$Type == "Standard Call"])
33.3
> sum(f$Cost[f$Type == "V2V SMS" | f$Type == "SMS"])
18
And if I wanted to check out who and when I'm making use of the Virgin to Virgin rates, I'd create a new data frame for analysis:
> f.V2V <- f[f$Type == "V2V Call" | f$Type == "V2V SMS"]
That's enough cost analysis. I'm curious now about the times I use the phone and for how long. I'm really after the frequency patterns - how common certain practices are. A histogram is perfect this. Firstly, call duration:
> hist(f$Seconds[f$Type != "V2V SMS" & f$Type != "SMS"]/60, 38, main="Call Duration", xlab="Minutes")
This will produce a histogram of the seconds data, excluding the SMSs, converting the seconds to minutes, breaking it into 38 sections (since the max is 19 minutes, this will give 30 second divisions), and give it a title and an x-axis label.

And similarly with call times:
> hist(f$Time/60, 48, main="Time of Call", xlab="Hour")
This time I've divided by 60 to get to hours, and split it into 48 sections to break the day into half hours. The resulting histogram quite clearly shows that there are spikes of usage around 8pm and 10pm, which steadily drops off by 5:30 in the morning.

With the significant features extracted from each column, its simple to delve into the specifics. For example, just who was I calling at 5 in the morning? Or who was I talking to for 19 minutes?
> f$Number[f$Time > 5*60 & f$Time < 6*60]
> f$Number[f$Seconds > 15*60]
There's a ton more analysis R is capable of (the pairs(f) command is a good place to continue) but that sums up all I needed to know about my phone habits! Comparing my behaviour with the current spectrum of iPhone plans suggests it may well be worthwhile holding off and watching the fallout...