NHTSA Ninja: Slicing and Dicing Auto Data with F#

The National Highway Traffic Safety Administration (NHTSA) (http://nhtsa.gov) has loads of publicly available data on auto complaints and recalls (although you have to dig a bit to find the actual data files).

Say you want to analyze NHTSA customer complaints, which come as a 500 megabyte (uncompressed) text file named FLAT_CMPL.txt. Here is some code that I hope will show you how easily that can be done with F#.

You can download related code from Gist: https://gist.github.com/28824fca92e020f7fcf5#file_nhtsaslicer.fs.

Here I will step through and explain some parts of the processing.

Let’s load the complaint records…

type ComplaintRecord =
    {Company: string; Make: string; Model: string; Year: string; Date: DateTime}

let complaintsFile = @"C:\ws\nhtsa\FLAT_CMPL\FLAT_CMPL.txt"

let mapComplaint = (fun (line:string) ->
    . . . <converts a text file line to a record – code shown later> . . .)

let complaints =
    complaintsFile
    |> File.ReadLines
    |> Seq.map mapComplaint
    |> Seq.toArray

The above code processes the contents of the file (in about 20 seconds) and creates an array of about 890,000 complaint records.

The type declaration defines a record type, ComplaintRecord, with the attributes of interest: Company, Make, Model, Year and the Date of the record. There are many more attributes in the file but we don’t need them for our analysis.

The last binding, ‘let complaints’, reads the file and creates the actual ComplaintRecord instances. Let’s dig into it to see how functional programming is being applied here:

let complaints =
    complaintsFile
    |> File.ReadLines
    |> Seq.map mapComplaint
    |> Seq.toArray
(This is the last binding; each term is explained below.)

let complaints =
The name ‘complaints’ will be bound to the result of the expression to the right of the ‘=’ sign: the array of complaint records.

|>
This is the pipeline operator. It ‘feeds’ the result from the left-hand side to the computation on the right. We use a number of these to chain a set of computations together in a fluent way.

complaintsFile
The path or location of the NHTSA complaints text file on the local disk.

File.ReadLines
Reads a file from the path given and produces a sequence of lines.

Seq.map mapComplaint
Converts (or maps) the sequence of lines to a sequence of complaint records. Seq.map is a higher order function because it accepts another function, mapComplaint, as an argument. The function mapComplaint does the actual conversion of a line into a ComplaintRecord. Seq.map applies mapComplaint to each line in the sequence (from the previous step) and produces a record for each one. (The code for mapComplaint is given further down.)

Seq.toArray
Converts the sequence to an array structure and forces the actual file to be read and processed. Both File.ReadLines and Seq.map are lazy functions; they don’t do actual work unless called upon to do so. As Seq.toArray executes, it forces Seq.map and File.ReadLines to do the work on a line-by-line basis. File.ReadLines being lazy means that it does not first read the contents of the entire file into memory before converting it into a collection of lines. Essentially we are reading the file in a streaming fashion but are working at a higher level of abstraction, so we don’t have to explicitly code for streaming.
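The lazy behaviour described above can be seen on a small scale. What follows is only a sketch: the in-memory sequence ‘lines’ stands in for File.ReadLines, and ‘toUpper’ and the ‘calls’ counter are hypothetical names introduced here in place of mapComplaint, so that the number of calls can be observed.

```fsharp
// Counter used only to observe when the mapping function actually runs.
let calls = ref 0

// Hypothetical stand-in for File.ReadLines: a lazy sequence of three lines.
let lines = seq { yield "a"; yield "b"; yield "c" }

// Hypothetical stand-in for mapComplaint: records each call, then converts the line.
let toUpper (line: string) =
    calls.Value <- calls.Value + 1
    line.ToUpper()

// Building the pipeline does no work yet; Seq.map only describes the computation.
let pipeline = lines |> Seq.map toUpper
let callsBeforeForcing = calls.Value

// Seq.toArray pulls every element through the pipeline, one at a time.
let result = pipeline |> Seq.toArray
let callsAfterForcing = calls.Value

printfn "calls before: %d, calls after: %d" callsBeforeForcing callsAfterForcing
```

If the pipeline were enumerated a second time, the mapping function would run again for every line, which is one reason to materialize the records once with Seq.toArray before running several queries over them.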

For completeness, here is the mapComplaint function (along with toDate, a supporting date conversion function). Each complaint record is a tab-delimited line (i.e. each field of the record is separated by a tab). The list of fields in a record is given in a separate NHTSA document.

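// Namespaces used by the snippets in this post
open System       // DateTime
open System.IO    // File
open System.Xml   // XmlConvert

// Converts a yyyyMMdd string to a DateTime; an empty field becomes DateTime.MinValue
let toDate s =
    if s = "" then DateTime.MinValue
    else XmlConvert.ToDateTime(s, "yyyyMMdd")

// Splits a tab-delimited line and picks out the fields of interest
let mapComplaint = (fun (line:string) ->
    let fs = line.Split('\t')
    let date = fs.[7].Trim() |> toDate
    {Company=fs.[2]; Make=fs.[3]; Model=fs.[4]; Year=fs.[5]; Date=date})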
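To see the mapping at work without the full data file, here is a small sketch that runs it on a single made-up line. The sample values (‘Acme Motors’, ‘ACME’, ‘ROADSTER’ and so on) are invented for illustration and are not taken from the NHTSA file; only the field positions 2 to 5 and 7 follow the code above. The date parsing uses DateTime.ParseExact as a swapped-in alternative to XmlConvert.ToDateTime, so the snippet depends on nothing beyond the System namespaces it opens.

```fsharp
open System
open System.Globalization

type ComplaintRecord =
    { Company: string; Make: string; Model: string; Year: string; Date: DateTime }

// Swapped-in date parser (assumption): DateTime.ParseExact instead of XmlConvert.ToDateTime.
let toDate (s: string) =
    if s = "" then DateTime.MinValue
    else DateTime.ParseExact(s, "yyyyMMdd", CultureInfo.InvariantCulture)

// Same field positions as the mapComplaint function shown in the post.
let mapComplaint (line: string) =
    let fs = line.Split('\t')
    let date = fs.[7].Trim() |> toDate
    { Company = fs.[2]; Make = fs.[3]; Model = fs.[4]; Year = fs.[5]; Date = date }

// Made-up tab-delimited line; positions 0, 1 and 6 are filler values.
let sampleLine =
    String.concat "\t" [ "1"; "x"; "Acme Motors"; "ACME"; "ROADSTER"; "2005"; "N"; "20060315" ]

let record = mapComplaint sampleLine

printfn "%s %s %s %s" record.Make record.Model record.Year (record.Date.ToString("yyyy-MM-dd"))
```

Note that a line with fewer than eight fields would raise an exception at fs.[7]; whether the real file contains such lines is not something this sketch checks.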

Now that we have the complaints loaded, let’s play with them…

let complaintCountByModelYear =
    complaints
    |> Seq.filter (fun r -> r.Make = "<put make here e.g. HONDA, etc.>")
    |> Seq.countBy (fun r -> (r.Model, r.Year))
    |> Seq.sortBy (fun (modelYear,count) -> -count)
    |> Seq.toArray

Here we take the 890K complaint records (created in the previous step) and chain a set of higher order functions together to find the count of complaints by model and year for vehicles of the make entered, sorted in descending order of count (in about 0.3 seconds).

Here is what the various chained functions are doing:

Seq.filter (fun r -> r.Make = "<..enter make name..>")
Filters the records down to those of the make entered. Note each record is bound to the name ‘r’, which is just a handle for referring to the record. We could have used any other letter or word instead of ‘r’.

Seq.countBy (fun r -> (r.Model, r.Year))
Creates counts for each model and year combination. Note that here we are counting by a compound key (Model and Year), relying upon tuples and structural equality.

Seq.sortBy (fun (modelYear,count) -> -count)
Sorts the resulting list in descending order of count. Seq.sortBy sorts in ascending order, so negating the count puts the largest counts first.

Seq.toArray
As before, Seq.toArray forces the entire computation to execute and creates an array of ((model, year), count) pairs.
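The same chain can be tried on a handful of in-memory values. This is a minimal sketch: the (make, model, year) triples below are made up and stand in for ComplaintRecord values, and Seq.sortByDescending is used here as an alternative to negating the count.

```fsharp
// Made-up (make, model, year) triples standing in for complaint records.
let sample =
    [ ("ACME", "ROADSTER", "2005")
      ("ACME", "ROADSTER", "2005")
      ("ACME", "ROADSTER", "2006")
      ("ACME", "WAGON", "2005")
      ("ACME", "ROADSTER", "2005")
      ("OTHER", "COUPE", "2005") ]

let counts =
    sample
    // Keep only the make of interest.
    |> Seq.filter (fun (make, _, _) -> make = "ACME")
    // Count by the compound (model, year) key; tuples compare structurally.
    |> Seq.countBy (fun (_, model, year) -> (model, year))
    // Largest count first; snd picks the count out of each (key, count) pair.
    |> Seq.sortByDescending snd
    |> Seq.toArray

// Print each ((model, year), count) pair.
counts |> Array.iter (fun ((model, year), count) -> printfn "%s %s: %d" model year count)
```

Using Seq.sortByDescending states the intent directly and also works for keys that cannot be negated, such as strings.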