Porter Stemmer in F#

In natural language processing (NLP), one of the first steps is to find the ‘stems’ of words from the ‘paradigm’ forms found in regular text. For example “construction” and “constructing” are paradigms of the stem “construct”. (Note there is a *lot* more to NLP, even stemming, than what can be described here).

For textual analysis, one is usually interested in the stems. One of the best known algorithms for stemming is Porter’s Stemming Algorithm, described in this paper: http://tartarus.org/martin/PorterStemmer/def.txt.

The corresponding F# implementation is here: http://fssnip.net/eR

On the main site – http://tartarus.org/martin/PorterStemmer/ – there is a list of implementations of the Porter algorithm in different languages. It is interesting to compare the implementations to see how different languages deal with essentially the same algorithm.

I have not looked exhaustively but I believe the Haskell implementation is the shortest and the cleanest (only about 150 lines). The F# code I wrote is about 190 lines.

In terms of readability, there are parts of the Haskell code that are very obtuse. It takes quite a bit of mental parsing to figure out what is going on. By comparison, I wrote the F# code by reading the paper section by section and just translating that to code. If you look at the rules in the original paper, you will find an almost exact correspondence in the F# code.

For F# I used Active Patterns extensively for pattern matching and as a result you get a Prolog like feel to the code.

2 thoughts on “Porter Stemmer in F#

  1. Pingback: F# Weekly #44, 2012 | Sergey Tihon's Blog

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s