In natural language processing (NLP), one of the first steps is to find the ‘stems’ of words from the ‘paradigm’ forms found in regular text. For example “construction” and “constructing” are paradigms of the stem “construct”. (Note there is a *lot* more to NLP, even stemming, than what can be described here).
For textual analysis, one is usually interested in the stems. One of the best known algorithms for stemming is Porter’s Stemming Algorithm, described in this paper: http://tartarus.org/martin/PorterStemmer/def.txt.
The corresponding F# implementation is here: http://fssnip.net/eR
On the main site – http://tartarus.org/martin/PorterStemmer/ – there is a list of implementations of the Porter algorithm in different languages. It is interesting to compare the implementations to see how different languages deal with essentially the same algorithm.
I have not looked exhaustively but I believe the Haskell implementation is the shortest and the cleanest (only about 150 lines). The F# code I wrote is about 190 lines.
In terms of readability, there are parts of the Haskell code that are very obtuse. It takes quite a bit of mental parsing to figure out what is going on. By comparison, I wrote the F# code by reading the paper section by section and just translating that to code. If you look at the rules in the original paper, you will find an almost exact correspondence in the F# code.
For F# I used Active Patterns extensively for pattern matching and as a result you get a Prolog like feel to the code.
Pingback: F# Weekly #44, 2012 | Sergey Tihon's Blog
its an english porter stemmer. not a general porter stemmer. try again.