skillner.word_processing.PorterStemmer

class skillner.word_processing.PorterStemmer(to_lowercase=True, mode='NLTK_EXTENSIONS')

A word stemmer based on the Porter stemming algorithm.

An instance of PorterStemmer is callable, so it can be applied directly to a word like a function.

Parameters:
to_lowercase: bool, default=True

If True, the word is lowercased before stemming.

mode: {'NLTK_EXTENSIONS', 'MARTIN_EXTENSIONS', 'ORIGINAL_ALGORITHM'}, default='NLTK_EXTENSIONS'

Mode used to stem a word. See the Notes section for details on each mode.

Notes

Martin Porter has endorsed several modifications to the Porter algorithm since writing his original paper, and those extensions are included in the implementations on his website. Others, including NLTK contributors, have proposed further improvements to the algorithm. There are thus three modes, selected by passing the appropriate constant as the mode parameter of the class constructor:

PorterStemmer.ORIGINAL_ALGORITHM

An implementation that is faithful to the original paper.

Note that Martin Porter has deprecated this version of the algorithm. Martin distributes implementations of the Porter Stemmer in many languages, hosted at:

https://www.tartarus.org/~martin/PorterStemmer/

and all of these implementations include his extensions. He strongly recommends against using the original, published version of the algorithm; only use this mode if you clearly understand why you are choosing to do so.

PorterStemmer.MARTIN_EXTENSIONS

An implementation that only uses the modifications to the algorithm that are included in the implementations on Martin Porter’s website. He has declared Porter frozen, so the behavior of those implementations should never change.

PorterStemmer.NLTK_EXTENSIONS (default)

An implementation that includes further improvements devised by NLTK contributors or taken from other modified implementations found on the web.

For the best stemming, you should use the default NLTK_EXTENSIONS version. However, if you need to get the same results as either the original algorithm or one of Martin Porter’s hosted versions for compatibility with an existing implementation or dataset, you can use one of the other modes instead.
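
To make the mode mechanism concrete, here is a minimal sketch of how a mode constant can change a stemmer's output. It is not the skillner or NLTK implementation: the class name `Step1aSketch` is hypothetical, it accepts plain strings rather than `Word` objects, and it applies only Step 1a of the algorithm (the plural-suffix rules from the 1980 paper). The extra rule for four-letter words ending in "ies" reflects our reading of the NLTK source and should be treated as an assumption.

```python
class Step1aSketch:
    """Hypothetical illustration: applies only Step 1a of the Porter algorithm."""

    # Mode constants exposed as class attributes, using the names documented above.
    NLTK_EXTENSIONS = "NLTK_EXTENSIONS"
    MARTIN_EXTENSIONS = "MARTIN_EXTENSIONS"
    ORIGINAL_ALGORITHM = "ORIGINAL_ALGORITHM"

    # Step 1a suffix rules from Porter (1980), tried in order; the first match wins.
    _RULES = (("sses", "ss"), ("ies", "i"), ("ss", "ss"), ("s", ""))

    def __init__(self, to_lowercase=True, mode=NLTK_EXTENSIONS):
        valid_modes = (
            self.NLTK_EXTENSIONS,
            self.MARTIN_EXTENSIONS,
            self.ORIGINAL_ALGORITHM,
        )
        if mode not in valid_modes:
            raise ValueError(f"Unknown mode: {mode!r}")
        self.to_lowercase = to_lowercase
        self.mode = mode

    def stem(self, word):
        if self.to_lowercase:
            word = word.lower()
        # Assumed NLTK-only rule: four-letter words ending in "ies" keep the "e".
        if (
            self.mode == self.NLTK_EXTENSIONS
            and len(word) == 4
            and word.endswith("ies")
        ):
            return word[:-3] + "ie"
        for suffix, replacement in self._RULES:
            if word.endswith(suffix):
                return word[: len(word) - len(suffix)] + replacement
        return word

    def __call__(self, word):
        # Function-like behavior: calling the instance delegates to stem().
        return self.stem(word)


default_stemmer = Step1aSketch()
original_stemmer = Step1aSketch(mode=Step1aSketch.ORIGINAL_ALGORITHM)

print(default_stemmer("Dies"))       # die
print(original_stemmer("Dies"))      # di
print(default_stemmer("caresses"))   # caress
```

In this sketch the two modes differ only on short words such as "dies". A full stemmer applies several further steps after Step 1a, so these outputs are intermediate results and not final stems.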

References

This class is a copy of the NLTK PorterStemmer class with slight modifications; see https://github.com/nltk/nltk/blob/develop/nltk/stem/porter.py

Refer to the original paper:

Porter, M. “An algorithm for suffix stripping.” Program 14.3 (1980): 130-137.

and to https://www.tartarus.org/~martin/PorterStemmer/ for the homepage of the algorithm.

__init__(to_lowercase=True, mode='NLTK_EXTENSIONS')

Methods

__init__([to_lowercase, mode])

stem(word)

Stem word.

Attributes

MARTIN_EXTENSIONS

NLTK_EXTENSIONS

ORIGINAL_ALGORITHM

stem(word: Word)

Stem the given word.

Parameters:
word: Word

The word to stem.