Copyright | (c) CNRS 2017-Present |
---|---|
License | AGPL + CECILL v3 |
Maintainer | team@gargantext.org |
Stability | experimental |
Portability | POSIX |
Safe Haskell | Safe-Inferred |
Language | Haskell2010 |
In linguistic morphology and information retrieval, stemming is the
process of reducing inflected (or sometimes derived) words to their word
stem, base or root form, generally a written word form. The stem need
not be identical to the morphological root of the word; it is usually
sufficient that related words map to the same stem, even if this stem is
not in itself a valid root.
Source: https://en.wikipedia.org/wiki/Stemming
A stemmer for English, for example, should identify the string "cats"
(and possibly "catlike", "catty" etc.) as based on the root "cat", and
"stems", "stemmer", "stemming", "stemmed" as based on "stem". A stemming
algorithm reduces the words "fishing", "fished", and "fisher" to the
root word, "fish". On the other hand, "argue", "argued", "argues",
"arguing", and "argus" reduce to the stem "argu" (illustrating the
case where the stem is not itself a word or root) but "argument" and
"arguments" reduce to the stem "argument".
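The reductions described above can be illustrated with a deliberately naive suffix stripper. This is only a sketch: `toyStem` and its rule list are hypothetical names invented here, the rules are not those of the Porter or Lancaster algorithms, and the code does not use this module's `stem` function.

```haskell
import Data.List (isSuffixOf)

-- Hypothetical toy rule set, tried in this order. These are NOT the Porter rules.
suffixes :: [String]
suffixes = ["ing", "ed", "er", "s"]

-- Strip the first matching suffix, provided at least three characters remain.
-- Words with no applicable suffix are returned unchanged.
toyStem :: String -> String
toyStem w =
  case [ take n w
       | s <- suffixes
       , s `isSuffixOf` w
       , let n = length w - length s
       , n >= 3
       ] of
    (st : _) -> st
    []       -> w
```

With these rules, `toyStem` maps "fishing", "fished" and "fisher" to "fish", and "cats" to "cat". It does not handle cases such as doubled consonants, which real stemming algorithms cover with additional rules.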
Synopsis
- data StemmingAlgorithm
- stem :: Lang -> StemmingAlgorithm -> Text -> Text
- data Lang
Types
data StemmingAlgorithm #
A stemming algorithm. There are different stemming algorithms, each with different tradeoffs, strengths and weaknesses. Typically one picks an algorithm based on the task at hand.
PorterAlgorithm | The Porter algorithm is the classic stemming algorithm, possibly one of the most widely used. |
LancasterAlgorithm | Slight variation of the Porter algorithm; it is more aggressive with stemming, which might or might not be what you want. It also makes some subtle changes to the stem; for example, the stemming of "dancer" using Porter is simply "dancer" (i.e. it cannot be further stemmed). Using Lancaster we would get "dant", which is no longer a prefix of the initial word. |
GargPorterAlgorithm | A variation of the Porter algorithm tailored for Gargantext. |
Instances
Show StemmingAlgorithm # | |
Defined in Gargantext.Core.Text.Terms.Mono.Stem showsPrec :: Int -> StemmingAlgorithm -> ShowS # show :: StemmingAlgorithm -> String # showList :: [StemmingAlgorithm] -> ShowS # | |
Eq StemmingAlgorithm # | |
Defined in Gargantext.Core.Text.Terms.Mono.Stem (==) :: StemmingAlgorithm -> StemmingAlgorithm -> Bool # (/=) :: StemmingAlgorithm -> StemmingAlgorithm -> Bool # | |
Ord StemmingAlgorithm # | |
Defined in Gargantext.Core.Text.Terms.Mono.Stem compare :: StemmingAlgorithm -> StemmingAlgorithm -> Ordering # (<) :: StemmingAlgorithm -> StemmingAlgorithm -> Bool # (<=) :: StemmingAlgorithm -> StemmingAlgorithm -> Bool # (>) :: StemmingAlgorithm -> StemmingAlgorithm -> Bool # (>=) :: StemmingAlgorithm -> StemmingAlgorithm -> Bool # max :: StemmingAlgorithm -> StemmingAlgorithm -> StemmingAlgorithm # min :: StemmingAlgorithm -> StemmingAlgorithm -> StemmingAlgorithm # |
Universal stemming function
stem :: Lang -> StemmingAlgorithm -> Text -> Text #
Stems the input Text based on the input Lang and using the given StemmingAlgorithm.
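Because related words share a stem, a stemming function can serve as a grouping key. The sketch below is an illustration under assumptions: `groupByStem` and `prefixStemmer` are hypothetical helpers that are not part of this module, and `prefixStemmer` is a crude stand-in so that the example runs without Gargantext. Given the signature above, the partial application `stem EN PorterAlgorithm` (of type `Text -> Text`) should be usable as the stemmer argument instead.

```haskell
import Data.Function (on)
import Data.List (groupBy, sortOn)

-- Group items that share the same stem. The stemmer is passed in as an
-- argument, so any function of the right shape can be plugged in.
groupByStem :: Ord k => (a -> k) -> [a] -> [(k, [a])]
groupByStem stemmer ws =
  [ (k, map snd grp)
  | grp@((k, _) : _) <-
      groupBy ((==) `on` fst) (sortOn fst [ (stemmer w, w) | w <- ws ])
  ]

-- Hypothetical stand-in for a real stemmer: keeps the first four characters.
-- It exists only to make this example self-contained.
prefixStemmer :: String -> String
prefixStemmer = take 4
```

For instance, `groupByStem prefixStemmer ["fishing", "fished", "argue", "fisher"]` puts the three "fish" words into one group and "argue" into another. Groups are ordered by stem, and words within a group keep their input order because `sortOn` is stable.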
Handy re-exports
Language of a Text. For simplicity, we assume that a text is written in a single, homogeneous language.
- EN == english
- FR == french
- DE == german
- IT == italian
- ES == spanish
- PL == polish
- ZH == chinese
... add your language and help us to implement it (:
All languages supported. NOTE: Use international country codes (https://en.wikipedia.org/wiki/List_of_ISO_3166_country_codes). TODO: This should be deprecated in favor of the iso-639 library.