gargantext-0.0.4.9.5: Search, map, share
Copyright: (c) CNRS 2017 - present
License: AGPL + CECILL v3
Maintainer: team@gargantext.org
Stability: experimental
Portability: POSIX
Safe Haskell: None
Language: Haskell2010

Gargantext.Core.Text.Terms

Description

An n-gram is a contiguous sequence of n items from a given sample of text. In the Gargantext application, the items are words and n is a non-negative integer.

Using Latin numerical prefixes, an n-gram of size 1 is referred to as a "unigram"; size 2 is a "bigram" (or, less commonly, a "digram"); size 3 is a "trigram". English cardinal numbers are sometimes used, e.g., "four-gram", "five-gram", and so on.

Source: https://en.wikipedia.org/wiki/Ngrams
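
As a rough illustration of the definition above, the sketch below enumerates the word n-grams of a text. The name wordNgrams is hypothetical and is not part of Gargantext.Core.Text.Terms; splitting on whitespace with Data.Text.words is an assumption made for brevity, not the tokenisation Gargantext uses.

```haskell
import Data.List (tails)
import Data.Text (Text)
import qualified Data.Text as T

-- | All contiguous word sequences of length n in the input.
-- Hypothetical helper for illustration only: words are obtained by
-- splitting on whitespace, and n <= 0 yields no n-grams.
wordNgrams :: Int -> Text -> [[Text]]
wordNgrams n txt
  | n <= 0    = []
  | otherwise = takeWhile ((== n) . length)    -- stop once fewer than n words remain
              . map (take n)                   -- window of n words at each position
              . tails                          -- every suffix of the word list
              $ T.words txt
```

For example, wordNgrams 2 (T.pack "search map share") yields the two bigrams ["search","map"] and ["map","share"], while wordNgrams 1 yields one unigram per word.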

TODO:
- group Ngrams -> Tree
- compute occ by node of Tree
- group occs according to groups
- compute cooccurrences
- compute graph


Documentation

data TermType lang Source #

Constructors

Mono 

Fields

Multi 

Fields

MonoMulti 

Fields

Unsupervised 

Fields

Instances

Generic (TermType lang) Source # 

Defined in Gargantext.Core.Text.Terms

Associated Types

type Rep (TermType lang) :: Type -> Type #

Methods

from :: TermType lang -> Rep (TermType lang) x #

to :: Rep (TermType lang) x -> TermType lang #

type Rep (TermType lang) Source # 

Defined in Gargantext.Core.Text.Terms

type Rep (TermType lang)

tt_windowSize :: forall lang. Traversal' (TermType lang) Int Source #

tt_ngramsSize :: forall lang. Traversal' (TermType lang) Int Source #

tt_model :: forall lang. Traversal' (TermType lang) (Maybe (Tries Token ())) Source #

tt_lang :: forall lang1 lang2. Lens (TermType lang1) (TermType lang2) lang1 lang2 Source #

extractTerms :: TermType Lang -> [Text] -> IO [[Terms]] Source #

Sugar to extract terms from a list of texts (hiding mapM from the end user). Possible generalisation: extractTerms :: Traversable t => TermType Lang -> t Text -> IO (t [Terms])
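
The note about hiding mapM can be sketched as follows. This is only an illustration of the pattern, not the module's implementation: termsStub is a hypothetical stand-in for terms (it just splits on whitespace and ignores any TermType), and extractList and extractAny are hypothetical names mirroring the list-specialised and Traversable-generalised signatures.

```haskell
import Data.Text (Text)
import qualified Data.Text as T

-- Hypothetical stand-in for `terms`: one text in, a list of terms out, in IO.
-- The real function also takes a `TermType Lang`; that argument is omitted here.
termsStub :: Text -> IO [Text]
termsStub = pure . T.words

-- List-specialised wrapper: the caller passes a list and never writes mapM.
extractList :: [Text] -> IO [[Text]]
extractList = mapM termsStub

-- Generalised variant: the same body type-checks for any Traversable
-- container, because Prelude's mapM is defined on Traversable.
extractAny :: Traversable t => t Text -> IO (t [Text])
extractAny = mapM termsStub
```

With the generalised form, the same call works on a list, a Maybe, or any other Traversable container, and the shape of the container is preserved in the result.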

data ExtractedNgrams Source #

Constructors

SimpleNgrams 

Fields

EnrichedNgrams 

cleanNgrams :: Int -> Ngrams -> Ngrams Source #

terms :: TermType Lang -> Text -> IO [Terms] Source #

Terms from Text.
Mono: mono terms
Multi: multi terms
MonoMulti: mono and multi terms
TODO: multi terms should exclude mono terms (the intersection is not empty yet)

type WindowSize = Int Source #

Unsupervised ngrams extraction: language-agnostic extraction.
TODO: remove IO
TODO: newtype BlockText

newTries :: Int -> Text -> Tries Token () Source #

uniText :: Text -> [[Text]] Source #

TODO: remove long terms (> 24)
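
One possible reading of this TODO is a post-processing filter over a result shaped like the output of uniText. The sketch below assumes that "24" is a length in characters and that terms of exactly 24 characters are kept; both are guesses, and maxTermLength and dropLongTerms are hypothetical names that do not exist in the module.

```haskell
import Data.Text (Text)
import qualified Data.Text as T

-- Assumed threshold: the TODO does not say whether 24 counts characters
-- or something else; characters are assumed here.
maxTermLength :: Int
maxTermLength = 24

-- Drop every term longer than the threshold, keeping the outer
-- list structure ([[Text]]) unchanged.
dropLongTerms :: [[Text]] -> [[Text]]
dropLongTerms = map (filter ((<= maxTermLength) . T.length))
```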