Preprocessing pipeline:

1. Filter out short terms
2. Filter out common terms that appear in most documents
3. Filter out rare terms that appear in few documents
4. Keep only nouns
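
The four steps can be illustrated with a minimal dplyr sketch. This is not the package's implementation: the function name `preprocess_sketch`, the argument name `tokens`, and the thresholds `min_chars`, `max_df_ratio`, and `min_df` are hypothetical and chosen only for illustration. The column names `LEMMA`, `UPOSTAG`, `df`, and `df_ratio` are taken from the example output on this page, and the sketch assumes that `df` is the number of documents containing a lemma and `df_ratio` is that count divided by the total number of documents.

```r
library(dplyr)

# Hypothetical sketch of the pipeline; thresholds are illustrative assumptions,
# not the defaults used by preprocess_corpus().
preprocess_sketch <- function(tokens, doc,
                              min_chars = 3,
                              max_df_ratio = 0.9,
                              min_df = 2) {
  # Total number of distinct documents in the input
  n_docs <- n_distinct(pull(tokens, {{ doc }}))

  tokens %>%
    # 1. Drop short terms
    filter(nchar(LEMMA) >= min_chars) %>%
    # Document frequency of each lemma (assumed meaning of df and df_ratio)
    group_by(LEMMA) %>%
    mutate(df = n_distinct({{ doc }}),
           df_ratio = df / n_docs) %>%
    ungroup() %>%
    # 2. Drop terms that appear in most documents
    filter(df_ratio <= max_df_ratio) %>%
    # 3. Drop terms that appear in few documents
    filter(df >= min_df) %>%
    # 4. Keep only nouns (Universal Dependencies tag)
    filter(UPOSTAG == "NOUN")
}

# Toy input: three documents, one token per row
toy <- tibble(
  kunta   = c("A", "A", "B", "B", "C", "C", "C"),
  LEMMA   = c("kunta", "ja", "kunta", "talous", "kunta", "talous", "antaa"),
  UPOSTAG = c("NOUN", "CCONJ", "NOUN", "NOUN", "NOUN", "NOUN", "VERB")
)

result <- preprocess_sketch(toy, kunta)
```

With these assumed thresholds, only the two `talous` rows remain in `result`: `ja` is too short, `kunta` occurs in every document, and `antaa` occurs in a single document. The example output on this page contains a `VERB` row, so the real noun filter may behave differently from step 4 of this sketch.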

Usage

preprocess_corpus(df, doc)

Arguments

df

Data frame in tidytext format, with one token per document per row

doc

Column containing the document ID

Value

The cleaned data frame

Examples

preprocess_corpus(aspol, kunta)
#> # A tibble: 182,564 × 15
#>    kunta   sent ID    FORM  LEMMA UPOSTAG XPOSTAG FEATS HEAD  DEPREL DEPS  MISC 
#>    <chr>  <int> <chr> <chr> <chr> <chr>   <chr>   <chr> <chr> <chr>  <chr> <chr>
#>  1 Enont…     4 1     "KUN… kunta NOUN    _       Case… 0     root   _     "Spa…
#>  2 Enont…     5 1     "VUO… vuok… NOUN    _       Case… 2     nmod:… _     "Spa…
#>  3 Enont…     5 2     "KEH… kehi… NOUN    _       Case… 0     root   _     "Spa…
#>  4 Enont…    19 2     "teh… tehdä VERB    _       Mood… 0     root   _     "_"  
#>  5 Enont…    21 1     "\fS… sisä… NOUN    _       Case… 0     root   _     "Spa…
#>  6 Enont…    21 3     "Joh… johd… NOUN    _       Case… 1     appos  _     "_"  
#>  7 Enont…    24 2     "Väe… väes… NOUN    _       Case… 0     root   _     "Spa…
#>  8 Enont…    26 1     "Eli… elin… NOUN    _       Case… 0     root   _     "_"  
#>  9 Enont…    26 3     "työ… työ#… NOUN    _       Case… 1     conj   _     "Spa…
#> 10 Enont…    38 1     "Kii… kiin… NOUN    _       Case… 4     nsubj… _     "_"  
#> # ℹ 182,554 more rows
#> # ℹ 3 more variables: doc <chr>, df <int>, df_ratio <dbl>