Examples of preprocessing text data in a data frame with dplyr and stringr. Many of the preprocessing steps introduced here can also be carried out after conversion to a document-term matrix with the quanteda package, which is presented in the next articles.

The aim is to filter out uninteresting and potentially erroneous terms. Background reading on preprocessing text data:
* Ristilä A. & K. Elo (2023). Observing political and societal changes in Finnish parliamentary speech data, 1980–2010, with topic modelling. Parliaments, Estates and Representation, 43:2, 149–176.
* Denny M. J. & A. Spirling (2018). Text Preprocessing For Unsupervised Learning: Why It Matters, When It Misleads, And What To Do About It. Political Analysis, 26:2, 168–189.

Common preprocessing steps include:
1. Select only nouns, verbs and adjectives
2. Remove words which contain numbers
3. Filter out foreign terms
4. Drop terms which appear in less than 0.5% of documents (at least 5 docs here)
5. Drop terms which appear in more than 99% of documents (at most 65 docs here)
6. Additionally, we drop very short terms (under 3 characters in the original FORM or under 4 characters in the base LEMMA, as in the code below)

The original data contains 451,660 terms in total.

aspol |> count(LEMMA, sort = TRUE)
#> # A tibble: 34,935 × 2
#>    LEMMA        n
#>    <chr>    <int>
#>  1 .        26183
#>  2 ja       17283
#>  3 olla     16098
#>  4 ,        15519
#>  5 !         8029
#>  6 kaupunki  4081
#>  7 )         4029
#>  8 asunto    3954
#>  9 vuosi     3685
#> 10 (         3408
#> # ℹ 34,925 more rows

Let’s take a closer look at what we throw away at each step of the preprocessing pipeline.

Keeping only nouns, verbs and adjectives drops the following terms:

aspol |> filter(!UPOSTAG %in% c("NOUN", "VERB", "ADJ")) |>  # Note negation operator `!` here
  count(LEMMA, sort = TRUE)
#> # A tibble: 16,812 × 2
#>    LEMMA     n
#>    <chr> <int>
#>  1 .     25866
#>  2 ja    17278
#>  3 ,     15519
#>  4 olla  14602
#>  5 !      8028
#>  6 )      3468
#>  7 (      3407
#>  8 %      3066
#>  9 joka   2710
#> 10 sekä   2307
#> # ℹ 16,802 more rows

Now we keep only nouns, verbs and adjectives. Which of those words contain numbers (at least in their original FORM)?

aspol |>
  filter(UPOSTAG %in% c("NOUN", "VERB", "ADJ")) |>
  filter(str_detect(FORM, "[0-9]+")) |> 
  count(LEMMA, sort = TRUE)
#> # A tibble: 594 × 2
#>    LEMMA             n
#>    <chr>         <int>
#>  1 1.              124
#>  2 2000#luku       110
#>  3 2010#luku       107
#>  4 m2               90
#>  5 2.               87
#>  6 3.               87
#>  7 pl 1             85
#>  8 2030#luku        66
#>  9 4.               56
#> 10 menrer#mammen    42
#> # ℹ 584 more rows

Note: the FORM m2 gets the LEMMA menrer#mammen!?
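To see what the number filter matches, here is a minimal sketch on a few made-up tokens. The `toy` tibble is hypothetical and not part of aspol. The pattern `[0-9]+` flags any FORM that contains at least one digit, so mixed tokens such as m2 are caught along with plain numerals.

```r
library(dplyr)
library(stringr)

# Hypothetical tokens, for illustration only
toy <- tibble(FORM = c("asunto", "m2", "2030-luvulla", "1.", "kaupunki"))

# Flag forms that contain at least one digit
toy <- toy |> mutate(has_digit = str_detect(FORM, "[0-9]+"))

# Keep only forms without digits, as in the pipeline
no_digits <- toy |> filter(!has_digit)
no_digits$FORM
```

Because the check runs on FORM rather than LEMMA, a token is dropped whenever its surface form contains a digit, regardless of what lemma the parser assigned to it.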

Next, nouns, verbs or adjectives that contain no numbers but are marked as foreign terms. These are mostly nonsense here. We have already got rid of real foreign terms by keeping only nouns, verbs and adjectives, since foreign terms have UPOSTAG == "X".

aspol |>
  filter(UPOSTAG %in% c("NOUN", "VERB", "ADJ")) |>
  filter(!str_detect(FORM, "[0-9]+")) |>  # Note negation operator `!`
  filter(str_detect(FEATS, "Foreign=Yes")) |>  
  count(LEMMA, sort = TRUE)
#> # A tibble: 14 × 2
#>    LEMMA     n
#>    <chr> <int>
#>  1 aa        3
#>  2 ee        2
#>  3 henk      2
#>  4 a         1
#>  5 as        1
#>  6 ea        1
#>  7 eee       1
#>  8 k.        1
#>  9 le        1
#> 10 mer       1
#> 11 pare      1
#> 12 s         1
#> 13 to        1
#> 14 ‘         1

Very common terms, measured by document frequency:

NOTE! Document frequency is easy to compute with the following dplyr functions:
aspol |> distinct(kunta, LEMMA) |> count(LEMMA, name = "doc_freq")
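The difference between term frequency and document frequency is easiest to see on a tiny example. The sketch below uses a hypothetical `toy` tibble (not aspol) with one row per token, where `kunta` identifies the document as in the data above.

```r
library(dplyr)

# Hypothetical mini corpus: one row per token, `kunta` identifies the document
toy <- tibble(
  kunta = c("A", "A", "A", "B", "B", "C"),
  LEMMA = c("asunto", "asunto", "vuosi", "asunto", "kaupunki", "asunto")
)

# Term frequency: counts every occurrence of a lemma
term_freq <- toy |> count(LEMMA, name = "n")

# Document frequency: distinct() first keeps one row per document-lemma pair,
# so count() then counts documents rather than tokens
doc_freq <- toy |> distinct(kunta, LEMMA) |> count(LEMMA, name = "doc_freq")
```

Here "asunto" occurs four times in total but in three documents, so its term frequency is 4 and its document frequency is 3.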

aspol |>
  filter(UPOSTAG %in% c("NOUN", "VERB", "ADJ")) |>
  filter(!str_detect(FORM, "[0-9]+")) |>
  filter(!str_detect(FEATS, "Foreign=Yes")) |>
  left_join(
    aspol |> distinct(kunta, LEMMA) |> count(LEMMA, name = "df")  # Adds document-frequency to table with dplyr functions
  ) |>
  filter(df > 65) |>
  count(LEMMA, sort = TRUE)
#> Joining with `by = join_by(LEMMA)`
#> # A tibble: 9 × 2
#>   LEMMA      n
#>   <chr>  <int>
#> 1 asunto  3954
#> 2 vuosi   3683
#> 3 uusi    1505
#> 4 olla    1496
#> 5 )        561
#> 6 .        317
#> 7 se        11
#> 8 ja         5
#> 9 (          1

Very rare terms, measured by document frequency:

aspol |>
  filter(UPOSTAG %in% c("NOUN", "VERB", "ADJ")) |>
  filter(!str_detect(FORM, "[0-9]+")) |>
  filter(!str_detect(FEATS, "Foreign=Yes")) |>
  left_join(
    aspol |> distinct(kunta, LEMMA) |> count(LEMMA, name = "df")
  ) |>
  filter(df < 5) |>
  count(LEMMA, sort = TRUE)
#> Joining with `by = join_by(LEMMA)`
#> # A tibble: 14,828 × 2
#>    LEMMA                        n
#>    <chr>                    <int>
#>  1 V                          100
#>  2 kuusamo                     72
#>  3 mikkelinen                  55
#>  4 R                           53
#>  5 W                           50
#>  6 maan#vuokra#sopimus         49
#>  7 kehys#alue                  47
#>  8 kaupungin#kanslia           46
#>  9 S                           43
#> 10 joukko#liikenne#kaupunki    43
#> # ℹ 14,818 more rows

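Very short terms that remain among nouns, verbs and adjectives after removing numbers, foreign (or nonsense) terms and very common and very rare terms. Note that this filter shows only terms that are short in both FORM and LEMMA, whereas the complete pipeline further down also drops terms that are short in just one of them: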

aspol |>
  filter(UPOSTAG %in% c("NOUN", "VERB", "ADJ")) |>
  filter(!str_detect(FORM, "[0-9]+")) |>
  filter(!str_detect(FEATS, "Foreign=Yes")) |>
  left_join(
    aspol |> distinct(kunta, LEMMA) |> count(LEMMA, name = "df")
  ) |>
  filter(df >= 5, df <= 65) |>
  filter(nchar(FORM) < 3, nchar(LEMMA) < 4) |>
  count(LEMMA, sort = TRUE)
#> Joining with `by = join_by(LEMMA)`
#> # A tibble: 65 × 2
#>    LEMMA     n
#>    <chr> <int>
#>  1 oy      452
#>  2 o       286
#>  3 a       159
#>  4 v.      135
#>  5 x       119
#>  6 /        99
#>  7 I        92
#>  8 ..       61
#>  9 -        53
#> 10 A        51
#> # ℹ 55 more rows
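One detail worth checking is how the two length conditions combine. The filter above requires both a short FORM and a short LEMMA, while the complete pipeline keeps only rows where both are long enough. By De Morgan's law, the rows the pipeline drops are those where either one is short. A minimal sketch with made-up rows (the `toy` tibble is hypothetical):

```r
library(dplyr)

# Hypothetical tokens, for illustration only
toy <- tibble(
  FORM  = c("oy", "on", "asunnot", "v."),
  LEMMA = c("oy", "olla", "asunto", "vuosi")
)

# Rows the pipeline keeps: both FORM and LEMMA are long enough
kept <- toy |> filter(nchar(FORM) >= 3, nchar(LEMMA) >= 4)

# Rows short in BOTH columns (the inspection filter used above)
short_both <- toy |> filter(nchar(FORM) < 3, nchar(LEMMA) < 4)

# Rows the pipeline actually drops: short in EITHER column
dropped <- toy |> filter(nchar(FORM) < 3 | nchar(LEMMA) < 4)
```

In this toy data only "oy" is short in both columns, but "on" (lemma "olla") and "v." (lemma "vuosi") are dropped as well because their FORM is shorter than 3 characters.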

Complete preprocessing pipeline:

aspol |>
  filter(UPOSTAG %in% c("NOUN", "VERB", "ADJ")) |>
  filter(!str_detect(FORM, "[0-9]+")) |>
  filter(!str_detect(FEATS, "Foreign=Yes")) |>
  left_join(
    aspol |> distinct(kunta, LEMMA) |> count(LEMMA, name = "df")
  ) |>
  filter(df >= 5, df <= 65) |>
  filter(nchar(FORM) >= 3, nchar(LEMMA) >= 4)
#> Joining with `by = join_by(LEMMA)`
#> # A tibble: 182,666 × 14
#>    kunta      sent ID    FORM                    LEMMA                  UPOSTAG XPOSTAG FEATS                                                                      HEAD  DEPREL      DEPS  MISC                       doc                           df
#>    <chr>     <int> <chr> <chr>                   <chr>                  <chr>   <chr>   <chr>                                                                      <chr> <chr>       <chr> <chr>                      <chr>                      <int>
#>  1 Enontekiö     4 1     "KUNTA"                 kunta                  NOUN    _       Case=Nom|Number=Sing                                                       0     root        _     "SpacesAfter=\\r\\n\\r\\n" enontekiö_2017-2025.conllu    64
#>  2 Enontekiö     5 1     "VUOKRA-ASUMISEN"       vuokra#asuminen        NOUN    _       Case=Gen|Derivation=Minen|Number=Sing                                      2     nmod:poss   _     "SpacesAfter=\\r\\n"       enontekiö_2017-2025.conllu    43
#>  3 Enontekiö     5 2     "KEHITTÄMISSUUNNITELMA" kehittämis#suunnitelma NOUN    _       Case=Nom|Number=Sing                                                       0     root        _     "SpacesAfter=\\r\\n"       enontekiö_2017-2025.conllu    17
#>  4 Enontekiö    19 2     "tehemä"                tehdä                  VERB    _       Mood=Ind|Number=Sing|Person=3|Style=Coll|Tense=Past|VerbForm=Fin|Voice=Act 0     root        _     "_"                        enontekiö_2017-2025.conllu    63
#>  5 Enontekiö    21 1     "\fSisällysluettelo"    sisällys#luettelo      NOUN    _       Case=Nom|Number=Sing                                                       0     root        _     "SpacesAfter=\\r\\n"       enontekiö_2017-2025.conllu    16
#>  6 Enontekiö    21 3     "Johdanto"              johdanto               NOUN    _       Case=Nom|Number=Sing                                                       1     appos       _     "_"                        enontekiö_2017-2025.conllu    34
#>  7 Enontekiö    24 2     "Väestö"                väestö                 NOUN    _       Case=Nom|Number=Sing                                                       0     root        _     "SpaceAfter=No"            enontekiö_2017-2025.conllu    54
#>  8 Enontekiö    26 1     "Elinkeinoelämä"        elin#keino#elämä       NOUN    _       Case=Nom|Number=Sing                                                       0     root        _     "_"                        enontekiö_2017-2025.conllu    33
#>  9 Enontekiö    26 3     "työpaikat"             työ#paikka             NOUN    _       Case=Nom|Number=Plur                                                       1     conj        _     "SpaceAfter=No"            enontekiö_2017-2025.conllu    47
#> 10 Enontekiö    38 1     "Kiinteistöyhtiö"       kiinteistö#yhtiö       NOUN    _       Case=Nom|Number=Sing                                                       4     nsubj:cop   _     "_"                        enontekiö_2017-2025.conllu    12
#> 11 Enontekiö    38 3     "kunnan"                kunta                  NOUN    _       Case=Gen|Number=Sing                                                       4     nmod:poss   _     "_"                        enontekiö_2017-2025.conllu    64
#> 12 Enontekiö    39 1     "Taloustilanne"         talous#tilanne         NOUN    _       Case=Nom|Number=Sing                                                       0     root        _     "_"                        enontekiö_2017-2025.conllu    11
#> 13 Enontekiö    41 1     "Tulevaisuuden"         tulevaisuus            NOUN    _       Case=Gen|Number=Sing                                                       2     nmod:poss   _     "_"                        enontekiö_2017-2025.conllu    56
#> 14 Enontekiö    42 1     "Kiinteistöyhtiö"       kiinteistö#yhtiö       NOUN    _       Case=Nom|Number=Sing                                                       2     compound:nn _     "_"                        enontekiö_2017-2025.conllu    12
#> 15 Enontekiö    43 2     "teette"                tehdä                  VERB    _       Mood=Ind|Number=Plur|Person=2|Tense=Pres|VerbForm=Fin|Voice=Act            0     root        _     "_"                        enontekiö_2017-2025.conllu    63
#> 16 Enontekiö    44 1     "Taloustilanne"         talous#tilanne         NOUN    _       Case=Nom|Number=Sing                                                       10    nsubj       _     "_"                        enontekiö_2017-2025.conllu    11
#> 17 Enontekiö    46 1     "Tulevaisuuden"         tulevaisuus            NOUN    _       Case=Gen|Number=Sing                                                       2     nmod:poss   _     "_"                        enontekiö_2017-2025.conllu    56
#> 18 Enontekiö    46 2     "näkymät"               näkymä                 NOUN    _       Case=Nom|Number=Plur                                                       0     root        _     "SpaceAfter=No"            enontekiö_2017-2025.conllu    16
#> 19 Enontekiö    48 1     "Kunnan"                kunta                  NOUN    _       Case=Gen|Number=Sing                                                       2     nsubj       _     "_"                        enontekiö_2017-2025.conllu    64
#> 20 Enontekiö    48 2     "omistamat"             omistaa                VERB    _       Case=Nom|Degree=Pos|Number=Plur|PartForm=Agt|VerbForm=Part|Voice=Act       4     acl         _     "_"                        enontekiö_2017-2025.conllu    55
#> # ℹ 182,646 more rows

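Complete preprocessing pipeline as a single function. This applies the same steps as above but is easier to type. Note that the output has an additional df_ratio column and slightly fewer rows (182,564 versus 182,666), so the two results are not strictly identical.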

aspol |> preprocess_corpus(doc = kunta)
#> # A tibble: 182,564 × 15
#>    kunta      sent ID    FORM                    LEMMA                  UPOSTAG XPOSTAG FEATS                                                                      HEAD  DEPREL      DEPS  MISC                       doc                           df df_ratio
#>    <chr>     <int> <chr> <chr>                   <chr>                  <chr>   <chr>   <chr>                                                                      <chr> <chr>       <chr> <chr>                      <chr>                      <int>    <dbl>
#>  1 Enontekiö     4 1     "KUNTA"                 kunta                  NOUN    _       Case=Nom|Number=Sing                                                       0     root        _     "SpacesAfter=\\r\\n\\r\\n" enontekiö_2017-2025.conllu    64    0.941
#>  2 Enontekiö     5 1     "VUOKRA-ASUMISEN"       vuokra#asuminen        NOUN    _       Case=Gen|Derivation=Minen|Number=Sing                                      2     nmod:poss   _     "SpacesAfter=\\r\\n"       enontekiö_2017-2025.conllu    43    0.632
#>  3 Enontekiö     5 2     "KEHITTÄMISSUUNNITELMA" kehittämis#suunnitelma NOUN    _       Case=Nom|Number=Sing                                                       0     root        _     "SpacesAfter=\\r\\n"       enontekiö_2017-2025.conllu    17    0.25 
#>  4 Enontekiö    19 2     "tehemä"                tehdä                  VERB    _       Mood=Ind|Number=Sing|Person=3|Style=Coll|Tense=Past|VerbForm=Fin|Voice=Act 0     root        _     "_"                        enontekiö_2017-2025.conllu    63    0.926
#>  5 Enontekiö    21 1     "\fSisällysluettelo"    sisällys#luettelo      NOUN    _       Case=Nom|Number=Sing                                                       0     root        _     "SpacesAfter=\\r\\n"       enontekiö_2017-2025.conllu    16    0.235
#>  6 Enontekiö    21 3     "Johdanto"              johdanto               NOUN    _       Case=Nom|Number=Sing                                                       1     appos       _     "_"                        enontekiö_2017-2025.conllu    34    0.5  
#>  7 Enontekiö    24 2     "Väestö"                väestö                 NOUN    _       Case=Nom|Number=Sing                                                       0     root        _     "SpaceAfter=No"            enontekiö_2017-2025.conllu    54    0.794
#>  8 Enontekiö    26 1     "Elinkeinoelämä"        elin#keino#elämä       NOUN    _       Case=Nom|Number=Sing                                                       0     root        _     "_"                        enontekiö_2017-2025.conllu    33    0.485
#>  9 Enontekiö    26 3     "työpaikat"             työ#paikka             NOUN    _       Case=Nom|Number=Plur                                                       1     conj        _     "SpaceAfter=No"            enontekiö_2017-2025.conllu    47    0.691
#> 10 Enontekiö    38 1     "Kiinteistöyhtiö"       kiinteistö#yhtiö       NOUN    _       Case=Nom|Number=Sing                                                       4     nsubj:cop   _     "_"                        enontekiö_2017-2025.conllu    12    0.176
#> 11 Enontekiö    38 3     "kunnan"                kunta                  NOUN    _       Case=Gen|Number=Sing                                                       4     nmod:poss   _     "_"                        enontekiö_2017-2025.conllu    64    0.941
#> 12 Enontekiö    39 1     "Taloustilanne"         talous#tilanne         NOUN    _       Case=Nom|Number=Sing                                                       0     root        _     "_"                        enontekiö_2017-2025.conllu    11    0.162
#> 13 Enontekiö    41 1     "Tulevaisuuden"         tulevaisuus            NOUN    _       Case=Gen|Number=Sing                                                       2     nmod:poss   _     "_"                        enontekiö_2017-2025.conllu    56    0.824
#> 14 Enontekiö    42 1     "Kiinteistöyhtiö"       kiinteistö#yhtiö       NOUN    _       Case=Nom|Number=Sing                                                       2     compound:nn _     "_"                        enontekiö_2017-2025.conllu    12    0.176
#> 15 Enontekiö    43 2     "teette"                tehdä                  VERB    _       Mood=Ind|Number=Plur|Person=2|Tense=Pres|VerbForm=Fin|Voice=Act            0     root        _     "_"                        enontekiö_2017-2025.conllu    63    0.926
#> 16 Enontekiö    44 1     "Taloustilanne"         talous#tilanne         NOUN    _       Case=Nom|Number=Sing                                                       10    nsubj       _     "_"                        enontekiö_2017-2025.conllu    11    0.162
#> 17 Enontekiö    46 1     "Tulevaisuuden"         tulevaisuus            NOUN    _       Case=Gen|Number=Sing                                                       2     nmod:poss   _     "_"                        enontekiö_2017-2025.conllu    56    0.824
#> 18 Enontekiö    46 2     "näkymät"               näkymä                 NOUN    _       Case=Nom|Number=Plur                                                       0     root        _     "SpaceAfter=No"            enontekiö_2017-2025.conllu    16    0.235
#> 19 Enontekiö    48 1     "Kunnan"                kunta                  NOUN    _       Case=Gen|Number=Sing                                                       2     nsubj       _     "_"                        enontekiö_2017-2025.conllu    64    0.941
#> 20 Enontekiö    48 2     "omistamat"             omistaa                VERB    _       Case=Nom|Degree=Pos|Number=Plur|PartForm=Agt|VerbForm=Part|Voice=Act       4     acl         _     "_"                        enontekiö_2017-2025.conllu    55    0.809
#> # ℹ 182,544 more rows
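For readers who want to see how such a wrapper could be put together, here is a rough sketch. `preprocess_sketch()` and its arguments are hypothetical and are not the package's `preprocess_corpus()` implementation; the function simply bundles the manual steps shown earlier. The `df_ratio` column is assumed to be `df` divided by the number of documents, which would be consistent with the values displayed above if the corpus has 68 documents (64 / 68 ≈ 0.941).

```r
library(dplyr)
library(stringr)

# Hypothetical helper, NOT the package's preprocess_corpus()
preprocess_sketch <- function(data, doc, min_docs = 5, max_docs = 65,
                              min_form = 3, min_lemma = 4) {
  # Document frequency per lemma and its share of all documents
  n_docs <- data |> distinct({{ doc }}) |> nrow()
  doc_freq <- data |>
    distinct({{ doc }}, LEMMA) |>
    count(LEMMA, name = "df") |>
    mutate(df_ratio = df / n_docs)

  data |>
    filter(UPOSTAG %in% c("NOUN", "VERB", "ADJ")) |>   # content words only
    filter(!str_detect(FORM, "[0-9]+")) |>             # no digits in FORM
    filter(!str_detect(FEATS, "Foreign=Yes")) |>       # no foreign terms
    left_join(doc_freq, by = "LEMMA") |>
    filter(df >= min_docs, df <= max_docs) |>          # document-frequency band
    filter(nchar(FORM) >= min_form, nchar(LEMMA) >= min_lemma)
}

# Made-up data with the columns the sketch relies on
toy <- tibble(
  kunta   = c("A", "A", "B", "B", "C"),
  FORM    = c("asunnot", "ja", "asunto", "m2", "kaupunki"),
  LEMMA   = c("asunto", "ja", "asunto", "m2", "kaupunki"),
  UPOSTAG = c("NOUN", "CCONJ", "NOUN", "NOUN", "NOUN"),
  FEATS   = c("Case=Nom|Number=Plur", "_", "Case=Nom|Number=Sing", "_",
              "Case=Nom|Number=Sing")
)

result <- preprocess_sketch(toy, doc = kunta, min_docs = 2, max_docs = 3)
```

Passing the thresholds as arguments makes it easy to rerun the pipeline with different settings and compare the results.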

NOTE! Results depend on the preprocessing steps, their details and even the order in which they are carried out.
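A small illustration of why order matters: document frequency can be computed from all tokens, as in the pipeline above, or only from the tokens that survive the part-of-speech filter. The sketch below uses a hypothetical `toy` tibble in which "olla" is tagged AUX in two documents and VERB in one; the helper `doc_freq()` is likewise made up for this example.

```r
library(dplyr)

# Hypothetical tokens, for illustration only
toy <- tibble(
  kunta   = c("A", "B", "C", "A", "B", "C"),
  LEMMA   = c("olla", "olla", "olla", "asunto", "asunto", "kunta"),
  UPOSTAG = c("AUX", "AUX", "VERB", "NOUN", "NOUN", "NOUN")
)

# Hypothetical helper: number of documents each lemma appears in
doc_freq <- function(data) {
  data |> distinct(kunta, LEMMA) |> count(LEMMA, name = "df")
}

# Order 1: document frequency from all tokens, before any filtering
df_all <- doc_freq(toy)

# Order 2: document frequency only from tokens that pass the POS filter
df_filtered <- toy |>
  filter(UPOSTAG %in% c("NOUN", "VERB", "ADJ")) |>
  doc_freq()
```

With an upper limit of two documents, "olla" would be dropped under the first ordering (df = 3) but kept under the second (df = 1), even though the individual steps are the same.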