Examples of preprocessing text data in a data frame with dplyr and
stringr. Many of the preprocessing steps introduced here can also be
carried out after conversion to a document-term matrix with the
quanteda package, which is presented in the next articles.
The aim is to filter out uninteresting and potentially false terms.
Background for preprocessing text data:
* Ristilä A. & K. Elo (2023). Observing political and societal
changes in Finnish parliamentary speech data, 1980–2010, with topic
modelling. Parliaments, Estates and Representation, 43:2,
149–176.
* Denny M. J. & A. Spirling (2018). Text Preprocessing For
Unsupervised Learning: Why It Matters, When It Misleads, And What To Do
About It. Political Analysis, 26:2, 168–189.
Common preprocessing steps include:
1. Select only nouns, verbs and adjectives
2. Remove words that contain numbers
3. Filter out foreign terms
4. Drop terms that appear in less than 0.5% of documents (here: keep
terms that appear in at least 5 documents)
5. Drop terms that appear in more than 99% of documents (here: keep
terms that appear in at most 65 documents)
6. Additionally, we drop very short terms (fewer than 3 characters in the
original FORM or fewer than 4 characters in the base LEMMA)
The original data contains 451,660 terms in total:
aspol |> count(LEMMA, sort = TRUE)
#> # A tibble: 34,935 × 2
#> LEMMA n
#> <chr> <int>
#> 1 . 26183
#> 2 ja 17283
#> 3 olla 16098
#> 4 , 15519
#> 5 ! 8029
#> 6 kaupunki 4081
#> 7 ) 4029
#> 8 asunto 3954
#> 9 vuosi 3685
#> 10 ( 3408
#> # ℹ 34,925 more rows
Let’s take a closer look at what we throw away while carrying out our preprocessing pipeline.
Keeping only nouns, verbs and adjectives drops the following terms:
aspol |>
  filter(!UPOSTAG %in% c("NOUN", "VERB", "ADJ")) |> # Note the negation operator `!` here
  count(LEMMA, sort = TRUE)
#> # A tibble: 16,812 × 2
#> LEMMA n
#> <chr> <int>
#> 1 . 25866
#> 2 ja 17278
#> 3 , 15519
#> 4 olla 14602
#> 5 ! 8028
#> 6 ) 3468
#> 7 ( 3407
#> 8 % 3066
#> 9 joka 2710
#> 10 sekä 2307
#> # ℹ 16,802 more rows
Now we keep only nouns, verbs and adjectives. Which of those words contain numbers (at least in their original FORM)?
aspol |>
  filter(UPOSTAG %in% c("NOUN", "VERB", "ADJ")) |>
  filter(str_detect(FORM, "[0-9]+")) |>
  count(LEMMA, sort = TRUE)
#> # A tibble: 594 × 2
#> LEMMA n
#> <chr> <int>
#> 1 1. 124
#> 2 2000#luku 110
#> 3 2010#luku 107
#> 4 m2 90
#> 5 2. 87
#> 6 3. 87
#> 7 pl 1 85
#> 8 2030#luku 66
#> 9 4. 56
#> 10 menrer#mammen 42
#> # ℹ 584 more rows
Note: the FORM m2 gets the LEMMA menrer#mammen!?
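The reason for checking the original FORM rather than the LEMMA can be shown with a minimal sketch. The m2 pair is taken from the output above; the other two tokens are made-up examples, and the vector names are only for illustration.

```r
library(stringr)

# FORM/LEMMA pairs: only m2 -> menrer#mammen comes from the output above,
# the remaining values are hypothetical
FORM  <- c("m2", "asunto", "2030-luvulla")
LEMMA <- c("menrer#mammen", "asunto", "2030#luku")

# TRUE where the string contains at least one digit
digit_in_form  <- str_detect(FORM, "[0-9]+")
digit_in_lemma <- str_detect(LEMMA, "[0-9]+")

data.frame(FORM, LEMMA, digit_in_form, digit_in_lemma)
```

In this sketch the first token is caught only when the pattern is applied to FORM, because its LEMMA no longer contains a digit.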
Nouns, verbs or adjectives that contain no numbers but are marked as foreign terms. These are mostly nonsense here. We have already got rid of the real foreign terms by keeping only nouns, verbs and adjectives, since foreign terms have UPOSTAG == "X".
aspol |>
  filter(UPOSTAG %in% c("NOUN", "VERB", "ADJ")) |>
  filter(!str_detect(FORM, "[0-9]+")) |> # Note the negation operator `!`
  filter(str_detect(FEATS, "Foreign=Yes")) |>
  count(LEMMA, sort = TRUE)
#> # A tibble: 14 × 2
#> LEMMA n
#> <chr> <int>
#> 1 aa 3
#> 2 ee 2
#> 3 henk 2
#> 4 a 1
#> 5 as 1
#> 6 ea 1
#> 7 eee 1
#> 8 k. 1
#> 9 le 1
#> 10 mer 1
#> 11 pare 1
#> 12 s 1
#> 13 to 1
#> 14 ‘ 1
Very common terms in terms of document frequency:
NOTE! Document frequency is easy to get with the following dplyr functions:
aspol |> distinct(kunta, LEMMA) |> count(LEMMA, name = "doc_freq")
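A minimal sketch on made-up data shows how document frequency differs from plain term frequency. The toy table and its values are assumptions for illustration; only the column names kunta and LEMMA follow the data used here.

```r
library(dplyr)

# Hypothetical toy corpus: one row per token, kunta identifies the document
toy <- tibble(
  kunta = c("A", "A", "A", "B", "B", "C"),
  LEMMA = c("asunto", "asunto", "vuosi", "asunto", "kaupunki", "asunto")
)

# Term frequency: number of rows (tokens) per lemma
term_freq <- toy |> count(LEMMA, name = "n")

# Document frequency: number of distinct documents in which a lemma occurs
doc_freq <- toy |>
  distinct(kunta, LEMMA) |>
  count(LEMMA, name = "doc_freq")

term_freq
doc_freq
```

Here asunto occurs four times but in three documents, so its term frequency is 4 and its document frequency is 3.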
aspol |>
  filter(UPOSTAG %in% c("NOUN", "VERB", "ADJ")) |>
  filter(!str_detect(FORM, "[0-9]+")) |>
  filter(!str_detect(FEATS, "Foreign=Yes")) |>
  left_join(
    # Add document frequency to the table with dplyr functions
    aspol |> distinct(kunta, LEMMA) |> count(LEMMA, name = "df")
  ) |>
  filter(df > 65) |>
  count(LEMMA, sort = TRUE)
#> Joining with `by = join_by(LEMMA)`
#> # A tibble: 9 × 2
#> LEMMA n
#> <chr> <int>
#> 1 asunto 3954
#> 2 vuosi 3683
#> 3 uusi 1505
#> 4 olla 1496
#> 5 ) 561
#> 6 . 317
#> 7 se 11
#> 8 ja 5
#> 9 ( 1
Very uncommon terms in terms of document frequency:
aspol |>
  filter(UPOSTAG %in% c("NOUN", "VERB", "ADJ")) |>
  filter(!str_detect(FORM, "[0-9]+")) |>
  filter(!str_detect(FEATS, "Foreign=Yes")) |>
  left_join(
    aspol |> distinct(kunta, LEMMA) |> count(LEMMA, name = "df")
  ) |>
  filter(df < 5) |>
  count(LEMMA, sort = TRUE)
#> Joining with `by = join_by(LEMMA)`
#> # A tibble: 14,828 × 2
#> LEMMA n
#> <chr> <int>
#> 1 V 100
#> 2 kuusamo 72
#> 3 mikkelinen 55
#> 4 R 53
#> 5 W 50
#> 6 maan#vuokra#sopimus 49
#> 7 kehys#alue 47
#> 8 kaupungin#kanslia 46
#> 9 S 43
#> 10 joukko#liikenne#kaupunki 43
#> # ℹ 14,818 more rows
Nouns, verbs or adjectives with numbers removed and foreign (or nonsense) terms removed, but with very short terms still present (shown here: terms that are short in both FORM and LEMMA):
aspol |>
  filter(UPOSTAG %in% c("NOUN", "VERB", "ADJ")) |>
  filter(!str_detect(FORM, "[0-9]+")) |>
  filter(!str_detect(FEATS, "Foreign=Yes")) |>
  left_join(
    aspol |> distinct(kunta, LEMMA) |> count(LEMMA, name = "df")
  ) |>
  filter(df >= 5, df <= 65) |>
  filter(nchar(FORM) < 3, nchar(LEMMA) < 4) |>
  count(LEMMA, sort = TRUE)
#> Joining with `by = join_by(LEMMA)`
#> # A tibble: 65 × 2
#> LEMMA n
#> <chr> <int>
#> 1 oy 452
#> 2 o 286
#> 3 a 159
#> 4 v. 135
#> 5 x 119
#> 6 / 99
#> 7 I 92
#> 8 .. 61
#> 9 - 53
#> 10 A 51
#> # ℹ 55 more rows
Complete preprocessing pipeline:
aspol |>
  filter(UPOSTAG %in% c("NOUN", "VERB", "ADJ")) |>
  filter(!str_detect(FORM, "[0-9]+")) |>
  filter(!str_detect(FEATS, "Foreign=Yes")) |>
  left_join(
    aspol |> distinct(kunta, LEMMA) |> count(LEMMA, name = "df")
  ) |>
  filter(df >= 5, df <= 65) |>
  filter(nchar(FORM) >= 3, nchar(LEMMA) >= 4)
#> Joining with `by = join_by(LEMMA)`
#> # A tibble: 182,666 × 14
#> kunta sent ID FORM LEMMA UPOSTAG XPOSTAG FEATS HEAD DEPREL DEPS MISC doc df
#> <chr> <int> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <int>
#> 1 Enontekiö 4 1 "KUNTA" kunta NOUN _ Case=Nom|Number=Sing 0 root _ "SpacesAfter=\\r\\n\\r\\n" enontekiö_2017-2025.conllu 64
#> 2 Enontekiö 5 1 "VUOKRA-ASUMISEN" vuokra#asuminen NOUN _ Case=Gen|Derivation=Minen|Number=Sing 2 nmod:poss _ "SpacesAfter=\\r\\n" enontekiö_2017-2025.conllu 43
#> 3 Enontekiö 5 2 "KEHITTÄMISSUUNNITELMA" kehittämis#suunnitelma NOUN _ Case=Nom|Number=Sing 0 root _ "SpacesAfter=\\r\\n" enontekiö_2017-2025.conllu 17
#> 4 Enontekiö 19 2 "tehemä" tehdä VERB _ Mood=Ind|Number=Sing|Person=3|Style=Coll|Tense=Past|VerbForm=Fin|Voice=Act 0 root _ "_" enontekiö_2017-2025.conllu 63
#> 5 Enontekiö 21 1 "\fSisällysluettelo" sisällys#luettelo NOUN _ Case=Nom|Number=Sing 0 root _ "SpacesAfter=\\r\\n" enontekiö_2017-2025.conllu 16
#> 6 Enontekiö 21 3 "Johdanto" johdanto NOUN _ Case=Nom|Number=Sing 1 appos _ "_" enontekiö_2017-2025.conllu 34
#> 7 Enontekiö 24 2 "Väestö" väestö NOUN _ Case=Nom|Number=Sing 0 root _ "SpaceAfter=No" enontekiö_2017-2025.conllu 54
#> 8 Enontekiö 26 1 "Elinkeinoelämä" elin#keino#elämä NOUN _ Case=Nom|Number=Sing 0 root _ "_" enontekiö_2017-2025.conllu 33
#> 9 Enontekiö 26 3 "työpaikat" työ#paikka NOUN _ Case=Nom|Number=Plur 1 conj _ "SpaceAfter=No" enontekiö_2017-2025.conllu 47
#> 10 Enontekiö 38 1 "Kiinteistöyhtiö" kiinteistö#yhtiö NOUN _ Case=Nom|Number=Sing 4 nsubj:cop _ "_" enontekiö_2017-2025.conllu 12
#> 11 Enontekiö 38 3 "kunnan" kunta NOUN _ Case=Gen|Number=Sing 4 nmod:poss _ "_" enontekiö_2017-2025.conllu 64
#> 12 Enontekiö 39 1 "Taloustilanne" talous#tilanne NOUN _ Case=Nom|Number=Sing 0 root _ "_" enontekiö_2017-2025.conllu 11
#> 13 Enontekiö 41 1 "Tulevaisuuden" tulevaisuus NOUN _ Case=Gen|Number=Sing 2 nmod:poss _ "_" enontekiö_2017-2025.conllu 56
#> 14 Enontekiö 42 1 "Kiinteistöyhtiö" kiinteistö#yhtiö NOUN _ Case=Nom|Number=Sing 2 compound:nn _ "_" enontekiö_2017-2025.conllu 12
#> 15 Enontekiö 43 2 "teette" tehdä VERB _ Mood=Ind|Number=Plur|Person=2|Tense=Pres|VerbForm=Fin|Voice=Act 0 root _ "_" enontekiö_2017-2025.conllu 63
#> 16 Enontekiö 44 1 "Taloustilanne" talous#tilanne NOUN _ Case=Nom|Number=Sing 10 nsubj _ "_" enontekiö_2017-2025.conllu 11
#> 17 Enontekiö 46 1 "Tulevaisuuden" tulevaisuus NOUN _ Case=Gen|Number=Sing 2 nmod:poss _ "_" enontekiö_2017-2025.conllu 56
#> 18 Enontekiö 46 2 "näkymät" näkymä NOUN _ Case=Nom|Number=Plur 0 root _ "SpaceAfter=No" enontekiö_2017-2025.conllu 16
#> 19 Enontekiö 48 1 "Kunnan" kunta NOUN _ Case=Gen|Number=Sing 2 nsubj _ "_" enontekiö_2017-2025.conllu 64
#> 20 Enontekiö 48 2 "omistamat" omistaa VERB _ Case=Nom|Degree=Pos|Number=Plur|PartForm=Agt|VerbForm=Part|Voice=Act 4 acl _ "_" enontekiö_2017-2025.conllu 55
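#> # ℹ 182,646 more rows
The complete preprocessing pipeline as a single function. It applies the same steps as above but is easier to type. Note that its result has slightly fewer rows (182,564 vs. 182,666) and an additional df_ratio column.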
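A wrapper of this kind could be sketched roughly as follows. This is not the implementation of preprocess_corpus(): the name preprocess_sketch(), the arguments min_df and max_df, and the toy data are hypothetical, and the sketch simply bundles the pipeline steps shown earlier.

```r
library(dplyr)
library(stringr)

# Hypothetical wrapper around the manual pipeline (not the package function)
preprocess_sketch <- function(data, doc, min_df = 5, max_df = 65) {
  # Document frequency per lemma, computed from the unfiltered data
  doc_freq <- data |>
    distinct({{ doc }}, LEMMA) |>
    count(LEMMA, name = "df")

  data |>
    filter(UPOSTAG %in% c("NOUN", "VERB", "ADJ")) |>
    filter(!str_detect(FORM, "[0-9]+")) |>
    filter(!str_detect(FEATS, "Foreign=Yes")) |>
    left_join(doc_freq, by = "LEMMA") |>
    filter(df >= min_df, df <= max_df) |>
    filter(nchar(FORM) >= 3, nchar(LEMMA) >= 4)
}

# Made-up toy data with two documents, A and B
toy <- tibble(
  kunta   = c("A", "B", "A", "A", "B", "A", "B", "A"),
  FORM    = c("asunnot", "asunnon", "m2", "ja", "aa", "oy", "oy", "kaavoitus"),
  LEMMA   = c("asunto", "asunto", "m2", "ja", "aa", "oy", "oy", "kaavoitus"),
  UPOSTAG = c("NOUN", "NOUN", "NOUN", "CCONJ", "NOUN", "NOUN", "NOUN", "NOUN"),
  FEATS   = c("Case=Nom", "Case=Gen", "_", "_", "Foreign=Yes", "_", "_", "Case=Nom")
)

# Thresholds adapted to the two-document toy corpus
result <- preprocess_sketch(toy, doc = kunta, min_df = 2, max_df = 2)
result
```

On the toy data only the two asunto rows survive. The call to the actual function on aspol follows.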
aspol |> preprocess_corpus(doc = kunta)
#> # A tibble: 182,564 × 15
#> kunta sent ID FORM LEMMA UPOSTAG XPOSTAG FEATS HEAD DEPREL DEPS MISC doc df df_ratio
#> <chr> <int> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <int> <dbl>
#> 1 Enontekiö 4 1 "KUNTA" kunta NOUN _ Case=Nom|Number=Sing 0 root _ "SpacesAfter=\\r\\n\\r\\n" enontekiö_2017-2025.conllu 64 0.941
#> 2 Enontekiö 5 1 "VUOKRA-ASUMISEN" vuokra#asuminen NOUN _ Case=Gen|Derivation=Minen|Number=Sing 2 nmod:poss _ "SpacesAfter=\\r\\n" enontekiö_2017-2025.conllu 43 0.632
#> 3 Enontekiö 5 2 "KEHITTÄMISSUUNNITELMA" kehittämis#suunnitelma NOUN _ Case=Nom|Number=Sing 0 root _ "SpacesAfter=\\r\\n" enontekiö_2017-2025.conllu 17 0.25
#> 4 Enontekiö 19 2 "tehemä" tehdä VERB _ Mood=Ind|Number=Sing|Person=3|Style=Coll|Tense=Past|VerbForm=Fin|Voice=Act 0 root _ "_" enontekiö_2017-2025.conllu 63 0.926
#> 5 Enontekiö 21 1 "\fSisällysluettelo" sisällys#luettelo NOUN _ Case=Nom|Number=Sing 0 root _ "SpacesAfter=\\r\\n" enontekiö_2017-2025.conllu 16 0.235
#> 6 Enontekiö 21 3 "Johdanto" johdanto NOUN _ Case=Nom|Number=Sing 1 appos _ "_" enontekiö_2017-2025.conllu 34 0.5
#> 7 Enontekiö 24 2 "Väestö" väestö NOUN _ Case=Nom|Number=Sing 0 root _ "SpaceAfter=No" enontekiö_2017-2025.conllu 54 0.794
#> 8 Enontekiö 26 1 "Elinkeinoelämä" elin#keino#elämä NOUN _ Case=Nom|Number=Sing 0 root _ "_" enontekiö_2017-2025.conllu 33 0.485
#> 9 Enontekiö 26 3 "työpaikat" työ#paikka NOUN _ Case=Nom|Number=Plur 1 conj _ "SpaceAfter=No" enontekiö_2017-2025.conllu 47 0.691
#> 10 Enontekiö 38 1 "Kiinteistöyhtiö" kiinteistö#yhtiö NOUN _ Case=Nom|Number=Sing 4 nsubj:cop _ "_" enontekiö_2017-2025.conllu 12 0.176
#> 11 Enontekiö 38 3 "kunnan" kunta NOUN _ Case=Gen|Number=Sing 4 nmod:poss _ "_" enontekiö_2017-2025.conllu 64 0.941
#> 12 Enontekiö 39 1 "Taloustilanne" talous#tilanne NOUN _ Case=Nom|Number=Sing 0 root _ "_" enontekiö_2017-2025.conllu 11 0.162
#> 13 Enontekiö 41 1 "Tulevaisuuden" tulevaisuus NOUN _ Case=Gen|Number=Sing 2 nmod:poss _ "_" enontekiö_2017-2025.conllu 56 0.824
#> 14 Enontekiö 42 1 "Kiinteistöyhtiö" kiinteistö#yhtiö NOUN _ Case=Nom|Number=Sing 2 compound:nn _ "_" enontekiö_2017-2025.conllu 12 0.176
#> 15 Enontekiö 43 2 "teette" tehdä VERB _ Mood=Ind|Number=Plur|Person=2|Tense=Pres|VerbForm=Fin|Voice=Act 0 root _ "_" enontekiö_2017-2025.conllu 63 0.926
#> 16 Enontekiö 44 1 "Taloustilanne" talous#tilanne NOUN _ Case=Nom|Number=Sing 10 nsubj _ "_" enontekiö_2017-2025.conllu 11 0.162
#> 17 Enontekiö 46 1 "Tulevaisuuden" tulevaisuus NOUN _ Case=Gen|Number=Sing 2 nmod:poss _ "_" enontekiö_2017-2025.conllu 56 0.824
#> 18 Enontekiö 46 2 "näkymät" näkymä NOUN _ Case=Nom|Number=Plur 0 root _ "SpaceAfter=No" enontekiö_2017-2025.conllu 16 0.235
#> 19 Enontekiö 48 1 "Kunnan" kunta NOUN _ Case=Gen|Number=Sing 2 nsubj _ "_" enontekiö_2017-2025.conllu 64 0.941
#> 20 Enontekiö 48 2 "omistamat" omistaa VERB _ Case=Nom|Degree=Pos|Number=Plur|PartForm=Agt|VerbForm=Part|Voice=Act 4 acl _ "_" enontekiö_2017-2025.conllu 55 0.809
#> # ℹ 182,544 more rows
NOTE! Results depend on the preprocessing steps, their details and even the order in which they are carried out.
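The effect of step order can be illustrated with a small sketch on made-up data. The toy table, the threshold of 2 documents and the object names are assumptions for illustration. The first variant computes document frequency from the full data, as the pipelines above do; the second computes it only after the part-of-speech filter.

```r
library(dplyr)

# Hypothetical toy corpus: olla is tagged AUX in two documents and VERB in one
toy <- tibble(
  kunta   = c("A", "B", "C", "A", "B"),
  LEMMA   = c("olla", "olla", "olla", "asunto", "asunto"),
  UPOSTAG = c("AUX", "AUX", "VERB", "NOUN", "NOUN")
)

keep_pos <- c("NOUN", "VERB", "ADJ")

# Order 1: document frequency from the full data, then the POS filter
df_first <- toy |>
  left_join(
    toy |> distinct(kunta, LEMMA) |> count(LEMMA, name = "df"),
    by = "LEMMA"
  ) |>
  filter(UPOSTAG %in% keep_pos, df >= 2)

# Order 2: POS filter first, document frequency from the filtered data
filtered <- toy |> filter(UPOSTAG %in% keep_pos)
pos_first <- filtered |>
  left_join(
    filtered |> distinct(kunta, LEMMA) |> count(LEMMA, name = "df"),
    by = "LEMMA"
  ) |>
  filter(df >= 2)

df_first$LEMMA
pos_first$LEMMA
```

In the first variant the single VERB occurrence of olla is kept, because the lemma appears in three documents when all tags are counted. In the second variant it is dropped, because only one document contains it after the part-of-speech filter.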