tokenize() splits a text into a sequence of tokens, using regular expressions to identify them, and returns an object of the class tokens.
Usage
tokenize(
x,
re_drop_line = NULL,
line_glue = NULL,
re_cut_area = NULL,
re_token_splitter = re("[^_\\p{L}\\p{N}\\p{M}'-]+"),
re_token_extractor = re("[_\\p{L}\\p{N}\\p{M}'-]+"),
re_drop_token = NULL,
re_token_transf_in = NULL,
token_transf_out = NULL,
token_to_lower = TRUE,
perl = TRUE,
ngram_size = NULL,
max_skip = 0,
ngram_sep = "_",
ngram_n_open = 0,
ngram_open = "[]"
)
Arguments
- x
Either a character vector or an object of class NLP::TextDocument that contains the text to be tokenized.
- re_drop_line
NULL or character vector. If NULL, it is ignored. Otherwise, a character vector (assumed to be of length 1) containing a regular expression. Lines in x that contain a match for re_drop_line are treated as not belonging to the corpus and are excluded from the results.
- line_glue
NULL or character vector. If NULL, it is ignored. Otherwise, all lines in a corpus file (or in x, if as_text is TRUE) are glued together into one character vector of length 1, with the string line_glue pasted in between consecutive lines. The value of line_glue can also be equal to the empty string "". The 'line glue' operation is conducted immediately after the 'drop line' operation.
- re_cut_area
NULL or character vector. If NULL, it is ignored. Otherwise, all matches in a corpus file (or in x, if as_text is TRUE) are 'cut out' of the text prior to the identification of the tokens in the text (and are therefore not taken into account when identifying the tokens). The 'cut area' operation is conducted immediately after the 'line glue' operation.
- re_token_splitter
Regular expression or NULL. Regular expression that identifies the locations where lines in the corpus files are split into tokens. (See Details.) The 'token identification' operation is conducted immediately after the 'cut area' operation.
- re_token_extractor
Regular expression that identifies the locations of the actual tokens. This argument is only used if re_token_splitter is NULL. (See Details.) The 'token identification' operation is conducted immediately after the 'cut area' operation.
- re_drop_token
Regular expression or NULL. If NULL, it is ignored. Otherwise, it identifies tokens that are to be excluded from the results. Any token that contains a match for re_drop_token is removed from the results. The 'drop token' operation is conducted immediately after the 'token identification' operation.
- re_token_transf_in
Regular expression that identifies areas in the tokens that are to be transformed. This argument works together with the argument token_transf_out.
If both re_token_transf_in and token_transf_out differ from NA, then all matches, in the tokens, for the regular expression re_token_transf_in are replaced with the replacement string token_transf_out.
The 'token transformation' operation is conducted immediately after the 'drop token' operation.
- token_transf_out
Replacement string. This argument works together with re_token_transf_in and is ignored if re_token_transf_in is NULL or NA.
- token_to_lower
Logical. Whether tokens must be converted to lowercase before returning the result. The 'token to lower' operation is conducted immediately after the 'token transformation' operation.
- perl
Logical. Whether the PCRE regular expression flavor is being used in the arguments that contain regular expressions.
- ngram_size
Argument in support of ngrams/skipgrams (see also max_skip).
If one wants to identify individual tokens, the value of ngram_size should be NULL or 1. If one wants to retrieve token ngrams/skipgrams, ngram_size should be an integer indicating the size of the ngrams/skipgrams, e.g. 2 for bigrams, or 3 for trigrams, etc.
- max_skip
Argument in support of skipgrams. This argument is ignored if ngram_size is NULL or is 1.
If ngram_size is 2 or higher, and max_skip is 0, then regular ngrams are being retrieved (albeit that they may contain open slots; see ngram_n_open).
If ngram_size is 2 or higher, and max_skip is 1 or higher, then skipgrams are being retrieved (which in the current implementation cannot contain open slots; see ngram_n_open).
For instance, if ngram_size is 3 and max_skip is 2, then 2-skip trigrams are being retrieved. Or if ngram_size is 5 and max_skip is 3, then 3-skip 5-grams are being retrieved.
- ngram_sep
Character vector of length 1 containing the string that is used to separate/link tokens in the representation of ngrams/skipgrams in the output of this function.
- ngram_n_open
If ngram_size is 2 or higher, and moreover ngram_n_open is a number higher than 0, then ngrams with 'open slots' in them are retrieved. These ngrams with 'open slots' are generalizations of fully lexically specific ngrams (with the generalization being that one or more of the items in the ngram are replaced by a notation that stands for 'any arbitrary token').
For instance, if ngram_size is 4 and ngram_n_open is 1, and if moreover the input contains a 4-gram "it_is_widely_accepted", then the output will contain all modifications of "it_is_widely_accepted" in which one (since ngram_n_open is 1) of the items in this ngram is replaced by an open slot. The first and the last item inside an ngram are never turned into an open slot; only the items in between are candidates for being turned into open slots. Therefore, in the example, the output will contain "it_[]_widely_accepted" and "it_is_[]_accepted".
As a second example, if ngram_size is 5 and ngram_n_open is 2, and if moreover the input contains a 5-gram "it_is_widely_accepted_that", then the output will contain "it_[]_[]_accepted_that", "it_[]_widely_[]_that", and "it_is_[]_[]_that".
- ngram_open
Character string used to represent open slots in ngrams in the output of this function.
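To see how the two token-identification strategies relate, the default re_token_splitter and re_token_extractor patterns from the usage above can be mimicked with base R regular expression tools. This is a simplified illustration only, not the function's actual implementation (it skips the drop-token, transformation, and other steps):

```r
x <- "Once upon a time, there was a tiny toy corpus."

# Splitter strategy: the regex matches the material BETWEEN tokens.
splitter <- "[^_\\p{L}\\p{N}\\p{M}'-]+"
tolower(strsplit(x, splitter, perl = TRUE)[[1]])

# Extractor strategy: the regex matches the tokens themselves.
extractor <- "[_\\p{L}\\p{N}\\p{M}'-]+"
tolower(regmatches(x, gregexpr(extractor, x, perl = TRUE))[[1]])

# Both yield: "once" "upon" "a" "time" "there" "was" "a" "tiny" "toy" "corpus"
```

When both arguments are supplied, re_token_splitter takes precedence; re_token_extractor is only consulted when re_token_splitter is NULL, as documented above.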
Value
An object of class tokens, i.e. a sequence of tokens.
It has a number of attributes and methods such as:
- base print(), as_data_frame(), summary() (which returns the number of items), sort() and rev(),
- an interactive explore() method,
- some getters, namely n_tokens() and n_types(),
- subsetting methods such as keep_types(), keep_pos(), etc., including [] subsetting (see brackets).
Additional manipulation functions include the trunc_at() method to truncate the sequence,
tokens_merge() and tokens_merge_all() to combine token lists, and an
as_character() method to convert to a character vector.
Objects of class tokens can be saved to file with write_tokens();
these files can be read with read_tokens().
Details
If the output contains ngrams with open slots, then the order of the items
in the output is no longer meaningful. For instance, let's imagine a case
where ngram_size is 5 and ngram_n_open is 2.
If the input contains a 5-gram "it_is_widely_accepted_that", then the output
will contain "it_[]_[]_accepted_that", "it_[]_widely_[]_that" and
"it_is_[]_[]_that". The relative order of these three items in the output
must be considered arbitrary.
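The open-slot behaviour described above (interior items replaced, first and last items kept) can be sketched as a small base R helper. This is a hypothetical re-implementation for illustration only; the package's actual algorithm may differ:

```r
# Generate all open-slot variants of one ngram (hypothetical helper).
# Assumes length(items) >= 3, since only interior items may become open slots.
open_slot_ngrams <- function(items, n_open, open = "[]", sep = "_") {
  inner <- seq(2, length(items) - 1)  # first and last item are never opened
  combos <- combn(inner, n_open, simplify = FALSE)
  vapply(combos, function(idx) {
    out <- items
    out[idx] <- open
    paste(out, collapse = sep)
  }, character(1))
}

open_slot_ngrams(c("it", "is", "widely", "accepted", "that"), n_open = 2)
#> "it_[]_[]_accepted_that" "it_[]_widely_[]_that" "it_is_[]_[]_that"
```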
Examples
toy_corpus <- "Once upon a time there was a tiny toy corpus.
It consisted of three sentences. And it lived happily ever after."
tks <- tokenize(toy_corpus)
print(tks, n = 1000)
#> Token sequence of length 21
#> idx token
#> --- ---------
#> 1 once
#> 2 upon
#> 3 a
#> 4 time
#> 5 there
#> 6 was
#> 7 a
#> 8 tiny
#> 9 toy
#> 10 corpus
#> 11 it
#> 12 consisted
#> 13 of
#> 14 three
#> 15 sentences
#> 16 and
#> 17 it
#> 18 lived
#> 19 happily
#> 20 ever
#> 21 after
tks <- tokenize(toy_corpus, re_token_splitter = "\\W+")
print(tks, n = 1000)
#> Token sequence of length 21
#> idx token
#> --- ---------
#> 1 once
#> 2 upon
#> 3 a
#> 4 time
#> 5 there
#> 6 was
#> 7 a
#> 8 tiny
#> 9 toy
#> 10 corpus
#> 11 it
#> 12 consisted
#> 13 of
#> 14 three
#> 15 sentences
#> 16 and
#> 17 it
#> 18 lived
#> 19 happily
#> 20 ever
#> 21 after
sort(tks)
#> Token sequence of length 21
#> idx token
#> --- ---------
#> 1 a
#> 2 a
#> 3 after
#> 4 and
#> 5 consisted
#> 6 corpus
#> 7 ever
#> 8 happily
#> 9 it
#> 10 it
#> 11 lived
#> 12 of
#> 13 once
#> 14 sentences
#> 15 there
#> 16 three
#> 17 time
#> 18 tiny
#> 19 toy
#> 20 upon
#> ...
#>
summary(tks)
#> Token sequence of length 21
tokenize(toy_corpus, ngram_size = 3)
#> Token sequence of length 19
#> idx token
#> --- -------------------
#> 1 once_upon_a
#> 2 upon_a_time
#> 3 a_time_there
#> 4 time_there_was
#> 5 there_was_a
#> 6 was_a_tiny
#> 7 a_tiny_toy
#> 8 tiny_toy_corpus
#> 9 toy_corpus_it
#> 10 corpus_it_consisted
#> 11 it_consisted_of
#> 12 consisted_of_three
#> 13 of_three_sentences
#> 14 three_sentences_and
#> 15 sentences_and_it
#> 16 and_it_lived
#> 17 it_lived_happily
#> 18 lived_happily_ever
#> 19 happily_ever_after
tokenize(toy_corpus, ngram_size = 3, max_skip = 2)
#> Token sequence of length 106
#> idx token
#> --- ---------------
#> 1 once_upon_a
#> 2 once_upon_time
#> 3 once_upon_there
#> 4 once_a_time
#> 5 once_a_there
#> 6 once_time_there
#> 7 upon_a_time
#> 8 upon_a_there
#> 9 upon_a_was
#> 10 upon_time_there
#> 11 upon_time_was
#> 12 upon_there_was
#> 13 a_time_there
#> 14 a_time_was
#> 15 a_time_a
#> 16 a_there_was
#> 17 a_there_a
#> 18 a_was_a
#> 19 time_there_was
#> 20 time_there_a
#> ...
#>
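The skipgram enumeration shown above can be sketched in base R: from each starting token, choose the remaining ngram_size - 1 items from the next ngram_size - 1 + max_skip positions. A hypothetical re-implementation for illustration only (the actual algorithm may differ):

```r
# Enumerate skipgrams over a token vector (hypothetical helper).
skipgrams <- function(tokens, n, max_skip, sep = "_") {
  out <- character(0)
  for (i in seq_along(tokens)) {
    upper <- min(i + n - 1 + max_skip, length(tokens))
    if (upper - i < n - 1) next  # not enough following tokens
    combos <- combn(seq(i + 1, upper), n - 1, simplify = FALSE)
    out <- c(out, vapply(combos, function(idx)
      paste(tokens[c(i, idx)], collapse = sep), character(1)))
  }
  out
}

toks <- c("once", "upon", "a", "time", "there", "was", "a", "tiny", "toy",
          "corpus", "it", "consisted", "of", "three", "sentences", "and",
          "it", "lived", "happily", "ever", "after")
res <- skipgrams(toks, n = 3, max_skip = 2)
length(res)   # 106, matching the sequence length reported above
head(res, 3)  # "once_upon_a" "once_upon_time" "once_upon_there"
```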
tokenize(toy_corpus, ngram_size = 3, ngram_n_open = 1)
#> Token sequence of length 19
#> idx token
#> --- -------------------
#> 1 once_[]_a
#> 2 upon_[]_time
#> 3 a_[]_there
#> 4 time_[]_was
#> 5 there_[]_a
#> 6 was_[]_tiny
#> 7 a_[]_toy
#> 8 tiny_[]_corpus
#> 9 toy_[]_it
#> 10 corpus_[]_consisted
#> 11 it_[]_of
#> 12 consisted_[]_three
#> 13 of_[]_sentences
#> 14 three_[]_and
#> 15 sentences_[]_it
#> 16 and_[]_lived
#> 17 it_[]_happily
#> 18 lived_[]_ever
#> 19 happily_[]_after