Convenience functions in support of regular expressions

These functions are essentially simple wrappers around base R functions such as regexpr(), gregexpr(), grepl(), grep(), sub() and gsub(). The most important differences between the functions documented here and their base R counterparts are the order of the arguments (x comes before pattern) and the fact that the argument perl is set to TRUE by default.
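As a rough illustration of what "simple wrapper" means here, the sketch below builds a hypothetical helper, has_matches_sketch(), around grepl() using base R only. The helper's name and body are assumptions made for illustration; they are not taken from the package source.

```r
# Hypothetical wrapper, for illustration only (not the package's source code).
# It puts x before pattern and switches the default of perl to TRUE.
has_matches_sketch <- function(x, pattern, ignore.case = FALSE, perl = TRUE,
                               fixed = FALSE, useBytes = FALSE) {
  grepl(pattern, x, ignore.case = ignore.case, perl = perl,
        fixed = fixed, useBytes = useBytes)
}

x <- c("sentence", "with", "words")

# The lookahead (?=r) is PCRE syntax, so it relies on perl = TRUE.
res_wrapper <- has_matches_sketch(x, "o(?=r)")

# The equivalent base R call: pattern first, perl = TRUE spelled out.
res_base <- grepl("o(?=r)", x, perl = TRUE)

print(res_wrapper)
```

Putting x first is mainly a convenience when the data is the natural first argument, for instance when chaining calls with a pipe.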

Usage

re_retrieve_first(
  x,
  pattern,
  ignore.case = FALSE,
  perl = TRUE,
  fixed = FALSE,
  useBytes = FALSE,
  requested_group = NULL,
  drop_NA = FALSE,
  ...
)

re_retrieve_last(
  x,
  pattern,
  ignore.case = FALSE,
  perl = TRUE,
  fixed = FALSE,
  useBytes = FALSE,
  requested_group = NULL,
  drop_NA = FALSE,
  ...
)

re_retrieve_all(
  x,
  pattern,
  ignore.case = FALSE,
  perl = TRUE,
  fixed = FALSE,
  useBytes = FALSE,
  requested_group = NULL,
  unlist = TRUE,
  ...
)

re_has_matches(
  x,
  pattern,
  ignore.case = FALSE,
  perl = TRUE,
  fixed = FALSE,
  useBytes = FALSE,
  ...
)

re_which(
  x,
  pattern,
  ignore.case = FALSE,
  perl = TRUE,
  fixed = FALSE,
  useBytes = FALSE,
  ...
)

re_replace_first(
  x,
  pattern,
  replacement,
  ignore.case = FALSE,
  perl = TRUE,
  fixed = FALSE,
  useBytes = FALSE,
  ...
)

re_replace_all(
  x,
  pattern,
  replacement,
  ignore.case = FALSE,
  perl = TRUE,
  fixed = FALSE,
  useBytes = FALSE,
  ...
)

Arguments

x: Character vector to be searched or modified.
pattern: Regular expression specifying what is to be searched for.
ignore.case: Logical. Should the search be case insensitive?
perl: Logical. Whether pattern is interpreted as a PCRE (Perl-compatible) regular expression. Unlike in base R functions, the default is TRUE.
fixed: Logical. If TRUE, pattern is a string to be matched as is, i.e. wildcards and special characters are not interpreted.
useBytes: Logical. If TRUE the matching is done byte-by-byte rather than character-by-character. See 'Details' in grep().
requested_group: Numeric. If NULL or 0, the output will contain matches for pattern as a whole. If another number n is provided, then the output will not contain matches for pattern but instead will only contain the matches for the nth capturing group in pattern (the first if requested_group = 1, the second if requested_group = 2...).
drop_NA: Logical. If FALSE, the output always has the same length as the input x and items that do not contain a match for pattern yield NA. If TRUE, such NA values are removed and therefore the result might contain fewer items than x.
...: Additional arguments.
unlist: Logical. If FALSE, the output always has the same length as the input x. More specifically, the result will be a list in which input items that do not contain a match for pattern yield an empty vector, whereas input items that do match will yield a vector of at least length one (depending on the number of matches). If TRUE, the output is a single vector the length of which may be shorter or longer than x.
replacement: Character vector of length one specifying the replacement string. It is to be taken literally, except that the notation \\1, \\2, etc. can be used to refer to groups in pattern.
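
The behaviour described for requested_group and drop_NA can be approximated in base R, which may help clarify what these arguments do. The following is a minimal sketch under the assumption that only the first match per item is wanted; the variable names are illustrative and the code is not the package's implementation.

```r
x <- c("this", "sentence", "couple")
pattern <- "[oe](.)(.)"

# regexpr() locates the first match in each item (-1 means no match).
m <- regexpr(pattern, x, perl = TRUE)
hit <- as.vector(m) > 0

# Whole-pattern matches, keeping NA for items without a match
# (comparable to requested_group = NULL, drop_NA = FALSE).
whole <- rep(NA_character_, length(x))
whole[hit] <- regmatches(x, m)

# First capturing group, located through the "capture.start" and
# "capture.length" attributes that regexpr() sets when perl = TRUE
# (comparable to requested_group = 1).
st  <- attr(m, "capture.start")[, 1]
len <- attr(m, "capture.length")[, 1]
group1 <- ifelse(hit, substring(x, st, st + len - 1), NA_character_)

# Removing the NA values mimics drop_NA = TRUE.
whole_no_na <- whole[!is.na(whole)]

print(whole)
print(group1)
print(whole_no_na)
```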

Value

re_retrieve_first(), re_retrieve_last() and re_retrieve_all() return either a single vector of character data or a list containing such vectors. re_replace_first() and re_replace_all() return the same type of character vector as x.

re_has_matches() returns a logical vector indicating whether a match was found in each of the elements in x; re_which() returns a numeric vector indicating the indices of the elements of x for which a match was found.
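
The difference between the two return shapes can be seen with the underlying base R functions. This is a small sketch using grepl() and grep() directly, with an input vector chosen for illustration.

```r
x <- c("this", "sentence", "couple", "of", "words")
pattern <- "[oe](.)(.)"

# One logical value per element of x.
hits <- grepl(pattern, x, perl = TRUE)

# Only the positions of the matching elements.
idx <- grep(pattern, x, perl = TRUE)

print(hits)
print(idx)
```

The index vector carries the same information as which() applied to the logical vector.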

Details

For details on some of the arguments (e.g. perl, fixed), see base R's regex documentation (?regex).

Functions

re_retrieve_first(): Retrieve from each item in x the first match of pattern.
re_retrieve_last(): Retrieve from each item in x the last match of pattern.
re_retrieve_all(): Retrieve from each item in x all matches of pattern.
re_has_matches(): Simple wrapper around grepl().
re_which(): Simple wrapper around grep().
re_replace_first(): Simple wrapper around sub().
re_replace_all(): Simple wrapper around gsub().
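
For readers who want to relate these functions to base R, the sketch below reproduces the "all matches", "last match" and "first versus all replacements" behaviours with gregexpr(), regmatches(), sub() and gsub(). It is an approximation under the assumption of a plain character vector as input, not the package's own code.

```r
x <- c("a", "sentence", "couple")
pattern <- "[oe](.)(.)"

# All matches per item, as a list with one entry per element of x
# (comparable to unlist = FALSE).
all_list <- regmatches(x, gregexpr(pattern, x, perl = TRUE))

# Flattened into a single vector (comparable to unlist = TRUE).
all_flat <- unlist(all_list)

# Last match per item, with NA where there is no match.
last <- vapply(
  all_list,
  function(m) if (length(m) > 0) m[length(m)] else NA_character_,
  character(1)
)

# sub() replaces only the first match in each item; gsub() replaces all.
first_only <- sub("([oe].)", "{\\1}", x, perl = TRUE)
every      <- gsub("([oe].)", "{\\1}", x, perl = TRUE)

print(all_flat)
print(last)
print(first_only)
print(every)
```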

Examples

x <- tokenize("This is a sentence with a couple of words in it.")
pattern <- "[oe](.)(.)"

re_retrieve_first(x, pattern)
#>  [1] NA    NA    NA    "ent" NA    NA    "oup" NA    "ord" NA    NA   
re_retrieve_first(x, pattern, drop_NA = TRUE)
#> [1] "ent" "oup" "ord"
re_retrieve_first(x, pattern, requested_group = 1)
#>  [1] NA  NA  NA  "n" NA  NA  "u" NA  "r" NA  NA 
re_retrieve_first(x, pattern, drop_NA = TRUE, requested_group = 1)
#> [1] "n" "u" "r"
re_retrieve_first(x, pattern, requested_group = 2)
#>  [1] NA  NA  NA  "t" NA  NA  "p" NA  "d" NA  NA 

re_retrieve_last(x, pattern)
#>  [1] NA    NA    NA    "enc" NA    NA    "oup" NA    "ord" NA    NA   
re_retrieve_last(x, pattern, drop_NA = TRUE)
#> [1] "enc" "oup" "ord"
re_retrieve_last(x, pattern, requested_group = 1)
#>  [1] NA  NA  NA  "n" NA  NA  "u" NA  "r" NA  NA 
re_retrieve_last(x, pattern, drop_NA = TRUE, requested_group = 1)
#> [1] "n" "u" "r"
re_retrieve_last(x, pattern, requested_group = 2)
#>  [1] NA  NA  NA  "c" NA  NA  "p" NA  "d" NA  NA 

re_retrieve_all(x, pattern)
#> [1] "ent" "enc" "oup" "ord"
re_retrieve_all(x, pattern, unlist = FALSE)
#> [[1]]
#> character(0)
#> 
#> [[2]]
#> character(0)
#> 
#> [[3]]
#> character(0)
#> 
#> [[4]]
#> [1] "ent" "enc"
#> 
#> [[5]]
#> character(0)
#> 
#> [[6]]
#> character(0)
#> 
#> [[7]]
#> [1] "oup"
#> 
#> [[8]]
#> character(0)
#> 
#> [[9]]
#> [1] "ord"
#> 
#> [[10]]
#> character(0)
#> 
#> [[11]]
#> character(0)
#> 
re_retrieve_all(x, pattern, requested_group = 1)
#> [1] "n" "n" "u" "r"
re_retrieve_all(x, pattern, unlist = FALSE, requested_group = 1)
#> [[1]]
#> character(0)
#> 
#> [[2]]
#> character(0)
#> 
#> [[3]]
#> character(0)
#> 
#> [[4]]
#> [1] "n" "n"
#> 
#> [[5]]
#> character(0)
#> 
#> [[6]]
#> character(0)
#> 
#> [[7]]
#> [1] "u"
#> 
#> [[8]]
#> character(0)
#> 
#> [[9]]
#> [1] "r"
#> 
#> [[10]]
#> character(0)
#> 
#> [[11]]
#> character(0)
#> 
re_retrieve_all(x, pattern, requested_group = 2)
#> [1] "t" "c" "p" "d"

re_replace_first(x, "([oe].)", "{\\1}")
#> Token sequence of length 11
#> idx      token
#> --- ----------
#>   1       this
#>   2         is
#>   3          a
#>   4 s{en}tence
#>   5       with
#>   6          a
#>   7   c{ou}ple
#>   8       {of}
#>   9    w{or}ds
#>  10         in
#>  11         it
re_replace_all(x, "([oe].)", "{\\1}")
#> Token sequence of length 11
#> idx        token
#> --- ------------
#>   1         this
#>   2           is
#>   3            a
#>   4 s{en}t{en}ce
#>   5         with
#>   6            a
#>   7     c{ou}ple
#>   8         {of}
#>   9      w{or}ds
#>  10           in
#>  11           it