Build a concordance for the matches of a regex

This function builds a concordance for the matches of a regular expression. The result is a dataset that can be written to a file with the function write_conc(). It mimics the behavior of the concordance tool in the program AntConc.

Usage

conc(
  x,
  pattern,
  c_left = 200,
  c_right = 200,
  perl = TRUE,
  re_drop_line = NULL,
  line_glue = "\n",
  re_cut_area = NULL,
  file_encoding = "UTF-8",
  as_text = FALSE
)

Arguments

x

A character vector determining which text is to be used as corpus.

If as_text = TRUE, x is treated as the actual text to be used as corpus.

If as_text = FALSE (the default), x is treated as a vector of filenames, interpreted as the names of the corpus files that contain the actual corpus data.

pattern

Character string containing the regular expression that serves as search term for the concordancer.

c_left

Number. How many characters to the left of each match must be included in the result as left co-text of the match.

c_right

Number. How many characters to the right of each match must be included in the result as right co-text of the match.

perl

Logical. If TRUE, pattern is treated as a PCRE-flavor regular expression. Otherwise, pattern is treated as a regular expression in R's default flavor of regular expressions.

re_drop_line

Character vector or NULL. If NULL, the argument is ignored. Otherwise, lines in x containing a match for re_drop_line are treated as not belonging to the corpus and are excluded from the results.

line_glue

Character vector or NULL. If NULL, the argument is ignored. Otherwise, all lines in the corpus are glued together in one character vector of length 1, with the string line_glue pasted in between consecutive lines. The value of line_glue can also be equal to the empty string (""). The 'line_glue' operation is conducted immediately after the 'drop line' operation.

re_cut_area

Character vector or NULL. If NULL, the argument is ignored. Otherwise, all matches for re_cut_area in the corpus are 'cut out' of the text prior to the identification of the tokens in the text (and are therefore not taken into account when identifying tokens). The 'cut area' operation is conducted immediately after the 'line glue' operation.

file_encoding

File encoding for reading each corpus file. Ignored if as_text = TRUE. Otherwise, it must be a character vector of length one (in which case the same encoding is used for all files) or with the same length as x (in which case each file can have a different encoding).

as_text

Logical. If TRUE, the content of x is treated as the actual text of the corpus (with each item within x treated as a separate 'document in RAM').

If FALSE, x is treated as a vector of filenames, interpreted as the names of the corpus files with the actual corpus data.
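
The order of the three preprocessing steps (drop line, then line glue, then cut area) can be sketched in base R as follows. This is only an illustration under stated assumptions, not the package's actual implementation: the sample lines and the two regular expressions are made up for the example, and conc() itself is not called.

```r
# Hypothetical corpus lines and settings, for illustration only.
lines        <- c("<header>", "First line", "second [note] line")
re_drop_line <- "^<"          # drop lines that start with "<"
line_glue    <- "\n"          # string pasted between consecutive lines
re_cut_area  <- "\\[.*?\\]"   # cut out bracketed annotations

# Step 1: 'drop line' removes every line that matches re_drop_line.
kept <- lines[!grepl(re_drop_line, lines, perl = TRUE)]

# Step 2: 'line glue' collapses the remaining lines into one string.
glued <- paste(kept, collapse = line_glue)

# Step 3: 'cut area' removes every match for re_cut_area from the glued text.
cleaned <- gsub(re_cut_area, "", glued, perl = TRUE)

cleaned
```

Because the cut happens after the glue, a pattern given as re_cut_area can in principle span what were originally separate lines.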

Value

Object of class conc, a kind of data frame in which each row is a match, with the following columns:

glob_id: Number indicating the position of the match in the overall list of matches.
id: Number indicating the position of the match in the list of matches for one specific query.
source: Either the filename of the file in which the match was found (in case of the setting as_text = FALSE), or the string '-' (in case of the setting as_text = TRUE).
left: The left-hand side co-text of each match.
match: The actual match.
right: The right-hand side co-text of each match.

It also has additional attributes and methods such as:

base as_data_frame() and print() methods, as well as a print_kwic() function,
an explore() method.

An object of class conc can be merged with another by means of merge_conc(). It can be written to file with write_conc() and then read with read_conc(). It is also possible to import concordances created by means other than write_conc() with import_conc().
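
To make the relation between the left, match, and right columns and the arguments c_left and c_right more concrete, here is a rough base R sketch of the idea. It is an assumption-laden illustration rather than the code conc() actually runs: the helper variables (starts, ends, hit) are hypothetical, the context width is set to 6 characters instead of the default 200 so that truncation is visible, and no whitespace normalization is applied.

```r
text    <- "A very small corpus."
pattern <- "\\w+"
c_left  <- 6L   # characters of left co-text to keep (illustrative value)
c_right <- 6L   # characters of right co-text to keep (illustrative value)

# Locate all matches of the pattern in the text.
m      <- gregexpr(pattern, text, perl = TRUE)[[1]]
starts <- as.integer(m)
ends   <- starts + attr(m, "match.length") - 1L

# Slice out the match and up to c_left / c_right characters around it.
hit   <- substring(text, starts, ends)
left  <- substring(text, pmax(1L, starts - c_left), starts - 1L)
right <- substring(text, ends + 1L, pmin(nchar(text), ends + c_right))

data.frame(left = left, match = hit, right = right)
```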

Details

To make sure that the columns left, match, and right in the output of conc() do not contain any TAB or NEWLINE characters, whitespace in these items is 'normalized'. More specifically, each stretch of whitespace, i.e. each uninterrupted sequence of whitespace characters, is replaced by a single SPACE character.
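
The effect of this normalization can be reproduced with a one-line base R helper. The function name normalize_ws is hypothetical and is not part of the package; the sketch only mirrors the behavior described above.

```r
# Replace each uninterrupted run of whitespace with a single space.
normalize_ws <- function(x) gsub("\\s+", " ", x, perl = TRUE)

normalize_ws("one\ttwo\n\n three")
#> [1] "one two three"
```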

The values of glob_id and id are always identical in a dataset that is the direct output of conc(). The column glob_id only becomes useful later, for instance when one wants to merge two datasets.

Examples

(conc_data <- conc('A very small corpus.', '\\w+', as_text = TRUE))
#> Concordance-based data frame (number of observations: 4)
#> idx                                           left|match |right             
#>   1                                               |  A   |very small corpus.
#>   2                                              A| very |small corpus.     
#>   3                                         A very|small |corpus.           
#>   4                                   A very small|corpus|.                 
#> 
#> This data frame has 6 columns:
#>    column
#> 1 glob_id
#> 2      id
#> 3  source
#> 4    left
#> 5   match
#> 6   right
print(conc_data)
#> Concordance-based data frame (number of observations: 4)
#> idx                                           left|match |right             
#>   1                                               |  A   |very small corpus.
#>   2                                              A| very |small corpus.     
#>   3                                         A very|small |corpus.           
#>   4                                   A very small|corpus|.                 
#> 
#> This data frame has 6 columns:
#>    column
#> 1 glob_id
#> 2      id
#> 3  source
#> 4    left
#> 5   match
#> 6   right
print_kwic(conc_data)
#> idx                                           left|match |right             
#>   1                                               |  A   |very small corpus.
#>   2                                              A| very |small corpus.     
#>   3                                         A very|small |corpus.           
#>   4                                   A very small|corpus|.