This function conducts a stable lexical marker analysis comparing two collections of documents.
Usage
slma(
x,
y,
file_encoding = "UTF-8",
sig_cutoff = qchisq(0.95, df = 1),
small_pos = 1e-05,
keep_intermediate = FALSE,
verbose = TRUE,
min_rank = 1,
max_rank = 5000,
keeplist = NULL,
stoplist = NULL,
ngram_size = NULL,
max_skip = 0,
ngram_sep = "_",
ngram_n_open = 0,
ngram_open = "[]",
...
)
Arguments
- x, y
Character vector or fnames object with filenames for the two sets of documents.
- file_encoding
Encoding of all the files to read.
- sig_cutoff
Numeric value indicating the cutoff value for 'significance' in the stable lexical marker analysis. The default value is qchisq(0.95, df = 1), which is about 3.84.
- small_pos
Alternative (but sometimes inferior) approach to dealing with zero frequencies, compared to haldane. The argument small_pos only applies when haldane is set to FALSE. (See the Details section.)
If haldane is FALSE and there is at least one zero frequency in a contingency table, adding small positive values to the zero frequency cells is done systematically for all measures calculated for that table, not just for the measures that require it.
- keep_intermediate
Logical. If TRUE, results from intermediate calculations are kept in the output as the "intermediate" element. This is necessary if you want to inspect the object with the details() method.
- verbose
Logical. Whether progress should be printed to the console during analysis.
- min_rank, max_rank
Minimum and maximum frequency rank in the first corpus (x) of the items to take into consideration as candidate stable markers. Only tokens or token n-grams with a frequency rank greater than or equal to min_rank and lower than or equal to max_rank will be included.
- keeplist
List of types that must certainly be included in the list of candidate markers, regardless of their frequency rank and of stoplist.
- stoplist
List of types that must not be included in the list of candidate markers; however, if a type is included in keeplist, its presence in stoplist is disregarded.
- ngram_size
Argument in support of ngrams/skipgrams (see also max_skip).
If one wants to identify individual tokens, the value of ngram_size should be NULL or 1. If one wants to retrieve token ngrams/skipgrams, ngram_size should be an integer indicating the size of the ngrams/skipgrams, e.g. 2 for bigrams, or 3 for trigrams, etc. (A sketch combining the n-gram related arguments follows this list of arguments.)
- max_skip
Argument in support of skipgrams. This argument is ignored if ngram_size is NULL or 1.
If ngram_size is 2 or higher and max_skip is 0, then regular ngrams are retrieved (albeit that they may contain open slots; see ngram_n_open).
If ngram_size is 2 or higher and max_skip is 1 or higher, then skipgrams are retrieved (which in the current implementation cannot contain open slots; see ngram_n_open).
For instance, if ngram_size is 3 and max_skip is 2, then 2-skip trigrams are retrieved. Or if ngram_size is 5 and max_skip is 3, then 3-skip 5-grams are retrieved.
- ngram_sep
Character vector of length 1 containing the string that is used to separate/link tokens in the representation of ngrams/skipgrams in the output of this function.
- ngram_n_open
If ngram_size is 2 or higher, and moreover ngram_n_open is a number higher than 0, then ngrams with 'open slots' in them are retrieved. These ngrams with 'open slots' are generalizations of fully lexically specific ngrams, with the generalization being that one or more of the items in the ngram are replaced by a notation that stands for 'any arbitrary token'.
For instance, if ngram_size is 4 and ngram_n_open is 1, and if moreover the input contains a 4-gram "it_is_widely_accepted", then the output will contain all modifications of "it_is_widely_accepted" in which one (since ngram_n_open is 1) of the items in this ngram is replaced by an open slot. The first and the last item inside an ngram are never turned into an open slot; only the items in between are candidates for being turned into open slots. Therefore, in the example, the output will contain "it_[]_widely_accepted" and "it_is_[]_accepted".
As a second example, if ngram_size is 5 and ngram_n_open is 2, and if moreover the input contains a 5-gram "it_is_widely_accepted_that", then the output will contain "it_[]_[]_accepted_that", "it_[]_widely_[]_that", and "it_is_[]_[]_that".
- ngram_open
Character string used to represent open slots in ngrams in the output of this function.
- ...
Additional arguments.
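To illustrate how the rank, list and n-gram related arguments can be combined, here is a hedged sketch. It reuses the small example corpora shipped with mclm (see the Examples section below); the argument values themselves are arbitrary choices for illustration.

a_corp <- get_fnames(system.file("extdata", "cleveland", package = "mclm"))
b_corp <- get_fnames(system.file("extdata", "roosevelt", package = "mclm"))

# Candidate markers limited to the 500 highest-frequency items in a_corp,
# never including "the" (stoplist is overridden only by keeplist):
slma_uni <- slma(a_corp, b_corp, max_rank = 500, stoplist = c("the"))

# Bigrams instead of individual tokens:
slma_bi <- slma(a_corp, b_corp, ngram_size = 2)

# 2-skip trigrams (max_skip only matters when ngram_size is 2 or higher):
slma_skip <- slma(a_corp, b_corp, ngram_size = 3, max_skip = 2)

# 4-grams in which one middle item may be an open slot,
# e.g. "it_[]_widely_accepted":
slma_open <- slma(a_corp, b_corp, ngram_size = 4, ngram_n_open = 1)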
Value
An object of class slma, which is a named list with at least the following elements:
- A scores dataframe with information about the stability of the chosen lexical items. (See below.)
- An intermediate list with a register of intermediate values if keep_intermediate was TRUE.
- Named items registering the values of the arguments with the same name, namely sig_cutoff, small_pos, x, and y.
The slma object has as_data_frame() and print methods, as well as an ad-hoc details() method. Note that the print method simply prints the main dataframe.
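For instance, the elements and methods can be accessed as in the sketch below. It reuses the example corpora from the Examples section; note that the exact arguments of details() are not documented on this page, so the call shown is an assumption.

a_corp <- get_fnames(system.file("extdata", "cleveland", package = "mclm"))
b_corp <- get_fnames(system.file("extdata", "roosevelt", package = "mclm"))
res <- slma(a_corp, b_corp, keep_intermediate = TRUE, verbose = FALSE)

res$scores       # the main dataframe of stability measures
res$sig_cutoff   # the significance cutoff that was used
# details() requires keep_intermediate = TRUE; passing a marker type
# as the second argument is an assumption, not documented behavior:
details(res, "government")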
Contents of the scores element
The scores element is a dataframe whose rows are the linguistic items for which a stable lexical marker analysis was conducted and whose columns are different 'stability measures' and related statistics. By default, the linguistic items are sorted by decreasing 'stability' according to the S_lor measure.
Column | Name | Computation | Range of values |
S_abs | Absolute stability | S_att - S_rep | \(-(n*m)\) -- \((n*m)\) |
S_nrm | Normalized stability | S_abs / \((n*m)\) | -1 -- 1 |
S_att | Stability of attraction | Number of \((a,b)\) couples in which the linguistic item is a keyword for the A-documents | 0 -- \(n*m\) |
S_rep | Stability of repulsion | Number of \((a,b)\) couples in which the linguistic item is a keyword for the B-documents | 0 -- \(n*m\) |
S_lor | Log of odds ratio stability | Mean of log_OR across all \((a,b)\) couples, setting the value to 0 when p_G is larger than sig_cutoff | |
S_lor is then computed as a fraction whose numerator is the sum of all log_OR values across all \((a,b)\) couples for which p_G is lower than sig_cutoff, and whose denominator is \(n*m\).
For more on log_OR, see the Value section of assoc_scores(). The final three columns of the output are meant as a tool in support of the interpretation of the log_OR column. Considering, for each linguistic item, all \((a,b)\) couples for which p_G is smaller than sig_cutoff, lor_min, lor_max and lor_sd are the minimum, maximum and standard deviation of log_OR across those couples.
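To make the computations concrete, here is a small base R sketch for a single linguistic item. All per-couple values are invented, and the cutoff of 0.05 is purely illustrative for these toy p-values.

# Invented per-couple scores for one item across the n * m = 6 couples:
n <- 2; m <- 3
G_signed <- c(10.1, 6.3, 0.2, 4.9, -0.4, 8.8)
log_OR   <- c(1.9, 2.4, 0.3, 1.1, -0.2, 2.8)
p_G      <- c(0.001, 0.020, 0.700, 0.030, 0.900, 0.004)
sig_cutoff <- 0.05  # illustrative cutoff for these toy p-values

sig <- p_G < sig_cutoff
S_att <- sum(sig & G_signed > 0)      # 4: keyword for the A-documents
S_rep <- sum(sig & G_signed < 0)      # 0: keyword for the B-documents
S_abs <- S_att - S_rep                # 4
S_nrm <- S_abs / (n * m)              # 0.667
S_lor <- sum(log_OR[sig]) / (n * m)   # (1.9 + 2.4 + 1.1 + 2.8) / 6 = 1.367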
Details
A stable lexical marker analysis of the A-documents in x versus the B-documents in y starts from a separate keyword analysis for all possible document couples \((a,b)\), where a is an A-document and b is a B-document. If there are n A-documents and m B-documents, then \(n*m\) keyword analyses are conducted. The 'stability' of a linguistic item x, as a marker for the collection of A-documents (when compared to the B-documents), corresponds to the frequency and consistency with which x is found to be a keyword for the A-documents across all aforementioned keyword analyses.
In any specific keyword analysis, x is considered a keyword for an A-document if G_signed is positive and moreover p_G is less than sig_cutoff (see assoc_scores() for more information on the measures). Item x is considered a keyword for a B-document if G_signed is negative and moreover p_G is less than sig_cutoff.
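As a hedged illustration of this decision rule, the following sketch applies it to a toy table for one \((a,b)\) couple. The column names G_signed and p_G follow the description above; the types, values and cutoff are invented.

# Toy keyword decision for one (a, b) couple, following the rule above:
scores <- data.frame(
  type     = c("government", "island", "the"),
  G_signed = c(12.3, -8.7, 0.4),
  p_G      = c(0.001, 0.004, 0.520)
)
sig_cutoff <- 0.05  # illustrative cutoff for these toy p-values

keyword_for_a <- scores$G_signed > 0 & scores$p_G < sig_cutoff
keyword_for_b <- scores$G_signed < 0 & scores$p_G < sig_cutoff
scores$type[keyword_for_a]  # "government"
scores$type[keyword_for_b]  # "island"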
Examples
a_corp <- get_fnames(system.file("extdata", "cleveland", package = "mclm"))
b_corp <- get_fnames(system.file("extdata", "roosevelt", package = "mclm"))
slma_ex <- slma(a_corp, b_corp)
#> building global frequency list for x
#> ....
#> building separate frequency lists for each document
#> ....
#> .....
#> calculating assoc scores
#> ....................
#> calculating stability measures
#> done
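# A hedged follow-up: inspect the result with the methods mentioned
# in the Value section (print() shows the main dataframe of stability
# measures; as_data_frame() returns it as a regular data frame):
print(slma_ex)
head(as_data_frame(slma_ex))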