This function conducts a stable lexical marker analysis.
Usage
slma(
x,
y,
file_encoding = "UTF-8",
sig_cutoff = qchisq(0.95, df = 1),
small_pos = 1e-05,
keep_intermediate = FALSE,
verbose = TRUE,
min_rank = 1,
max_rank = 5000,
keeplist = NULL,
stoplist = NULL,
ngram_size = NULL,
max_skip = 0,
ngram_sep = "_",
ngram_n_open = 0,
ngram_open = "[]",
...
)
Arguments
- x, y
Character vector or fnames object with filenames for the two sets of documents.
- file_encoding
Encoding of all the files to read.
- sig_cutoff
Numeric value indicating the cutoff value for significance in the stable lexical marker analysis. The default value is qchisq(0.95, df = 1), which is about 3.84.
- small_pos
Alternative (but sometimes inferior) approach to dealing with zero frequencies, compared to haldane. The argument small_pos only applies when haldane is set to FALSE. (See the Details section.)
If haldane is FALSE and there is at least one zero frequency in a contingency table, small positive values are added to the zero frequency cells; this is done systematically for all measures calculated for that table, not just for the measures that need it.
- keep_intermediate
Logical. If TRUE, results from intermediate calculations are kept in the output as the "intermediate" element. This is necessary if you want to inspect the object with the details() method.
- verbose
Logical. Whether progress should be printed to the console during analysis.
- min_rank, max_rank
Minimum and maximum frequency rank in the first corpus (x) of the items to take into consideration as candidate stable markers. Only tokens or token n-grams with a frequency rank greater than or equal to min_rank and lower than or equal to max_rank will be included.
- keeplist
List of types that must certainly be included in the list of candidate markers, regardless of their frequency rank and of stoplist.
- stoplist
List of types that must not be included in the list of candidate markers. However, if a type is included in keeplist, its presence in stoplist is disregarded. (The first sketch after this argument list illustrates how min_rank, max_rank, keeplist and stoplist interact.)
- ngram_size
Argument in support of ngrams/skipgrams (see also max_skip).
If one wants to identify individual tokens, the value of ngram_size should be NULL or 1. If one wants to retrieve token ngrams/skipgrams, ngram_size should be an integer indicating the size of the ngrams/skipgrams, e.g. 2 for bigrams or 3 for trigrams. (See the second sketch after this argument list.)
- max_skip
Argument in support of skipgrams. This argument is ignored if ngram_size is NULL or 1.
If ngram_size is 2 or higher and max_skip is 0, then regular ngrams are retrieved (albeit that they may contain open slots; see ngram_n_open).
If ngram_size is 2 or higher and max_skip is 1 or higher, then skipgrams are retrieved (which in the current implementation cannot contain open slots; see ngram_n_open).
For instance, if ngram_size is 3 and max_skip is 2, then 2-skip trigrams are retrieved; if ngram_size is 5 and max_skip is 3, then 3-skip 5-grams are retrieved.
- ngram_sep
Character vector of length 1 containing the string that is used to separate/link tokens in the representation of ngrams/skipgrams in the output of this function.
- ngram_n_open
If ngram_size is 2 or higher, and moreover ngram_n_open is a number higher than 0, then ngrams with 'open slots' in them are retrieved. These ngrams with 'open slots' are generalizations of fully lexically specific ngrams, the generalization being that one or more of the items in the ngram are replaced by a notation that stands for 'any arbitrary token'.
For instance, if ngram_size is 4 and ngram_n_open is 1, and if moreover the input contains a 4-gram "it_is_widely_accepted", then the output will contain all modifications of "it_is_widely_accepted" in which one (since ngram_n_open is 1) of the items in this ngram is replaced by an open slot. The first and the last item inside an ngram are never turned into an open slot; only the items in between are candidates for being turned into open slots. Therefore, in the example, the output will contain "it_[]_widely_accepted" and "it_is_[]_accepted".
As a second example, if ngram_size is 5 and ngram_n_open is 2, and if moreover the input contains a 5-gram "it_is_widely_accepted_that", then the output will contain "it_[]_[]_accepted_that", "it_[]_widely_[]_that", and "it_is_[]_[]_that".
- ngram_open
Character string used to represent open slots in ngrams in the output of this function.
- ...
Additional arguments.
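To make the candidate selection arguments concrete, here is a minimal sketch; a_corp and b_corp are the corpora built in the Examples section below, and the specific types in keeplist and stoplist are merely illustrative assumptions, not recommendations.

a_corp <- get_fnames(system.file("extdata", "cleveland", package = "mclm"))
b_corp <- get_fnames(system.file("extdata", "roosevelt", package = "mclm"))

# Consider only the 300 most frequent types of the first corpus as
# candidate markers, force "government" onto the candidate list
# regardless of rank, and exclude the function words "the" and "of"
# (all three types are invented examples).
slma_sel <- slma(a_corp, b_corp,
                 max_rank = 300,
                 keeplist = c("government"),
                 stoplist = c("the", "of"))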
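The ngram-related arguments can be combined as in the following sketch, again using a_corp and b_corp from the previous sketch; which combinations are useful depends on the research question.

# Bigrams: tokens joined by "_", the default ngram_sep.
slma_bigr <- slma(a_corp, b_corp, ngram_size = 2)

# 2-skip trigrams (skipgrams cannot contain open slots).
slma_skip <- slma(a_corp, b_corp, ngram_size = 3, max_skip = 2)

# Trigrams with one open slot, shown as "[]" (the default ngram_open).
slma_open <- slma(a_corp, b_corp, ngram_size = 3, ngram_n_open = 1)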
Value
An object of class slma, which is a named list with at least the following
elements:
- A scores dataframe with information about the stability of the chosen lexical items. (See below.)
- An intermediate list with a register of intermediate values if keep_intermediate was TRUE.
- Named items registering the values of the arguments with the same name, namely sig_cutoff, small_pos, x, and y.
The slma object has as_data_frame() and print methods
as well as an ad-hoc details() method. Note that the print
method simply prints the main dataframe.
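As a brief illustration (slma_ex is the same object that is built in the Examples section below):

slma_ex <- slma(a_corp, b_corp)      # as in the Examples section
print(slma_ex)                       # shows the main (scores) dataframe
scores_df <- as_data_frame(slma_ex)  # convert to a regular data frame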
Contents of the scores element
The scores element is a dataframe of which the rows are linguistic items
for which a stable lexical marker analysis was conducted and the columns are
different 'stability measures' and related statistics. By default, the
linguistic items are sorted by decreasing 'stability' according to the S_lor
measure.
| Column | Name | Computation | Range of values |
|--------|------|-------------|-----------------|
| S_abs | Absolute stability | S_att - S_rep | \(-(n*m)\) -- \((n*m)\) |
| S_nrm | Normalized stability | S_abs / \(n*m\) | -1 -- 1 |
| S_att | Stability of attraction | Number of \((a,b)\) couples in which the linguistic item is a keyword for the A-documents | 0 -- \(n*m\) |
| S_rep | Stability of repulsion | Number of \((a,b)\) couples in which the linguistic item is a keyword for the B-documents | 0 -- \(n*m\) |
| S_lor | Log of odds ratio stability | Mean of log_OR across all \((a,b)\) couples, setting the value to 0 when p_G is larger than sig_cutoff | (see below) |
More precisely, S_lor is computed as a fraction: its numerator is the sum of all log_OR values across the \((a,b)\) couples for which p_G is lower than sig_cutoff, and its denominator is \(n*m\).
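Written out as a formula, this reconstruction of the prose description (not notation quoted from the package itself) reads:
\( S_{lor} = \frac{1}{n * m} \sum_{(a,b)\,:\; p_G < \text{sig\_cutoff}} \text{log\_OR}_{(a,b)} \)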
For more on log_OR, see the Value section of assoc_scores(). The final three columns of the output are meant as a tool in support of the interpretation of the log_OR values. Considering, for each item, all \((a,b)\) couples for which p_G is smaller than sig_cutoff, lor_min, lor_max and lor_sd are the minimum, the maximum and the standard deviation of the log_OR values across those couples.
Details
A stable lexical marker analysis of the A-documents in x versus the B-documents
in y starts from a separate keyword analysis for all possible document couples
\((a,b)\), with a an A-document and b a B-document. If there are n
A-documents and m B-documents, then \(n*m\) keyword analyses are
conducted. The 'stability' of a linguistic item x, as a marker for the
collection of A-documents (when compared to the B-documents), corresponds
to the frequency and consistency with which x is found to be a keyword for
the A-documents across all aforementioned keyword analyses.
In any specific keyword analysis, x is considered a keyword for an A-document
if G_signed is positive and moreover p_G is less than sig_cutoff
(see assoc_scores() for more information on the measures). Item x is
considered a keyword for the B-document if G_signed is negative and moreover
p_G is less than sig_cutoff.
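As a toy numeric illustration of how the counting-based stability measures follow from these keyword verdicts (the verdicts below are invented; in a real analysis they come from the \(n*m\) keyword analyses):

# Invented keyword verdicts for one item across four (a,b) couples
# (n = 2 A-documents times m = 2 B-documents).
keyword_for_a <- c(TRUE, TRUE, FALSE, TRUE)    # keyword for the A-document
keyword_for_b <- c(FALSE, FALSE, TRUE, FALSE)  # keyword for the B-document
S_att <- sum(keyword_for_a)             # stability of attraction: 3
S_rep <- sum(keyword_for_b)             # stability of repulsion: 1
S_abs <- S_att - S_rep                  # absolute stability: 2
S_nrm <- S_abs / length(keyword_for_a)  # normalized stability: 0.5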
Examples
a_corp <- get_fnames(system.file("extdata", "cleveland", package = "mclm"))
b_corp <- get_fnames(system.file("extdata", "roosevelt", package = "mclm"))
slma_ex <- slma(a_corp, b_corp)
#> building global frequency list for x
#> ....
#> building separate frequency lists for each document
#> ....
#> .....
#> calculating assoc scores
#> ....................
#> calculating stability measures
#> done
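A possible follow-up, assuming one wants to inspect the keyword analyses behind an individual marker; the type "government" is an invented example, and the call assumes the details() method takes the marker type as its second argument.

# keep_intermediate = TRUE is required for the details() method
slma_det <- slma(a_corp, b_corp, keep_intermediate = TRUE)
details(slma_det, "government")
head(slma_ex$scores)   # the most stable markers, sorted by S_lor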
