This function conducts a stable lexical marker analysis comparing two collections of documents.
Usage
slma(
x,
y,
file_encoding = "UTF-8",
sig_cutoff = qchisq(0.95, df = 1),
small_pos = 1e-05,
keep_intermediate = FALSE,
verbose = TRUE,
min_rank = 1,
max_rank = 5000,
keeplist = NULL,
stoplist = NULL,
ngram_size = NULL,
max_skip = 0,
ngram_sep = "_",
ngram_n_open = 0,
ngram_open = "[]",
...
)
Arguments
- x, y
Character vector or fnames object with filenames for the two sets of documents.
- file_encoding
Encoding of all the files to read.
- sig_cutoff
Numeric value indicating the cutoff value for 'significance' in the stable lexical marker analysis. The default value is qchisq(0.95, df = 1), which is about 3.84.
- small_pos
Alternative (but sometimes inferior) approach to dealing with zero frequencies, compared to haldane. The argument small_pos only applies when haldane is set to FALSE. (See the Details section.)
If haldane is FALSE and there is at least one zero frequency in a contingency table, adding small positive values to the zero frequency cells is done systematically for all measures calculated for that table, not just for the measures that require it.
- keep_intermediate
Logical. If TRUE, results from intermediate calculations are kept in the output as the "intermediate" element. This is necessary if you want to inspect the object with the details() method.
- verbose
Logical. Whether progress should be printed to the console during analysis.
- min_rank, max_rank
Minimum and maximum frequency rank in the first corpus (x) of the items to take into consideration as candidate stable markers. Only tokens or token n-grams with a frequency rank greater than or equal to min_rank and lower than or equal to max_rank will be included.
- keeplist
List of types that must certainly be included in the list of candidate markers, regardless of their frequency rank and of stoplist.
- stoplist
List of types that must not be included in the list of candidate markers; however, if a type is included in keeplist, its presence in stoplist is disregarded.
- ngram_size
Argument in support of ngrams/skipgrams (see also max_skip).
If one wants to identify individual tokens, the value of ngram_size should be NULL or 1. If one wants to retrieve token ngrams/skipgrams, ngram_size should be an integer indicating the size of the ngrams/skipgrams, e.g. 2 for bigrams, or 3 for trigrams, etc. (A sketch combining the n-gram related arguments follows this list of arguments.)
- max_skip
Argument in support of skipgrams. This argument is ignored if ngram_size is NULL or 1.
If ngram_size is 2 or higher and max_skip is 0, then regular ngrams are retrieved (albeit that they may contain open slots; see ngram_n_open).
If ngram_size is 2 or higher and max_skip is 1 or higher, then skipgrams are retrieved (which in the current implementation cannot contain open slots; see ngram_n_open).
For instance, if ngram_size is 3 and max_skip is 2, then 2-skip trigrams are retrieved. Or if ngram_size is 5 and max_skip is 3, then 3-skip 5-grams are retrieved.
- ngram_sep
Character vector of length 1 containing the string that is used to separate/link tokens in the representation of ngrams/skipgrams in the output of this function.
- ngram_n_open
If ngram_size is 2 or higher, and moreover ngram_n_open is a number higher than 0, then ngrams with 'open slots' in them are retrieved. These ngrams with 'open slots' are generalizations of fully lexically specific ngrams, with the generalization being that one or more of the items in the ngram are replaced by a notation that stands for 'any arbitrary token'.
For instance, if ngram_size is 4 and ngram_n_open is 1, and if moreover the input contains a 4-gram "it_is_widely_accepted", then the output will contain all modifications of "it_is_widely_accepted" in which one (since ngram_n_open is 1) of the items in this ngram is replaced by an open slot. The first and the last item inside an ngram are never turned into an open slot; only the items in between are candidates for being turned into open slots. Therefore, in the example, the output will contain "it_[]_widely_accepted" and "it_is_[]_accepted".
As a second example, if ngram_size is 5 and ngram_n_open is 2, and if moreover the input contains a 5-gram "it_is_widely_accepted_that", then the output will contain "it_[]_[]_accepted_that", "it_[]_widely_[]_that", and "it_is_[]_[]_that".
- ngram_open
Character string used to represent open slots in ngrams in the output of this function.
- ...
Additional arguments.
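To illustrate how the rank, list and n-gram related arguments can be combined, here is a hedged sketch. It reuses the small example corpora shipped with mclm (see the Examples section below); the argument values themselves are arbitrary choices for illustration.

a_corp <- get_fnames(system.file("extdata", "cleveland", package = "mclm"))
b_corp <- get_fnames(system.file("extdata", "roosevelt", package = "mclm"))

# Candidate markers limited to the 500 highest-frequency items in a_corp,
# never including "the" (stoplist is overridden only by keeplist):
slma_uni <- slma(a_corp, b_corp, max_rank = 500, stoplist = c("the"))

# Bigrams instead of individual tokens:
slma_bi <- slma(a_corp, b_corp, ngram_size = 2)

# 2-skip trigrams (max_skip only matters when ngram_size is 2 or higher):
slma_skip <- slma(a_corp, b_corp, ngram_size = 3, max_skip = 2)

# 4-grams in which one middle item may be an open slot,
# e.g. "it_[]_widely_accepted":
slma_open <- slma(a_corp, b_corp, ngram_size = 4, ngram_n_open = 1)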
Value
An object of class slma, which is a named list with at least the following elements:
- A scores dataframe with information about the stability of the chosen lexical items. (See below.)
- An intermediate list with a register of intermediate values if keep_intermediate was TRUE.
- Named items registering the values of the arguments with the same name, namely sig_cutoff, small_pos, x, and y.
The slma object has as_data_frame() and print methods, as well as an ad-hoc details() method. Note that the print method simply prints the main dataframe.
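For instance, the elements and methods can be accessed as in the sketch below. It reuses the example corpora from the Examples section; note that the exact arguments of details() are not documented on this page, so the call shown is an assumption.

a_corp <- get_fnames(system.file("extdata", "cleveland", package = "mclm"))
b_corp <- get_fnames(system.file("extdata", "roosevelt", package = "mclm"))
res <- slma(a_corp, b_corp, keep_intermediate = TRUE, verbose = FALSE)

res$scores       # the main dataframe of stability measures
res$sig_cutoff   # the significance cutoff that was used
# details() requires keep_intermediate = TRUE; passing a marker type
# as the second argument is an assumption, not documented behavior:
details(res, "government")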
Contents of the scores element
The scores element is a dataframe whose rows are the linguistic items for which a stable lexical marker analysis was conducted and whose columns are different 'stability measures' and related statistics. By default, the linguistic items are sorted by decreasing 'stability' according to the S_lor measure.
Column | Name | Computation | Range of values |
S_abs | Absolute stability | S_att - S_rep | \(-(n*m)\) -- \((n*m)\) |
S_nrm | Normalized stability | S_abs / \((n*m)\) | -1 -- 1 |
S_att | Stability of attraction | Number of \((a,b)\) couples in which the linguistic item is a keyword for the A-documents | 0 -- \(n*m\) |
S_rep | Stability of repulsion | Number of \((a,b)\) couples in which the linguistic item is a keyword for the B-documents | 0 -- \(n*m\) |
S_lor | Log of odds ratio stability | Mean of log_OR across all \((a,b)\) couples, setting the value to 0 when p_G is larger than sig_cutoff | |
S_lor is then computed as a fraction whose numerator is the sum of all log_OR values across all \((a,b)\) couples for which p_G is lower than sig_cutoff, and whose denominator is \(n*m\).
For more on log_OR, see the Value section of assoc_scores(). The final three columns of the output are meant as a tool in support of the interpretation of the log_OR column. Considering, for each linguistic item, all \((a,b)\) couples for which p_G is smaller than sig_cutoff, lor_min, lor_max and lor_sd are the minimum, maximum and standard deviation of log_OR across those couples.
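To make the computations concrete, here is a small base R sketch for a single linguistic item. All per-couple values are invented, and the cutoff of 0.05 is purely illustrative for these toy p-values.

# Invented per-couple scores for one item across the n * m = 6 couples:
n <- 2; m <- 3
G_signed <- c(10.1, 6.3, 0.2, 4.9, -0.4, 8.8)
log_OR   <- c(1.9, 2.4, 0.3, 1.1, -0.2, 2.8)
p_G      <- c(0.001, 0.020, 0.700, 0.030, 0.900, 0.004)
sig_cutoff <- 0.05  # illustrative cutoff for these toy p-values

sig <- p_G < sig_cutoff
S_att <- sum(sig & G_signed > 0)      # 4: keyword for the A-documents
S_rep <- sum(sig & G_signed < 0)      # 0: keyword for the B-documents
S_abs <- S_att - S_rep                # 4
S_nrm <- S_abs / (n * m)              # 0.667
S_lor <- sum(log_OR[sig]) / (n * m)   # (1.9 + 2.4 + 1.1 + 2.8) / 6 = 1.367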
Details
A stable lexical marker analysis of the A-documents in x versus the B-documents in y starts from a separate keyword analysis for all possible document couples \((a,b)\), where a is an A-document and b is a B-document. If there are n A-documents and m B-documents, then \(n*m\) keyword analyses are conducted. The 'stability' of a linguistic item x, as a marker for the collection of A-documents (when compared to the B-documents), corresponds to the frequency and consistency with which x is found to be a keyword for the A-documents across all aforementioned keyword analyses.
In any specific keyword analysis, x is considered a keyword for an A-document if G_signed is positive and moreover p_G is less than sig_cutoff (see assoc_scores() for more information on the measures). Item x is considered a keyword for a B-document if G_signed is negative and moreover p_G is less than sig_cutoff.
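As a hedged illustration of this decision rule, the following sketch applies it to a toy table for one \((a,b)\) couple. The column names G_signed and p_G follow the description above; the types, values and cutoff are invented.

# Toy keyword decision for one (a, b) couple, following the rule above:
scores <- data.frame(
  type     = c("government", "island", "the"),
  G_signed = c(12.3, -8.7, 0.4),
  p_G      = c(0.001, 0.004, 0.520)
)
sig_cutoff <- 0.05  # illustrative cutoff for these toy p-values

keyword_for_a <- scores$G_signed > 0 & scores$p_G < sig_cutoff
keyword_for_b <- scores$G_signed < 0 & scores$p_G < sig_cutoff
scores$type[keyword_for_a]  # "government"
scores$type[keyword_for_b]  # "island"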
Examples
a_corp <- get_fnames(system.file("extdata", "cleveland", package = "mclm"))
b_corp <- get_fnames(system.file("extdata", "roosevelt", package = "mclm"))
slma_ex <- slma(a_corp, b_corp)
#> building global frequency list for x
#> ....
#> building separate frequency lists for each document
#> ....
#> .....
#> calculating assoc scores
#> ....................
#> calculating stability measures
#> done
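# A hedged follow-up: inspect the result with the methods mentioned
# in the Value section (print() shows the main dataframe of stability
# measures; as_data_frame() returns it as a regular data frame):
print(slma_ex)
head(as_data_frame(slma_ex))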