Association scores used in collocation analysis and keyword analysis
Source: R/assoc.R
assoc_scores() and assoc_abcd() take as their arguments co-occurrence
frequencies of a number of items and return a range of association scores used
in collocation analysis, collostruction analysis and keyword analysis.
Usage
assoc_scores(
  x,
  y = NULL,
  min_freq = 3,
  measures = NULL,
  with_variants = FALSE,
  show_dots = FALSE,
  p_fisher_2 = FALSE,
  haldane = TRUE,
  small_pos = 1e-05
)

assoc_abcd(
  a,
  b,
  c,
  d,
  types = NULL,
  measures = NULL,
  with_variants = FALSE,
  show_dots = FALSE,
  p_fisher_2 = FALSE,
  haldane = TRUE,
  small_pos = 1e-05
)
Arguments
- x
Either an object of class freqlist or an object of class cooc_info.
If x is a freqlist, it is interpreted as the target frequency list (i.e. the list with the frequency of items in the target context) and y must be a freqlist with the frequency of items in the reference context.
If x is an object of class cooc_info instead, it is interpreted as containing target frequency information, reference frequency information and corpus size information.
- y
An object of class freqlist with the frequencies of the reference context if x is also a freqlist. If x is an object of class cooc_info, this argument is ignored.
- min_freq
Minimum value for a[[i]] (or for the frequency of an item in the target frequency list) needed for its corresponding item to be included in the output.
- measures
Character vector containing the association measures (or related quantities) for which scores are requested. Supported measure names (and related quantities) are described in Value below.
If measures is NULL, it is interpreted as short for the default selection, i.e. c("exp_a", "DP_rows", "RR_rows", "OR", "MS", "Dice", "PMI", "chi2_signed", "G_signed", "t", "fisher").
If measures is "ALL", all supported measures are calculated (but not necessarily all the variants; see with_variants).
- with_variants
Logical. Whether, for the requested measures, all variants should be included in the output (TRUE) or only the main version (FALSE). See also p_fisher_2.
- show_dots
Logical. Whether a dot should be shown in the console each time calculations for a measure are finished.
- p_fisher_2
Logical. Only relevant if "fisher" is included in measures. If TRUE, the p-value for a two-sided test (testing for either attraction or repulsion) is also calculated. By default, only the (computationally less demanding) p-value for a one-sided test is calculated. See Value for more details.
- haldane
Logical. Should the Haldane-Anscombe correction be used? (See the Details section.)
If haldane is TRUE, and there is at least one zero frequency in a contingency table, the correction is used for all measures calculated for that table, not just for measures that need this to be done.
- small_pos
Alternative (but sometimes inferior) approach to dealing with zero frequencies, compared to haldane. The argument small_pos only applies when haldane is set to FALSE. (See the Details section.)
If haldane is FALSE, and there is at least one zero frequency in a contingency table, adding small positive values to the zero frequency cells is done systematically for all measures calculated for that table, not just for measures that need this to be done.
- a
Numeric vector expressing how many times some tested item occurs in the target context. More specifically, a[[i]], with i an integer, expresses how many times the i-th tested item occurs in the target context.
- b
Numeric vector expressing how many times items other than the tested item occur in the target context. More specifically, b[[i]], with i an integer, expresses how many times items other than the i-th tested item occur in the target context.
- c
Numeric vector expressing how many times some tested item occurs in the reference context. More specifically, c[[i]], with i an integer, expresses how many times the i-th tested item occurs in the reference context.
- d
Numeric vector expressing how many times items other than the tested item occur in the reference context. More specifically, d[[i]], with i an integer, expresses how many times items other than the i-th tested item occur in the reference context.
- types
A character vector containing the names of the linguistic items of which the association scores are to be calculated, or NULL. If NULL, assoc_abcd() creates dummy types such as "t001", "t002", etc.
Value
An object of class assoc_scores. This is a kind of data frame with
as its rows all items from either the target frequency list or the reference
frequency list with a frequency of at least min_freq in the target list,
and as its columns a range of measures that express the extent to which
the items are attracted to the target context (when compared to the reference
context).
Some columns don't contain actual measures but rather additional information
that is useful for interpreting other measures.
Possible columns
The following sections describe the (possible) columns in the output. All
of these measures are reported if measures is set to "ALL". Alternatively,
each measure can be requested by specifying its name in a character vector
given to the measures argument. Exceptions are described in the sections
below.
Observed and expected frequencies
- a, b, c, d: The frequencies in cells a, b, c and d, respectively. If one of them is 0, they will be augmented by 0.5 or small_pos (see Details). These output columns are always present.
- dir: The direction of the association: 1 in case of relative attraction between the tested item and the target context (if \(\frac{a}{m} \ge \frac{c}{n}\)) and -1 in case of relative repulsion between the tested item and the target context (if \(\frac{a}{m} < \frac{c}{n}\)).
- exp_a, exp_b, exp_c, exp_d: The expected values for cells a, b, c and d, respectively. All these columns will be included if "expected" is in measures. exp_a is also one of the default measures and is therefore included if measures is NULL. The values of these columns are computed as follows:
  exp_a = \(\frac{m \times k}{N}\)
  exp_b = \(\frac{m \times l}{N}\)
  exp_c = \(\frac{n \times k}{N}\)
  exp_d = \(\frac{n \times l}{N}\)
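As a rough illustration of the expected-frequency formulas, the following base R sketch derives the marginal totals and the expected cell values from the four observed counts. The helper name expected_freqs() is hypothetical and not part of the package, and no zero correction is applied; the input values are those of the item "four" in the Examples section.

```r
# Hypothetical helper (not part of the package): expected cell frequencies
# of a two-by-two contingency table, computed from its marginal totals.
expected_freqs <- function(a, b, c, d) {
  m <- a + b  # row total: target context
  n <- c + d  # row total: reference context
  k <- a + c  # column total: tested item
  l <- b + d  # column total: any other item
  N <- m + n  # grand total
  data.frame(
    exp_a = m * k / N,
    exp_b = m * l / N,
    exp_c = n * k / N,
    exp_d = n * l / N
  )
}

# Counts of the item "four" from the Examples section.
exp_four <- expected_freqs(10, 200, 100, 300)
exp_four$exp_a  # compare with the exp_a column reported for "four"
```

The four expected values always sum to N, which is a quick sanity check on the marginals.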
Effect size measures
Some of these measures are based on proportions and can therefore be computed either on the rows or on the columns of the contingency table. Each measure can be requested on its own, but pairs of measures can also be requested with the first part of their name, as indicated in their corresponding descriptions.
- DP_rows and DP_cols: The difference of proportions, sometimes also called Delta-p (\(\Delta p\)), between rows and columns respectively. Both columns are present if "DP" is included in measures. DP_rows is also included if measures is NULL. They are calculated as follows:
  DP_rows = \(\frac{a}{m} - \frac{c}{n}\)
  DP_cols = \(\frac{a}{k} - \frac{b}{l}\)
- perc_DIFF_rows and perc_DIFF_cols: These measures can be seen as normalized versions of Delta-p, i.e. essentially the same measures divided by the reference proportion and multiplied by 100. They therefore express how large the difference of proportions is, relative to the reference proportion. The multiplication by 100 turns the resulting 'relative difference of proportion' into a percentage. Both columns are present if "perc_DIFF" is included in measures. They are calculated as follows:
  perc_DIFF_rows = \(100 * \frac{(a / m) - (c / n)}{c / n}\)
  perc_DIFF_cols = \(100 * \frac{(a / k) - (b / l)}{b / l}\)
- DC_rows and DC_cols: The difference coefficient can be seen as a normalized version of Delta-p, i.e. essentially dividing the difference of proportions by the sum of proportions. Both columns are present if "DC" is included in measures. They are calculated as follows:
  DC_rows = \(\frac{(a / m) - (c / n)}{(a / m) + (c / n)}\)
  DC_cols = \(\frac{(a / k) - (b / l)}{(a / k) + (b / l)}\)
- RR_rows and RR_cols: Relative risk for the rows and columns respectively. RR_rows then represents how large the proportion in the target context is, relative to the proportion in the reference context. Both columns are present if "RR" is included in measures. RR_rows is also included if measures is NULL. They are calculated as follows:
  RR_rows = \(\frac{a / m}{c / n}\)
  RR_cols = \(\frac{a / k}{b / l}\)
- LR_rows and LR_cols: The so-called 'log ratio' of the rows and columns, respectively. It can be seen as a transformed version of the relative risk, viz. its binary log. Both columns are present if "LR" is included in measures. They are calculated as follows:
  LR_rows = \(\log_2\left(\frac{a / m}{c / n}\right)\)
  LR_cols = \(\log_2\left(\frac{a / k}{b / l}\right)\)
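The row-based and column-based variants differ only in which pair of proportions they compare: a/m against c/n for rows, a/k against b/l for columns. The base R sketch below makes that explicit for a few of the measures. The helper prop_measures() is hypothetical (not part of the package) and applies no zero correction.

```r
# Hypothetical helper (not part of the package): proportion-based effect
# sizes on the rows and on the columns of a two-by-two contingency table.
prop_measures <- function(a, b, c, d) {
  m <- a + b; n <- c + d  # row totals
  k <- a + c; l <- b + d  # column totals
  p_target <- a / m       # proportion of the tested item in the target context
  p_ref    <- c / n       # proportion of the tested item in the reference context
  p_item   <- a / k       # share of the tested item that falls in the target context
  p_other  <- b / l       # share of all other items that falls in the target context
  list(
    DP_rows        = p_target - p_ref,
    DP_cols        = p_item - p_other,
    perc_DIFF_rows = 100 * (p_target - p_ref) / p_ref,
    DC_rows        = (p_target - p_ref) / (p_target + p_ref),
    RR_rows        = p_target / p_ref,
    RR_cols        = p_item / p_other,
    LR_rows        = log2(p_target / p_ref)
  )
}

# Counts of the item "four" from the Examples section.
pm <- prop_measures(10, 200, 100, 300)
pm$DP_rows  # compare with the DP_rows column reported for "four"
pm$RR_rows  # compare with the RR_rows column reported for "four"
```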
Other measures use the contingency table in a different way and therefore
don't have a complementary row/column pair. In order to retrieve these columns,
if measures is not "ALL", their name must be in the measures vector.
Some of them are included by default, i.e. if measures is NULL.
- OR: The odds ratio, which can be calculated either as \(\frac{a/b}{c/d}\) or as \(\frac{a/c}{b/d}\). This column is present if measures is NULL.
- log_OR: The log odds ratio, which can be calculated either as \(\log\left(\frac{a/b}{c/d}\right)\) or as \(\log\left(\frac{a/c}{b/d}\right)\). In other words, it is the natural log of the odds ratio.
- MS: The minimum sensitivity, which is calculated as \(\min(\frac{a}{m}, \frac{a}{k})\). In other words, it is either \(\frac{a}{m}\) or \(\frac{a}{k}\), whichever is lowest. This column is present if measures is NULL.
- Jaccard: The Jaccard index, which is calculated as \(\frac{a}{a + b + c}\). It expresses a, which is the frequency of the tested item in the target context, relative to a + b + c, i.e. the frequency of all cases that involve the tested item, the target context, or both.
- Dice: The Dice coefficient, which is calculated as \(\frac{2a}{m + k}\). It expresses the harmonic mean of \(\frac{a}{m}\) and \(\frac{a}{k}\). This column is present if measures is NULL.
- logDice: An adapted version of the Dice coefficient. It is calculated as \(14 + \log_2\left(\frac{2a}{m + k}\right)\). In other words, it is 14 plus the binary log of the Dice coefficient.
- phi: The phi coefficient (\(\phi\)), which is calculated as \(\frac{(a \times d) - (b \times c)}{ \sqrt{m \times n \times k \times l}}\).
- Q: Yule's Q, which is calculated as \(\frac{(a \times d) - (b \times c)}{(a \times d) + (b \times c)}\).
- mu: The measure mu (\(\mu\)), which is calculated as \(\frac{a}{\mathrm{exp\_a}}\) (see exp_a).
- PMI and pos_PMI: (Positive) pointwise mutual information, which can be seen as a modification of the mu measure and is calculated as \(\log_2\left(\frac{a}{\mathrm{exp\_a}}\right)\). In pos_PMI, negative values are set to 0. The PMI column is present if measures is NULL.
- PMI2 and PMI3: Modified versions of PMI that aim to give relatively more weight to cases with relatively higher a. However, because of this modification, they are not pure effect size measures any more.
  PMI2 = \(\log_2\left(\frac{a^2}{\mathrm{exp\_a}}\right)\)
  PMI3 = \(\log_2\left(\frac{a^3}{\mathrm{exp\_a}}\right)\)
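Several of these measures can be reproduced with a few lines of base R, which may help when checking individual scores by hand. The helper other_measures() below is a hypothetical sketch, not the package's implementation, and it applies no zero correction.

```r
# Hypothetical helper (not part of the package): a selection of the
# measures that have no row/column pair.
other_measures <- function(a, b, c, d) {
  m <- a + b; n <- c + d; k <- a + c; l <- b + d; N <- m + n
  exp_a <- m * k / N               # expected frequency of cell a
  dice  <- 2 * a / (m + k)
  pmi   <- log2(a / exp_a)
  list(
    OR      = (a / b) / (c / d),
    log_OR  = log((a / b) / (c / d)),   # natural log of the odds ratio
    MS      = pmin(a / m, a / k),       # element-wise minimum
    Dice    = dice,
    logDice = 14 + log2(dice),
    phi     = (a * d - b * c) / sqrt(m * n * k * l),
    mu      = a / exp_a,
    PMI     = pmi,
    pos_PMI = pmax(pmi, 0)              # negative PMI values set to 0
  )
}

# Counts of the item "four" from the Examples section.
om <- other_measures(10, 200, 100, 300)
om$OR    # compare with the OR column reported for "four"
om$PMI   # compare with the PMI column reported for "four"
```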
Strength of evidence measures
The first measures in this section tend to come in triples: a test statistic,
its p-value (preceded by p_) and its signed version (followed by _signed).
The test statistics indicate evidence of either attraction or repulsion.
Thus, in order to indicate the direction of the relationship, a negative
sign is added in the "signed" version when \(\frac{a}{m} < \frac{c}{n}\).
In each of these cases, the name of the main measure (e.g. "chi2")
and/or its signed counterpart (e.g. "chi2_signed") must be in the measures
argument, or measures must be "ALL", for the columns to be included in
the output. If the main measure is requested, the signed counterpart will
also be included, but if only the signed counterpart is requested, the non-signed
version will be excluded.
For the p-value to be retrieved, either the main measure or its signed version
must be requested and, additionally, the with_variants argument must be
set to TRUE.
- chi2, p_chi2 and chi2_signed: The chi-squared test statistic (\(\chi^2\)) as used in a chi-squared test of independence or in a chi-squared test of homogeneity for a two-by-two contingency table. Scores of this measure are high when there is strong evidence for attraction, but also when there is strong evidence for repulsion. The chi2_signed column is present if measures is NULL. chi2 is calculated as follows:
  $$ \frac{(a-\mathrm{exp\_a})^2}{\mathrm{exp\_a}} + \frac{(b-\mathrm{exp\_b})^2}{\mathrm{exp\_b}} + \frac{(c-\mathrm{exp\_c})^2}{\mathrm{exp\_c}} + \frac{(d-\mathrm{exp\_d})^2}{\mathrm{exp\_d}} $$
- chi2_Y, p_chi2_Y and chi2_Y_signed: The chi-squared test statistic (\(\chi^2\)) as used in a chi-squared test with Yates correction for a two-by-two contingency table. chi2_Y is calculated as follows:
  $$ \frac{(|a-\mathrm{exp\_a}| - 0.5)^2}{\mathrm{exp\_a}} + \frac{(|b-\mathrm{exp\_b}| - 0.5)^2}{\mathrm{exp\_b}} + \frac{(|c-\mathrm{exp\_c}| - 0.5)^2}{\mathrm{exp\_c}} + \frac{(|d-\mathrm{exp\_d}| - 0.5)^2}{\mathrm{exp\_d}} $$
- chi2_2T, p_chi2_2T and chi2_2T_signed: The chi-squared test statistic (\(\chi^2\)) as used in a chi-squared goodness-of-fit test applied to the first column of the contingency table. The "2T" in the name stands for 'two terms' (as opposed to chi2, which is sometimes called the 'four terms' version). chi2_2T is calculated as follows:
  $$ \frac{(a-\mathrm{exp\_a})^2}{\mathrm{exp\_a}} + \frac{(c-\mathrm{exp\_c})^2}{\mathrm{exp\_c}} $$
- chi2_2T_Y, p_chi2_2T_Y and chi2_2T_Y_signed: The chi-squared test statistic (\(\chi^2\)) as used in a chi-squared goodness-of-fit test with Yates correction, applied to the first column of the contingency table. chi2_2T_Y is calculated as follows:
  $$ \frac{(|a-\mathrm{exp\_a}| - 0.5)^2}{\mathrm{exp\_a}} + \frac{(|c-\mathrm{exp\_c}| - 0.5)^2}{\mathrm{exp\_c}} $$
- G, p_G and G_signed: G test statistic, which is also sometimes called log-likelihood ratio (LLR) and, somewhat confusingly, G-squared. This is the test statistic as used in a log-likelihood ratio test for independence or homogeneity in a two-by-two contingency table. Scores are high in case of strong evidence for attraction, but also in case of strong evidence of repulsion. The G_signed column is present if measures is NULL. G is calculated as follows:
  $$ 2 \left( a \times \log\left(\frac{a}{\mathrm{exp\_a}}\right) + b \times \log\left(\frac{b}{\mathrm{exp\_b}}\right) + c \times \log\left(\frac{c}{\mathrm{exp\_c}}\right) + d \times \log\left(\frac{d}{\mathrm{exp\_d}}\right) \right) $$
- G_2T, p_G_2T and G_2T_signed: The test statistic used in a log-likelihood ratio test for goodness-of-fit applied to the first column of the contingency table. The "2T" stands for 'two terms'. G_2T is calculated as follows:
  $$ 2 \left( a \times \log\left(\frac{a}{\mathrm{exp\_a}}\right) + c \times \log\left(\frac{c}{\mathrm{exp\_c}}\right) \right) $$
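To make the relation between a test statistic, its signed version and its p-value concrete, here is a minimal base R sketch of the four-term chi-squared and G statistics for a single table. The helper chi2_and_G() is hypothetical and not the package's implementation. Two assumptions are made: the sign follows the dir column (negative when a/m < c/n), and the p-values are upper-tail probabilities of a chi-squared distribution with one degree of freedom. No zero correction is applied.

```r
# Hypothetical helper (not part of the package): four-term chi-squared and
# G statistics for ONE two-by-two table (scalar a, b, c, d).
chi2_and_G <- function(a, b, c, d) {
  m <- a + b; n <- c + d; k <- a + c; l <- b + d; N <- m + n
  obs  <- c(a, b, c, d)
  expd <- c(m * k, m * l, n * k, n * l) / N      # exp_a, exp_b, exp_c, exp_d
  chi2 <- sum((obs - expd)^2 / expd)
  G    <- 2 * sum(obs * log(obs / expd))
  sgn  <- if (a / m < c / n) -1 else 1           # assumed: same direction as dir
  c(
    chi2        = chi2,
    chi2_signed = sgn * chi2,
    p_chi2      = pchisq(chi2, df = 1, lower.tail = FALSE),  # assumed df = 1
    G           = G,
    G_signed    = sgn * G,
    p_G         = pchisq(G, df = 1, lower.tail = FALSE)      # assumed df = 1
  )
}

# Items "four" (repulsion) and "fictitious" (attraction) from the Examples section.
res_four       <- chi2_and_G(10, 200, 100, 300)
res_fictitious <- chi2_and_G(30, 1000, 14, 5000)
res_four[["chi2_signed"]]     # compare with chi2_signed reported for "four"
res_fictitious[["G_signed"]]  # compare with G_signed reported for "fictitious"
```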
The final two groups of measures take a different shape. The _as_chisq1
columns compute qchisq(1 - p, 1), with p being the p-values they are
transforming, i.e. the value that has right-tail probability p in a \(\chi^2\)
distribution with one degree of freedom (see p_to_chisq1()).
- t, p_t_1, t_1_as_chisq1, p_t_2 and t_2_as_chisq1: The t-test statistic, used for a t-test for the proportion \(\frac{a}{N}\) in which the null hypothesis is based on \(\frac{k}{N}\times\frac{m}{N}\). Column t is present if "t" is included in measures or if measures is "ALL" or NULL. The other four columns are present if t is requested and if, additionally, with_variants is TRUE.
  t = \( \frac{ a/N - (k/N) \times (m/N) }{ \sqrt{((a/N)\times (1-a/N))/N} } \)
  p_t_1 is the p-value that corresponds to t when assuming a one-tailed test that only looks at attraction; t_1_as_chisq1 is its transformation.
  p_t_2 is the p-value that corresponds to t when assuming a two-tailed test, viz. that looks at both attraction and repulsion; t_2_as_chisq1 is its transformation.
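The t statistic compares the observed proportion a/N with the proportion expected under the null hypothesis, k/N × m/N, and scales the difference by the estimated standard error of the observed proportion. A minimal, vectorised base R sketch follows; t_score() is a hypothetical helper, not the package's implementation, and applies no zero correction.

```r
# Hypothetical helper (not part of the package): t statistic for the
# proportion a/N against the null proportion (k/N) * (m/N).
t_score <- function(a, b, c, d) {
  m <- a + b
  k <- a + c
  N <- a + b + c + d
  p_obs  <- a / N                # observed proportion
  p_null <- (k / N) * (m / N)    # proportion expected under the null hypothesis
  (p_obs - p_null) / sqrt(p_obs * (1 - p_obs) / N)
}

# The four items from the Examples section.
t_vals <- t_score(
  a = c(10, 30, 15, 1),
  b = c(200, 1000, 5000, 300),
  c = c(100, 14, 16, 4),
  d = c(300, 5000, 10000, 6000)
)
t_vals  # compare with the t column in the Examples output
```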
- p_fisher_1, fisher_1_as_chisq1, p_fisher_1r, fisher_1r_as_chisq1: The p-value of a one-sided Fisher exact test. The column p_fisher_1 is present if either "fisher" or "p_fisher" is in measures or if measures is "ALL" or NULL. The other columns are present if p_fisher_1 has been requested and if, additionally, with_variants is TRUE. p_fisher_1 and p_fisher_1r are the p-values of the Fisher exact test that look at attraction and repulsion respectively. fisher_1_as_chisq1 and fisher_1r_as_chisq1 are their respective transformations.
- p_fisher_2 and fisher_2_as_chisq1: p-value for a two-sided Fisher exact test, viz. looking at both attraction and repulsion. p_fisher_2 returns the p-value and fisher_2_as_chisq1 is its transformation. The p_fisher_2 column is present if either "fisher" or "p_fisher_1" is in measures or if measures is "ALL" or NULL and if, additionally, p_fisher_2 is TRUE. fisher_2_as_chisq1 is present if p_fisher_2 was requested and, additionally, with_variants is TRUE.
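A one-sided Fisher exact p-value for a two-by-two table is a tail probability of the hypergeometric distribution, so it can be sketched with phyper() from base R. The helpers below are hypothetical and not the package's implementation. The attraction variant is taken to be P(X >= a) and the repulsion variant P(X <= a), which is an assumption about how the two directions are defined. The last helper applies the qchisq(1 - p, 1) transformation described for the _as_chisq1 columns.

```r
# Hypothetical helpers (not part of the package).
# X = number of tested-item tokens among the m tokens of the target context,
# drawn from k tested-item tokens and l other tokens.
fisher_attraction <- function(a, b, c, d) {
  # P(X >= a), i.e. evidence for attraction
  phyper(a - 1, a + c, b + d, a + b, lower.tail = FALSE)
}
fisher_repulsion <- function(a, b, c, d) {
  # P(X <= a), i.e. evidence for repulsion
  phyper(a, a + c, b + d, a + b)
}
# Transformation used by the *_as_chisq1 columns: qchisq(1 - p, 1)
p_as_chisq1 <- function(p) qchisq(1 - p, df = 1)

# Item "examples" from the Examples section.
p_attr <- fisher_attraction(1, 300, 4, 6000)
p_attr               # compare with p_fisher_1 reported for "examples"
p_as_chisq1(p_attr)  # the same p-value on the chi-squared (df = 1) scale
```

The attraction p-value can be cross-checked against fisher.test(matrix(c(a, c, b, d), nrow = 2), alternative = "greater")$p.value from the stats package.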
Properties of the class
An object of class assoc_scores has:
- associated as.data.frame(), print(), sort() and tibble::as_tibble() methods,
- an interactive explore() method and useful getters, viz. n_types() and type_names().
An object of this class can be saved to file with write_assoc() and read
with read_assoc().
Details
Input and output
assoc_scores() takes as its arguments a target frequency list and a reference
frequency list (either as two freqlist objects or as a cooc_info object)
and returns a number of popular measures
expressing, for (almost) every item in either one of these lists, the extent
to which the item is attracted to the target context, when compared to the
reference context. The "almost" is added between parentheses because, with
the default settings, some items are automatically excluded from the output
(see min_freq).
assoc_abcd() takes as its arguments four vectors a, b, c, and d, of
equal length. Each tuple of values (a[i], b[i], c[i], d[i]), with i some
integer number between 1 and the length of the vectors, is assumed to represent
the four numbers a, b, c, d in a contingency table of the type:
                  | tested item | any other item | total |
target context    | a           | b              | m     |
reference context | c           | d              | n     |
total             | k           | l              | N     |
In the above table m, n, k, l and N are marginal frequencies. More specifically, m = a + b, n = c + d, k = a + c, l = b + d and N = m + n.
Dealing with zeros
Several of the association measures break down when one or more of the values
a, b, c, and d are zero (for instance, because this would lead to
division by zero or taking the log of zero). This can be dealt with in different
ways, such as the Haldane-Anscombe correction.
Strictly speaking, Haldane-Anscombe correction specifically applies to the
context of (log) odds ratios for two-by-two tables and boils down to adding
0.5 to each of the four values a, b, c, and d
in every two-by-two contingency table for which the original values
a, b, c, and d would not allow us to calculate
the (log) odds ratio, which happens when one (or more than one) of the four
cells is zero.
Using the Haldane-Anscombe correction, the (log) odds ratio is then calculated
on the basis of these 'corrected' values for a, b, c, and d.
However, because other measures that do not compute (log) odds ratios might also break down when some value is zero, all measures will be computed on the 'corrected' contingency matrix.
If the haldane argument is set to FALSE, division by zero or taking the
log of zero is avoided by systematically adding a small positive value to all
zero values for a, b, c, and d. The argument small_pos
determines which small positive value is added in such cases. Its default value is 0.00001.
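The two strategies can be sketched in base R as follows. The helper correct_zeros() is hypothetical and only mirrors the description above; it is not the package's implementation. With haldane = TRUE, 0.5 is added to all four cells of every table that contains at least one zero; with haldane = FALSE, only the zero cells themselves are replaced by small_pos.

```r
# Hypothetical helper (not part of the package): two ways of handling
# zero cells, following the description in this section.
correct_zeros <- function(a, b, c, d, haldane = TRUE, small_pos = 1e-05) {
  tab <- cbind(a = a, b = b, c = c, d = d)   # one row per contingency table
  if (haldane) {
    has_zero <- rowSums(tab == 0) > 0        # tables with at least one zero cell
    # has_zero is recycled down the columns, so each row i gets
    # 0.5 * has_zero[i] added to all four of its cells.
    tab <- tab + 0.5 * has_zero
  } else {
    tab[tab == 0] <- small_pos               # touch only the zero cells
  }
  as.data.frame(tab)
}

# First table has a zero cell, second table has none.
z_haldane <- correct_zeros(c(0, 10), c(5, 200), c(3, 100), c(7, 300))
z_small   <- correct_zeros(0, 5, 3, 7, haldane = FALSE)

# The log odds ratio is now defined for the corrected first table.
with(z_haldane[1, ], log((a / b) / (c / d)))
```

Tables without any zero cell are left untouched by either strategy, so scores for such items do not depend on the haldane or small_pos settings.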
Examples
assoc_abcd(10, 200, 100, 300, types = "four")
#> Association scores (types in list: 1)
#> type a PMI G_signed| b c d dir exp_a DP_rows RR_rows OR
#> 1 four 10 -1.921 -45.432|200 100 300 -1 37.869 -0.202 0.19 0.15
#> <number of extra columns to the right: 5>
#>
assoc_abcd(30, 1000, 14, 5000, types = "fictitious")
#> Association scores (types in list: 1)
#> type a PMI G_signed| b c d dir exp_a DP_rows RR_rows
#> 1 fictitious 30 2 56.959|1000 14 5000 1 7.498 0.026 10.431
#> <number of extra columns to the right: 6>
#>
assoc_abcd(15, 5000, 16, 1000, types = "toy")
#> Association scores (types in list: 1)
#> type a PMI G_signed| b c d dir exp_a DP_rows RR_rows OR
#> 1 toy 15 -0.781 -19.723|5000 16 1000 -1 25.778 -0.013 0.19 0.188
#> <number of extra columns to the right: 5>
#>
assoc_abcd(1, 300, 4, 6000, types = "examples")
#> Association scores (types in list: 1)
#> type a PMI G_signed| b c d dir exp_a DP_rows RR_rows OR
#> 1 examples 1 2.067 1.473|300 4 6000 1 0.239 0.003 4.987 5
#> <number of extra columns to the right: 5>
#>
a <- c(10, 30, 15, 1)
b <- c(200, 1000, 5000, 300)
c <- c(100, 14, 16, 4)
d <- c(300, 5000, 10000, 6000)
types <- c("four", "fictitious", "toy", "examples")
(scores <- assoc_abcd(a, b, c, d, types = types))
#> Association scores (types in list: 4)
#> type a PMI G_signed| b c d dir exp_a DP_rows
#> 1 four 10 -1.921 -45.432| 200 100 300 -1 37.869 -0.202
#> 2 fictitious 30 2.000 56.959|1000 14 5000 1 7.498 0.026
#> 3 toy 15 0.536 2.984|5000 16 10000 1 10.343 0.001
#> 4 examples 1 2.067 1.473| 300 4 6000 1 0.239 0.003
#> <number of extra columns to the right: 7>
#>
as_data_frame(scores)
#> type a b c d dir exp_a DP_rows RR_rows OR
#> 1 four 10 200 100 300 -1 37.8688525 -0.202380952 0.1904762 0.15000
#> 2 fictitious 30 1000 14 5000 1 7.4983455 0.026334032 10.4313454 10.71429
#> 3 toy 15 5000 16 10000 1 10.3429579 0.001393583 1.8723829 1.87500
#> 4 examples 1 300 4 6000 1 0.2386994 0.002656037 4.9867110 5.00000
#> MS Dice PMI chi2_signed G_signed t
#> 1 0.047619048 0.062500000 -1.9210117 -38.158009 -45.431519 -8.8860423
#> 2 0.029126214 0.055865922 2.0003183 81.993003 56.958917 4.1184552
#> 3 0.002991027 0.005945303 0.5363137 3.153303 2.983872 1.2030435
#> 4 0.003322259 0.006535948 2.0667329 2.551819 1.473313 0.7613609
#> p_fisher_1
#> 1 1.000000e+00
#> 2 6.106227e-14
#> 3 5.916695e-02
#> 4 2.170331e-01
as_tibble(scores)
#> # A tibble: 4 × 17
#> type a b c d dir exp_a DP_rows RR_rows OR MS
#> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 four 10 200 100 300 -1 37.9 -0.202 0.190 0.15 0.0476
#> 2 fictitious 30 1000 14 5000 1 7.50 0.0263 10.4 10.7 0.0291
#> 3 toy 15 5000 16 10000 1 10.3 0.00139 1.87 1.88 0.00299
#> 4 examples 1 300 4 6000 1 0.239 0.00266 4.99 5 0.00332
#> # … with 6 more variables: Dice <dbl>, PMI <dbl>, chi2_signed <dbl>,
#> # G_signed <dbl>, t <dbl>, p_fisher_1 <dbl>
print(scores, sort_order = "PMI")
#> Association scores (types in list: 4, sort order criterion: PMI)
#> type a PMI G_signed| b c d dir exp_a DP_rows
#> 1 examples 1 2.067 1.473| 300 4 6000 1 0.239 0.003
#> 2 fictitious 30 2.000 56.959|1000 14 5000 1 7.498 0.026
#> 3 toy 15 0.536 2.984|5000 16 10000 1 10.343 0.001
#> 4 four 10 -1.921 -45.432| 200 100 300 -1 37.869 -0.202
#> <number of extra columns to the right: 7>
#>
print(scores, sort_order = "alpha")
#> Association scores (types in list: 4, sort order criterion: alpha)
#> type a PMI G_signed| b c d dir exp_a DP_rows
#> 1 examples 1 2.067 1.473| 300 4 6000 1 0.239 0.003
#> 2 fictitious 30 2.000 56.959|1000 14 5000 1 7.498 0.026
#> 3 four 10 -1.921 -45.432| 200 100 300 -1 37.869 -0.202
#> 4 toy 15 0.536 2.984|5000 16 10000 1 10.343 0.001
#> <number of extra columns to the right: 7>
#>
print(scores, sort_order = "none")
#> Association scores (types in list: 4)
#> type a PMI G_signed| b c d dir exp_a DP_rows
#> 1 four 10 -1.921 -45.432| 200 100 300 -1 37.869 -0.202
#> 2 fictitious 30 2.000 56.959|1000 14 5000 1 7.498 0.026
#> 3 toy 15 0.536 2.984|5000 16 10000 1 10.343 0.001
#> 4 examples 1 2.067 1.473| 300 4 6000 1 0.239 0.003
#> <number of extra columns to the right: 7>
#>
print(scores, sort_order = "nonsense")
#> Association scores (types in list: 4)
#> type a PMI G_signed| b c d dir exp_a DP_rows
#> 1 four 10 -1.921 -45.432| 200 100 300 -1 37.869 -0.202
#> 2 fictitious 30 2.000 56.959|1000 14 5000 1 7.498 0.026
#> 3 toy 15 0.536 2.984|5000 16 10000 1 10.343 0.001
#> 4 examples 1 2.067 1.473| 300 4 6000 1 0.239 0.003
#> <number of extra columns to the right: 7>
#>
print(scores, sort_order = "PMI",
      keep_cols = c("a", "exp_a", "PMI", "G_signed"))
#> Association scores (types in list: 4, sort order criterion: PMI)
#> type a PMI G_signed| b c d dir exp_a DP_rows
#> 1 examples 1 2.067 1.473| 300 4 6000 1 0.239 0.003
#> 2 fictitious 30 2.000 56.959|1000 14 5000 1 7.498 0.026
#> 3 toy 15 0.536 2.984|5000 16 10000 1 10.343 0.001
#> 4 four 10 -1.921 -45.432| 200 100 300 -1 37.869 -0.202
#> <number of extra columns to the right: 7>
#>
print(scores, sort_order = "PMI",
      keep_cols = c("a", "b", "c", "d", "exp_a", "G_signed"))
#> Association scores (types in list: 4, sort order criterion: PMI)
#> type a G_signed| b c d dir exp_a DP_rows RR_rows
#> 1 examples 1 1.473| 300 4 6000 1 0.239 0.003 4.987
#> 2 fictitious 30 56.959|1000 14 5000 1 7.498 0.026 10.431
#> 3 toy 15 2.984|5000 16 10000 1 10.343 0.001 1.872
#> 4 four 10 -45.432| 200 100 300 -1 37.869 -0.202 0.190
#> <number of extra columns to the right: 6>
#>
print(scores, sort_order = "PMI",
      drop_cols = c("a", "b", "c", "d", "exp_a", "G_signed",
                    "RR_rows", "chi2_signed", "t"))
#> Association scores (types in list: 4, sort order criterion: PMI)
#> type a PMI G_signed| b c d dir exp_a DP_rows
#> 1 examples 1 2.067 1.473| 300 4 6000 1 0.239 0.003
#> 2 fictitious 30 2.000 56.959|1000 14 5000 1 7.498 0.026
#> 3 toy 15 0.536 2.984|5000 16 10000 1 10.343 0.001
#> 4 four 10 -1.921 -45.432| 200 100 300 -1 37.869 -0.202
#> <number of extra columns to the right: 7>
#>