Function Reference · DSUtils.jl

DSUtils.BCDiag
DSUtils.accuracyplot
DSUtils.auroc
DSUtils.bcdiag
DSUtils.biasplot
DSUtils.concordance
DSUtils.cumliftable
DSUtils.cumliftcurve
DSUtils.infovalue
DSUtils.ksplot
DSUtils.kstest
DSUtils.liftable
DSUtils.liftcurve
DSUtils.onehot
DSUtils.onehot!
DSUtils.onehot!
DSUtils.ranks
DSUtils.rocplot
DSUtils.strstd
DSUtils.sumxm

DSUtils.BCDiag — Type

BCDiag

A structure holding the diagnostic properties of a binary classifier. Facilitates summary plots and tables.

DSUtils.accuracyplot — Method

accuracyplot(x::BCDiag; util=[1, 0, 0, 1])

Using util values for [TP, FN, FP, TN], produce the accuracy plot and its [max, argmax, argdep]. The default util values of [1, 0, 0, 1] give the standard accuracy value of (TP+TN)/N.
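
The role of util can be illustrated with a small sketch. It assumes the utility-weighted value is the weighted sum of the four confusion-matrix counts divided by N, which matches the stated default but is not confirmed for other weights. weighted_accuracy is a hypothetical helper, not part of DSUtils.

```julia
# Utility-weighted accuracy from confusion-matrix counts.
# `weighted_accuracy` is a hypothetical helper, not a DSUtils function.
# Assumption: the weighted sum of counts is divided by the total count N.
function weighted_accuracy(tp, fn, fp, tn; util = [1, 0, 0, 1])
    counts = [tp, fn, fp, tn]          # same order as util: [TP, FN, FP, TN]
    return sum(util .* counts) / sum(counts)
end

# Default weights reduce to (TP + TN) / N.
acc = weighted_accuracy(40, 10, 5, 45)

# Penalizing misses and false alarms changes the value being maximized.
cost = weighted_accuracy(40, 10, 5, 45; util = [1, -2, -1, 0])
```

With the default weights the result is the ordinary accuracy; non-default weights let false negatives and false positives carry different costs.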

DSUtils.auroc — Method

auroc(class, var; tie = 1e-6)

Calculate the area under the Receiver Operating Characteristic (ROC) curve. class is a 2 level categorical variable; var is the distribution to analyze. Pairwise comparisons between class 1 values and class 0 values are made as follows:

class 1 value > class 0 value is Concordant
class 1 value ≈ class 0 value (within tie) is Tied
class 1 value < class 0 value is Discordant

Returns:

concordant, number of concordant comparisons
tied, number of tied comparisons
discordant, number of discordant comparisons
auc, or C, is (Concordant + 0.5 × Tied) / Total comparisons; same as numeric integration of the ROC curve
gini, 2C-1, also known as Somers' D, is (Concordant - Discordant) / Total comparisons

Note there are other rank order measures:

Goodman-Kruskal Gamma is (Concordant - Discordant) / (Concordant + Discordant), no penalty for ties
Kendall's Tau is (Concordant - Discordant) / (0.5 N(N-1))
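
The pairwise rules above can be sketched in a few lines of plain Julia. This is an illustrative brute-force version, not the package's implementation: pair_counts is a hypothetical name, class levels are assumed to be coded 1 and 0, and "within tie" is assumed to mean an absolute difference of at most tie.

```julia
# Pairwise concordance counts with a fixed tie tolerance.
# `pair_counts` is a hypothetical helper written for illustration only.
# Assumptions: class is coded 1/0 and a pair is tied when |a - b| <= tie.
function pair_counts(class, var; tie = 1e-6)
    v1 = var[class .== 1]              # values observed for class 1
    v0 = var[class .== 0]              # values observed for class 0
    concordant = tied = discordant = 0
    for a in v1, b in v0               # every class 1 vs. class 0 pair
        if abs(a - b) <= tie
            tied += 1
        elseif a > b
            concordant += 1
        else
            discordant += 1
        end
    end
    total = length(v1) * length(v0)
    auc  = (concordant + 0.5 * tied) / total
    gini = (concordant - discordant) / total      # equals 2 * auc - 1
    return (; concordant, tied, discordant, auc, gini)
end

class = [1, 1, 1, 0, 0]
score = [0.9, 0.6, 0.4, 0.6, 0.2]
res = pair_counts(class, score)
```

Here the six pairs split into 4 concordant, 1 tied, and 1 discordant, so auc is 0.75 and gini is 0.5. The double loop costs O(n1 × n0) and is only meant to make the definitions concrete.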

DSUtils.bcdiag — Method

bcdiag(target, pred; groups = 100, rev = true, tie = 1e-6)

Perform diagnostics of a binary classifier. target is a 2 level categorical variable, pred is the probability of class 1. groups is the number of bins to use for plotting/printing. rev = true orders pred from high to low. tie is the tolerance within which pred values are considered tied.

Returns a BCDiag struct which can be used for plotting or printing:

biasplot is calibration plot of target response rate vs. pred response rate
ksplot is the KS plot of the cumulative distributions
rocplot plots the Receiver Operating Characteristics curve
accuracyplot plots the accuracy curve with adjustable utility
liftcurve is the lift curve
cumliftcurve is the cumulative lift curve
liftable is the lift table as a DataFrame
cumliftable is the cumulative lift table as a DataFrame

DSUtils.biasplot — Method

biasplot(x::BCDiag)

Returns a bias calibration plot of x: actual response vs. predicted response.

DSUtils.concordance — Method

concordance(class, var, tie)

Concordance calculation with a flexible tied region (auroc uses a fixed-width region). class is a 2 level categorical variable, var is the distribution to analyze, and tie(x) returns the lower and upper bounds of the tied region of x.

Pairwise comparisons between class 1 values and class 0 values are made as follows: class 1 value > class 0 value is Concordant; class 1 value ≈ class 0 value (within tie) is Tied; class 1 value < class 0 value is Discordant.

Returns:

concordant, number of concordant comparisons
tied, number of tied comparisons
discordant, number of discordant comparisons
auroc, or C, is (Concordant + 0.5 × Tied) / Total comparisons; same as numeric integration of the ROC curve
gini, 2C-1, also known as Somers' D, is (Concordant - Discordant) / Total comparisons

Note that Goodman-Kruskal Gamma is (Concordant - Discordant) / (Concordant + Discordant), and Kendall's Tau is (Concordant - Discordant) / (0.5 × Total count × (Total count - 1)).
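
A tie function makes the tied region depend on the value being compared. The sketch below classifies a single pair under one possible reading of the description: tie is evaluated at the class 1 value and the class 0 value is tested against the returned interval. That reading, and the names classify_pair, reltie, and fixedtie, are assumptions made for illustration.

```julia
# Classify one (class 1, class 0) pair using a tie function.
# `classify_pair` is a hypothetical helper, not a DSUtils function.
# Assumption: tie is evaluated at the class 1 value `a`, and the class 0
# value `b` is tied when it falls inside the returned interval.
function classify_pair(a, b, tie)
    lo, hi = tie(a)
    if lo <= b <= hi
        return :tied
    elseif a > b
        return :concordant
    else
        return :discordant
    end
end

# A tied region that scales with the value (within 5% for positive x).
reltie(x) = (0.95 * x, 1.05 * x)

# A fixed-width region, comparable to a constant tolerance.
fixedtie(x) = (x - 1e-6, x + 1e-6)

r1 = classify_pair(100.0, 97.0, reltie)    # 97 lies inside the region around 100
r2 = classify_pair(100.0, 97.0, fixedtie)  # same pair under a narrow fixed width
r3 = classify_pair(1.0, 1.2, reltie)       # 1.2 lies above the region around 1
```

The same pair (100.0, 97.0) is tied under the relative region and concordant under the narrow fixed-width one, which is the practical difference a flexible tie function makes.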

DSUtils.cumliftable — Method

cumliftable(x::BCDiag)

Returns a cumulative lift table of x as a DataFrame.

DSUtils.cumliftcurve — Method

cumliftcurve(x::BCDiag)

Returns a cumulative lift curve plot of x: cumulative actual and predicted vs. depth.

DSUtils.infovalue — Method

infovalue(g::Vector{Integer}, b::Vector{Integer})

Calculates the information value of g and b, two vectors of binned frequency counts.

weight of evidence = log(density g / density b), 0 adjusted
infovalue = sum((density g - density b) * weight of evidence)

Industry rule of thumb:

iv <= 0.1 no significant change
0.1 < iv <= 0.25 minor change
0.25 < iv major change
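
The two formulas can be worked through on a small example. The sketch below is not the package's code: iv_sketch is a hypothetical name, and because the documentation does not say how the zero adjustment is done, replacing empty bins with a count of 0.5 is an assumption chosen only to keep the logarithm finite.

```julia
# Weight of evidence and information value from binned counts.
# `iv_sketch` is a hypothetical helper; the zero adjustment used here
# (replacing empty bins with a count of 0.5) is an assumption.
function iv_sketch(g::AbstractVector{<:Integer}, b::AbstractVector{<:Integer})
    gadj = [c == 0 ? 0.5 : float(c) for c in g]
    badj = [c == 0 ? 0.5 : float(c) for c in b]
    dg = gadj ./ sum(gadj)             # density of g across bins
    db = badj ./ sum(badj)             # density of b across bins
    woe = log.(dg ./ db)               # weight of evidence per bin
    return sum((dg .- db) .* woe)      # information value
end

iv_same  = iv_sketch([10, 20, 30], [10, 20, 30])   # identical distributions
iv_shift = iv_sketch([10, 20, 30], [30, 20, 10])   # mirrored distribution
```

Identical distributions give an information value of 0. The mirrored example gives (2/3) × log(3), roughly 0.73, which falls in the "major change" band of the rule of thumb.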

DSUtils.ksplot — Method

ksplot(x::BCDiag)

Returns a KS plot of x: CDF1 (True Positive) and CDF0 (False Positive) versus depth.

DSUtils.kstest — Method

kstest(class, var; rev = true)

Calculate the empirical two-sample Kolmogorov-Smirnov statistic and its location. class is a 2 level categorical variable, var is the distribution to analyze.

Returns:

n, total number of observations
n1, number of observations of class 1
n0, number of observations of class 0
baserate, incidence rate of class 1
ks, the maximum separation between the two cumulative distributions
ksarg, the value of var at which maximum separation is achieved
ksdep, depth of ksarg in the sorted values of var

rev = true counts depth from high value towards low value.
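
The ks and ksarg values can be illustrated with a direct computation of the two empirical CDFs. This is a minimal sketch under stated assumptions, not the package's algorithm: ks_sketch is a hypothetical name, class is assumed to be coded 1 and 0, and depth (ksdep) and the rev option are left out.

```julia
# Empirical two-sample KS statistic and its location.
# `ks_sketch` is a hypothetical helper, not the DSUtils implementation.
# Assumption: class is coded 1/0; depth and the rev option are not modeled.
function ks_sketch(class, var)
    v1 = var[class .== 1]
    v0 = var[class .== 0]
    ks, ksarg = 0.0, first(var)
    for t in sort(unique(var))
        cdf1 = count(<=(t), v1) / length(v1)   # empirical CDF of class 1 at t
        cdf0 = count(<=(t), v0) / length(v0)   # empirical CDF of class 0 at t
        d = abs(cdf1 - cdf0)
        if d > ks                              # keep the first maximum found
            ks, ksarg = d, t
        end
    end
    return (; ks, ksarg)
end

class = [1, 1, 1, 1, 0, 0, 0, 0]
var   = [0.9, 0.8, 0.7, 0.3, 0.6, 0.4, 0.2, 0.1]
res = ks_sketch(class, var)
```

In this example all class 0 values and one of four class 1 values lie at or below 0.6, so the separation peaks there at 1.0 - 0.25 = 0.75.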

DSUtils.liftable — Method

liftable(x::BCDiag)

Returns a lift table of x as a DataFrame.

DSUtils.liftcurve — Method

liftcurve(x::BCDiag)

Returns a lift curve plot of x: actual and predicted versus depth.

DSUtils.onehot! — Method

onehot!(df::AbstractDataFrame, vars::Vector{Symbol})

One hot encode variables in vars in df.

DSUtils.onehot! — Method

onehot!(df::AbstractDataFrame, var::Symbol)

One hot encode the unique values of var in df. Each new variable name is constructed as var_lvl. If df.var is a string, it is standardized first via strstd before one hot encoding.

DSUtils.onehot — Method

onehot(lvl, x::Vector)

Create an indicator variable of x equal to lvl. lvl and x can have missing values; ismissing is used where necessary to ensure a correct result.
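
The need for ismissing comes from Julia's three-valued logic: comparing with missing yields missing rather than false. The sketch below shows one way to build a missing-safe indicator. indicator is a hypothetical name, and returning Boolean values is an assumption; the package may use a different element type.

```julia
# Indicator of x equal to lvl that tolerates missing on either side.
# `indicator` is a hypothetical helper, not the DSUtils implementation.
# Assumption: the result is Boolean; the package may return another type.
function indicator(lvl, x::AbstractVector)
    if ismissing(lvl)
        return ismissing.(x)                       # flag the missing entries
    end
    return [!ismissing(v) && v == lvl for v in x]  # missing entries become false
end

x = ["a", missing, "b", "a"]

naive   = x .== "a"              # second element is missing, not false
is_a    = indicator("a", x)
is_miss = indicator(missing, x)
```

The naive comparison leaves a missing in the result, while the guarded version always yields a definite true or false.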

DSUtils.ranks — Method

ranks(x; groups = 10, rank = tiedrank, rev = false)

Return a variable which bins x into groups number of bins. The rank keyword allows different ranking methods; use rev = true to reverse the sort so that a small bin number corresponds to large values of x. Missing values are assigned to group missing.

The default values of rank = tiedrank and rev = false result in a grouping similar to SAS PROC RANK groups=n ties=mean.
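
For orientation, SAS PROC RANK assigns groups as floor(rank × k / (n + 1)), where rank is the mean rank under ties, k is the number of groups, and n is the number of nonmissing values. The sketch below applies that formula in plain Julia. meanrank and bins are hypothetical helpers, missing values are not handled, and whether ranks numbers its bins from 0 as SAS does is not stated in this reference.

```julia
# Mean rank of each element; tied values share the average of their positions.
# `meanrank` and `bins` are hypothetical helpers written for illustration.
function meanrank(x::AbstractVector)
    p = sortperm(x)
    r = zeros(Float64, length(x))
    i = 1
    while i <= length(x)
        j = i
        while j < length(x) && x[p[j + 1]] == x[p[i]]
            j += 1                     # extend the run of tied values
        end
        r[p[i:j]] .= (i + j) / 2       # average position of the tied run
        i = j + 1
    end
    return r
end

# Bin assignment following the PROC RANK convention floor(rank * k / (n + 1)).
# Assumption: no missing values; bins are numbered from 0 as in SAS.
bins(x; groups = 10) = floor.(Int, meanrank(x) .* groups ./ (length(x) + 1))

x = [10, 20, 20, 30, 40, 50, 60, 70]
b = bins(x; groups = 4)
```

The two tied values of 20 share the mean rank 2.5 and therefore land in the same bin.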

DSUtils.rocplot — Method

rocplot(x::BCDiag)

Returns a ROC plot of x: CDF1 (True Positive) vs. CDF0 (False Positive).

DSUtils.strstd — Method

strstd(s; empty = "")

Standardizes string s by stripping leading and trailing blanks and replacing embedded blanks with an underscore (_). If s is missing or all blanks, the result is the value of empty.

The default of empty is "", so missing values and all-blank strings become "".
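
The described behavior can be approximated with Base string functions. This is a sketch rather than the package's code: strstd_sketch is a hypothetical name, and replacing each embedded blank one-for-one (instead of collapsing runs of blanks) is an assumption.

```julia
# String standardization along the lines described above.
# `strstd_sketch` is a hypothetical helper; replacing every embedded blank
# one-for-one with an underscore is an assumption.
function strstd_sketch(s; empty = "")
    ismissing(s) && return empty       # missing maps to the empty value
    t = strip(s)                       # drop leading and trailing blanks
    isempty(t) && return empty         # all-blank strings map to empty too
    return replace(t, ' ' => '_')
end

a = strstd_sketch("  New York ")
b = strstd_sketch("    ")
c = strstd_sketch(missing; empty = "NA")
```

Passing a non-empty value for empty, as in the last call, gives missing and blank inputs a level name of their own.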

DSUtils.sumxm — Method

sumxm(x...)

Returns the sum of x, treating missing values as 0. If x is a varying number of scalars, their sum is returned. If x is a varying number of vectors, they are summed element-wise across the vectors.
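
Both the scalar and the vector case can be covered by replacing missing with 0 before summing. The one-liner below is a sketch of that idea using Base's coalesce. sumxm_sketch is a hypothetical name and may differ from how the package implements sumxm.

```julia
# Sum that treats missing as 0, for scalars or equal-length vectors.
# `sumxm_sketch` is a hypothetical helper, not the DSUtils implementation.
# coalesce.(v, 0) replaces each missing with 0, whether v is a scalar or a vector.
sumxm_sketch(x...) = sum(coalesce.(v, 0) for v in x)

plain   = 1 + missing + 3                                     # missing propagates
scalars = sumxm_sketch(1, missing, 3)                         # scalar arguments
vectors = sumxm_sketch([1, missing, 3], [10, 20, missing])    # element-wise
```

Ordinary addition propagates missing, so plain is missing, while the sketch returns 4 for the scalars and [11, 20, 3] for the vectors.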
