DSUtils.BCDiag
DSUtils.accuracyplot
DSUtils.auroc
DSUtils.bcdiag
DSUtils.biasplot
DSUtils.concordance
DSUtils.cumliftable
DSUtils.cumliftcurve
DSUtils.infovalue
DSUtils.ksplot
DSUtils.kstest
DSUtils.liftable
DSUtils.liftcurve
DSUtils.onehot
DSUtils.onehot!
DSUtils.onehot!
DSUtils.ranks
DSUtils.rocplot
DSUtils.strstd
DSUtils.sumxm
DSUtils.BCDiag — Type
BCDiag
A structure of diagnostic properties of a binary classifier. Facilitates summary plots and tables.
DSUtils.accuracyplot — Method
accuracyplot(x::BCDiag; util=[1, 0, 0, 1])
Using util values for [TP, FN, FP, TN], produces an accuracy plot and its [max, argmax, argdep]. The default util values of [1, 0, 0, 1] give the standard accuracy value of (TP+TN)/N.
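The role of util can be illustrated with a small stand-alone sketch. The helper utility_accuracy is hypothetical and not part of DSUtils; it assumes the weighted accuracy is the util-weighted sum of the four counts divided by N, which matches the documented default of (TP+TN)/N.

```julia
# Hypothetical helper (not part of DSUtils): utility-weighted accuracy
# from the four confusion-matrix counts, ordered [TP, FN, FP, TN].
# Assumption: the weighted value is sum(util .* counts) / N.
function utility_accuracy(counts::AbstractVector{<:Real}; util = [1, 0, 0, 1])
    length(counts) == 4 || throw(ArgumentError("counts must be [TP, FN, FP, TN]"))
    return sum(counts .* util) / sum(counts)
end

counts = [40, 10, 5, 45]                                  # made-up TP, FN, FP, TN
acc = utility_accuracy(counts)                            # (40 + 45) / 100
weighted = utility_accuracy(counts; util = [1, 0, -1, 1]) # penalize false positives
```

With the default weights this reduces to plain accuracy; a negative weight on FP lowers the value whenever false positives are present.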
DSUtils.auroc — Method
auroc(class, var; tie = 1e-6)
Calculate the area under the Receiver Operating Characteristic (ROC) curve. class is a 2 level categorical variable, var is the distribution to analyze. Pair-wise comparisons of class 1 values with class 0 values are made as follows:
- class 1 value > class 0 value is Concordant
- class 1 value ≈ class 0 value (within tie) is Tied
- class 1 value < class 0 value is Discordant
Returns:
- concordant, number of concordant comparisons
- tied, number of tied comparisons
- discordant, number of discordant comparisons
- auc, or C, is (Concordant + 0.5 × Tied) / Total comparisons; same as numeric integration of the ROC curve
- gini, 2C-1, also known as Somers' D, is (Concordant - Discordant) / Total comparisons
Note there are other rank order measures:
- Goodman-Kruskal Gamma is (Concordant - Discordant) / (Concordant + Discordant), no penalty for ties
- Kendall's Tau is (Concordant - Discordant) / (0.5 N(N-1))
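The pair-wise comparison above can be sketched as a brute-force loop. This is a minimal illustration, not the DSUtils implementation: the function name concordance_counts is hypothetical, and the sketch assumes class is coded as 0 and 1.

```julia
# Hypothetical brute-force sketch (not the DSUtils implementation):
# compare every class 1 value with every class 0 value.
# Assumption: class is coded as 0/1.
function concordance_counts(class, var; tie = 1e-6)
    pos = [v for (c, v) in zip(class, var) if c == 1]
    neg = [v for (c, v) in zip(class, var) if c == 0]
    conc = tied = disc = 0
    for a in pos, b in neg
        if abs(a - b) <= tie
            tied += 1          # within the tolerance: Tied
        elseif a > b
            conc += 1          # class 1 value is larger: Concordant
        else
            disc += 1          # class 1 value is smaller: Discordant
        end
    end
    total = conc + tied + disc
    auc = (conc + 0.5 * tied) / total
    gini = (conc - disc) / total
    return (concordant = conc, tied = tied, discordant = disc, auc = auc, gini = gini)
end

class = [1, 1, 1, 0, 0]
var = [0.9, 0.6, 0.3, 0.6, 0.2]
res = concordance_counts(class, var)
```

The loop is quadratic in the number of observations, so it is only suitable for checking small examples by hand.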
DSUtils.bcdiag — Method
bcdiag(target, pred; groups = 100, rev = true, tie = 1e-6)
Perform diagnostics of a binary classifier. target is a 2 level categorical variable, pred is the probability of class 1. groups is the number of bins to use for plotting/printing. rev = true orders pred from high to low. tie is the tolerance of pred within which values are considered tied.
Returns a BCDiag struct which can be used for plotting or printing:
- biasplot is the calibration plot of target response rate vs. pred response rate
- ksplot produces the KS plot of cumulative distributions
- rocplot plots the Receiver Operating Characteristic curve
- accuracyplot plots the accuracy curve with adjustable utility
- liftcurve is the lift curve
- cumliftcurve is the cumulative lift curve
- liftable is the lift table as a DataFrame
- cumliftable is the cumulative lift table as a DataFrame
DSUtils.biasplot — Method
biasplot(x::BCDiag)
Returns a bias calibration plot of x: actual response vs. predicted response.
DSUtils.concordance — Method
concordance(class, var, tie)
Concordance calculation with a flexible tied region (auroc uses a fixed width region). class is a 2 level categorical variable, var is the distribution to analyze, and tie(x) returns the lower and upper bound of the tied region of x.
Pair-wise comparisons of class 1 values with class 0 values are made as follows: class 1 value > class 0 value is Concordant; class 1 value ≈ class 0 value (within tie) is Tied; class 1 value < class 0 value is Discordant.
Returns:
- concordant, number of concordant comparisons
- tied, number of tied comparisons
- discordant, number of discordant comparisons
- auroc, or C, is (Concordant + 0.5 × Tied) / Total comparisons; same as numeric integration of the ROC curve
- gini, 2C-1, also known as Somers' D, is (Concordant - Discordant) / Total comparisons
Note Goodman-Kruskal Gamma is (Concordant - Discordant) / (Concordant + Discordant), and Kendall's Tau is (Concordant - Discordant) / (0.5 × Total count × (Total count - 1)).
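The four rank-order measures named above differ only in their denominators. A minimal sketch of that arithmetic follows; rank_measures is a hypothetical helper, not a DSUtils function, and it takes the pair counts and the total count n as given.

```julia
# Hypothetical helper (not part of DSUtils): rank-order measures derived
# from concordant/tied/discordant pair counts and the total count n.
function rank_measures(conc, tied, disc, n)
    pairs = conc + tied + disc
    return (
        auroc = (conc + 0.5 * tied) / pairs,
        gini = (conc - disc) / pairs,                 # Somers' D, equals 2C - 1
        gamma = (conc - disc) / (conc + disc),        # ties dropped from the denominator
        tau = (conc - disc) / (0.5 * n * (n - 1)),    # all pairs among n observations
    )
end

# Made-up counts: 4 concordant, 1 tied, 1 discordant pair from 5 observations.
m = rank_measures(4, 1, 1, 5)
```

Because Gamma ignores tied pairs, it is never smaller in magnitude than Somers' D computed from the same counts.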
DSUtils.cumliftable — Method
cumliftable(x::BCDiag)
Returns a cumulative lift table of x as a DataFrame.
DSUtils.cumliftcurve — Method
cumliftcurve(x::BCDiag)
Returns a cumulative lift curve plot of x: cumulative actual and predicted vs. depth.
DSUtils.infovalue — Method
infovalue(g::Vector{Integer}, b::Vector{Integer})
Information value calculation of g and b, vectors of binned frequency counts:
- weight of evidence = log(density g / density b), 0 adjusted
- infovalue = sum over bins of (density g - density b) * weight of evidence
Industry rule of thumb:
- iv <= 0.1 no significant change
- 0.1 < iv <= 0.25 minor change
- 0.25 < iv major change
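The two formulas above can be sketched as follows. This is an illustration only: iv_sketch is a hypothetical name, and because the source does not say how the zero adjustment is done, the sketch simply rejects bins with zero counts rather than guessing.

```julia
# Hypothetical sketch (not the DSUtils implementation) of weight of
# evidence and information value for two vectors of binned counts.
# Assumption: zero counts are rejected; the "0 adjusted" handling
# mentioned in the docs is not reproduced here.
function iv_sketch(g::AbstractVector{<:Integer}, b::AbstractVector{<:Integer})
    length(g) == length(b) || throw(DimensionMismatch("g and b need the same number of bins"))
    (all(>(0), g) && all(>(0), b)) || throw(ArgumentError("this sketch does not handle zero counts"))
    dg = g ./ sum(g)            # density of g per bin
    db = b ./ sum(b)            # density of b per bin
    woe = log.(dg ./ db)        # weight of evidence per bin
    return sum((dg .- db) .* woe)
end

g = [10, 20, 70]                # made-up bin counts
b = [30, 30, 40]
iv = iv_sketch(g, b)
```

For these made-up counts the value lands in the "major change" band of the rule of thumb, while two identical distributions give exactly 0.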
DSUtils.ksplot — Method
ksplot(x::BCDiag)
Returns a KS plot of x: CDF1 (True Positive) and CDF0 (False Positive) versus depth.
DSUtils.kstest — Method
kstest(class, var; rev = true)
Calculate the empirical 2 sample Kolmogorov-Smirnov statistic and its location. class is a 2 level categorical variable, var is the distribution to analyze.
Returns:
- n, total number of observations
- n1, number of observations of class 1
- n0, number of observations of class 0
- baserate, incidence rate of class 1
- ks, the maximum separation between the two cumulative distributions
- ksarg, the value of var at which maximum separation is achieved
- ksdep, depth of ksarg in the sorted values of var; rev = true counts depth from high value towards low value.
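The statistic itself can be sketched directly from the two empirical cumulative distributions. The function ks_sketch below is hypothetical and not the DSUtils code; it assumes class is coded as 0 and 1, and it leaves out ksdep and the rev option.

```julia
# Hypothetical sketch (not the DSUtils implementation) of the empirical
# two-sample KS statistic. Assumption: class is coded as 0/1.
# ksdep and the rev keyword are not reproduced.
function ks_sketch(class, var)
    n1 = count(==(1), class)
    n0 = count(==(0), class)
    ks, ksarg = 0.0, first(var)
    for v in sort(unique(var))
        # empirical CDF of each class evaluated at v
        cdf1 = count(i -> class[i] == 1 && var[i] <= v, eachindex(var)) / n1
        cdf0 = count(i -> class[i] == 0 && var[i] <= v, eachindex(var)) / n0
        d = abs(cdf1 - cdf0)
        if d > ks
            ks, ksarg = d, v
        end
    end
    return (n = n1 + n0, n1 = n1, n0 = n0, baserate = n1 / (n1 + n0), ks = ks, ksarg = ksarg)
end

class = [1, 1, 1, 1, 0, 0, 0, 0]
var = [0.9, 0.8, 0.7, 0.3, 0.6, 0.4, 0.2, 0.1]
res = ks_sketch(class, var)
```

In this made-up example all class 0 values but none of the top three class 1 values lie at or below 0.6, which is where the separation peaks.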
DSUtils.liftable — Method
liftable(x::BCDiag)
Returns a lift table of x as a DataFrame.
DSUtils.liftcurve — Method
liftcurve(x::BCDiag)
Returns a lift curve plot of x: actual and predicted versus depth.
DSUtils.onehot! — Method
onehot!(df::AbstractDataFrame, vars::Vector{Symbol})
One hot encode the variables in vars in df.
DSUtils.onehot! — Method
onehot!(df::AbstractDataFrame, var::Symbol)
One hot encode the unique values of var in df. Each new variable name is constructed as var_lvl. If df.var is a string, it is standardized first via strstd before one hot encoding.
DSUtils.onehot — Method
onehot(lvl, x::Vector)
Create an indicator variable of x equal to lvl. lvl and x can have missing values; ismissing is used where necessary to ensure a correct result.
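The need for ismissing comes from Julia's three-valued logic, where missing == "a" evaluates to missing rather than false. A minimal sketch of a missing-aware indicator follows; the name indicator is hypothetical, and both the Bool element type and the treatment of a missing lvl as "flag the missing entries" are assumptions rather than documented DSUtils behavior.

```julia
# Hypothetical sketch (not the DSUtils code) of a missing-aware indicator.
# Assumptions: the result is a Vector{Bool}, and a missing lvl flags the
# missing entries of x.
function indicator(lvl, x::AbstractVector)
    if ismissing(lvl)
        return [ismissing(v) for v in x]
    end
    # Guard with ismissing so that `v == lvl` never propagates missing.
    return [!ismissing(v) && v == lvl for v in x]
end

x = ["a", missing, "b", "a"]
is_a = indicator("a", x)
is_missing = indicator(missing, x)
```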
DSUtils.ranks — Method
ranks(x; groups = 10, rank = tiedrank, rev = false)
Return a variable which bins x into groups number of bins. The rank keyword allows a different ranking method; use rev = true to reverse the sort so that a small bin number corresponds to a large value of x. Missing values are assigned to group missing.
The default values of rank = tiedrank and rev = false result in a grouping similar to SAS PROC RANK groups=n tied=mean.
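For orientation, the SAS grouping that the defaults are compared with assigns each value to floor(rank × groups / (n + 1)), with tied values sharing their mean rank. The sketch below implements that formula only; whether DSUtils uses exactly the same expression is an assumption, and the names tiedrank_sketch and bins_sketch are hypothetical. Missing values and the rev option are not handled.

```julia
# Hypothetical sketch of SAS-style grouping: floor(rank * groups / (n + 1))
# with mean ranks for ties. Not the DSUtils implementation.
function tiedrank_sketch(x)
    n = length(x)
    p = sortperm(x)
    r = zeros(Float64, n)
    i = 1
    while i <= n
        j = i
        # extend j over the run of values equal to x[p[i]]
        while j < n && x[p[j + 1]] == x[p[i]]
            j += 1
        end
        avg = (i + j) / 2              # mean of the ranks i..j
        for k in i:j
            r[p[k]] = avg
        end
        i = j + 1
    end
    return r
end

bins_sketch(x; groups = 10) =
    floor.(Int, tiedrank_sketch(x) .* groups ./ (length(x) + 1))

x = [10, 20, 20, 30, 40, 50, 60, 70, 80]
bins = bins_sketch(x; groups = 3)
```

Bin numbers from this formula start at 0, and tied values always fall into the same bin because they share one rank.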
DSUtils.rocplot — Method
rocplot(x::BCDiag)
Returns a ROC plot of x: CDF1 (True Positive) vs. CDF0 (False Positive).
DSUtils.strstd — Method
strstd(s; empty = "")
Standardizes string s by stripping leading and trailing blanks; embedded blanks are replaced with underscore _. If s is missing or all blanks, the result is the value of empty.
The default of empty is "", so missing values and all-blank strings become "".
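The described behavior can be sketched in a few lines of Base Julia. The function strstd_sketch is hypothetical and not the DSUtils code; it assumes every embedded space becomes one underscore, and it relies on strip, which removes all leading and trailing whitespace rather than blanks only.

```julia
# Hypothetical sketch (not the DSUtils code) of the standardization.
# Assumptions: each embedded space maps to one underscore; `strip`
# removes any leading/trailing whitespace, not only blanks.
function strstd_sketch(s; empty = "")
    ismissing(s) && return empty
    t = strip(s)
    isempty(t) && return empty
    return replace(t, " " => "_")
end

a = strstd_sketch("  New York ")
b = strstd_sketch("   ")
c = strstd_sketch(missing; empty = "NA")
```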
DSUtils.sumxm — Method
sumxm(x...)
Returns the sum of x, treating missing values as 0. x can be a varying number of scalars, in which case their sum is returned. If x is a varying number of vectors, they are summed element-wise across the vectors.
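Both the scalar and the vector case can be sketched with broadcasting and coalesce from Base. The name sumxm_sketch is hypothetical and this is not the DSUtils implementation; it only illustrates the documented behavior.

```julia
# Hypothetical sketch (not the DSUtils code): sum the arguments with
# missing replaced by 0. Broadcasting makes the same expression work
# for scalars and for element-wise sums of equal-length vectors.
sumxm_sketch(x...) = reduce((a, b) -> coalesce.(a, 0) .+ coalesce.(b, 0), x; init = 0)

s = sumxm_sketch(1, missing, 3)                              # scalars
v = sumxm_sketch([1, missing, 3], [missing, missing, 4])     # vectors, element-wise
```

A plain sum over the same arguments would return missing in the scalar case, which is the situation this kind of helper is meant to avoid.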