DSUtils.BCDiag (Type)
BCDiag

A structure holding the diagnostic properties of a binary classifier. It facilitates summary plots and tables.

DSUtils.accuracyplot (Method)
accuracyplot(x::BCDiag; util=[1, 0, 0, 1])

Produce an accuracy plot, together with its [max, argmax, argdep], using util as the utility values for [TP, FN, FP, TN]. The default util of [1, 0, 0, 1] gives the standard accuracy, (TP + TN) / N.
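The utility weighting can be illustrated with a small sketch in plain Julia. This is not the DSUtils implementation: weighted_accuracy is a hypothetical helper, target is assumed to be coded 0/1, an observation is assumed to be classified positive when pred >= threshold, and the curve is assumed to be the utility-weighted count divided by N (which matches the documented default case).

```julia
# Hypothetical helper: utility-weighted accuracy at one threshold.
# util is ordered as [TP, FN, FP, TN], as in the accuracyplot signature.
function weighted_accuracy(target, pred, threshold; util = [1, 0, 0, 1])
    tp = fn = fp = tn = 0
    for (y, p) in zip(target, pred)
        predicted = p >= threshold          # assumed classification rule
        if y == 1
            predicted ? (tp += 1) : (fn += 1)
        else
            predicted ? (fp += 1) : (tn += 1)
        end
    end
    return (util[1] * tp + util[2] * fn + util[3] * fp + util[4] * tn) / length(target)
end

target = [1, 1, 0, 1, 0, 0]
pred = [0.9, 0.8, 0.7, 0.4, 0.3, 0.1]

# Evaluate the curve at every distinct score, from high to low
thresholds = sort(unique(pred); rev = true)
acc = [weighted_accuracy(target, pred, t) for t in thresholds]

# Maximum of the curve and the threshold where it is first reached
best, idx = findmax(acc)
best_threshold = thresholds[idx]
```

Passing a different util, for example one that penalizes false positives with a negative weight, changes where the maximum falls without changing the rest of the calculation.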

DSUtils.auroc (Method)
auroc(class, var; tie = 1e-6)

Calculate the area under the Receiver Operating Characteristic (ROC) curve. class is a 2-level categorical variable; var is the distribution to analyze. Each class 1 value is compared pairwise with each class 0 value as follows:

  • class 1 value > class 0 value is Concordant
  • class 1 value ≈ class 0 value (within tie) is Tied
  • class 1 value < class 0 value is Discordant

Returns:

  • concordant, number of concordant comparisons
  • tied, number of tied comparisons
  • discordant, number of discordant comparisons
  • auc, or C, is (Concordant + 0.5 × Tied) / Total comparisons; same as numeric integration of the ROC curve
  • gini, 2C - 1, also known as Somers' D, is (Concordant - Discordant) / Total comparisons
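The pairwise definition above can be written out directly. The following is a minimal brute-force sketch, not the package's code: pairwise_auc is a hypothetical name, class is assumed to be coded 0/1, and a pair is assumed tied when the absolute difference is at most tie.

```julia
# Hypothetical brute-force version of the pairwise comparison.
function pairwise_auc(class, var; tie = 1e-6)
    v1 = [v for (c, v) in zip(class, var) if c == 1]   # class 1 values
    v0 = [v for (c, v) in zip(class, var) if c == 0]   # class 0 values
    concordant = tied = discordant = 0
    for a in v1, b in v0
        if abs(a - b) <= tie        # assumed: boundary counts as tied
            tied += 1
        elseif a > b
            concordant += 1
        else
            discordant += 1
        end
    end
    total = concordant + tied + discordant
    auc = (concordant + 0.5 * tied) / total
    gini = (concordant - discordant) / total
    return (concordant = concordant, tied = tied, discordant = discordant,
            auc = auc, gini = gini)
end

class = [1, 1, 1, 0, 0]
var = [0.9, 0.6, 0.4, 0.6, 0.2]
res = pairwise_auc(class, var)
```

With these five observations there are 3 × 2 = 6 comparisons: 4 concordant, 1 tied and 1 discordant. The loop is quadratic in the number of observations, so it is only meant to make the definition concrete.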

Note there are other rank order measures:

  • Goodman-Kruskal Gamma is (Concordant - Discordant) / (Concordant + Discordant), no penalty for ties
  • Kendall's Tau is (Concordant - Discordant) / (0.5 N(N-1)), where N is the total count of observations

DSUtils.bcdiag (Method)
bcdiag(target, pred; groups = 100, rev = true, tie = 1e-6)

Perform diagnostics of a binary classifier. target is a 2-level categorical variable; pred is the predicted probability of class 1. groups is the number of bins to use for plotting and printing. rev = true orders pred from high to low. tie is the tolerance within which pred values are considered tied.

Returns a BCDiag struct which can be used for plotting or printing:

  • biasplot is the calibration plot of target response rate vs. pred response rate
  • ksplot plots the KS plot of the cumulative distributions
  • rocplot plots the Receiver Operating Characteristics curve
  • accuracyplot plots the accuracy curve with adjustable utility
  • liftcurve is the lift curve
  • cumliftcurve is the cumulative lift curve
  • liftable is the lift table as a DataFrame
  • cumliftable is the cumulative lift table as a DataFrame

DSUtils.biasplot (Method)
biasplot(x::BCDiag)

Returns a bias calibration plot of x: actual response vs. predicted response.

DSUtils.concordance (Method)
concordance(class, var, tie)

Concordance calculation with a flexible tied region (auroc uses a fixed-width region). class is a 2-level categorical variable; var is the distribution to analyze; tie(x) returns the lower and upper bounds of the tied region of x.

Pair-wise comparison between class 1 values with class 0 values are made as follows: class 1 value > class 0 value is Concordant; class 1 value ≈ class 0 value (within tie) is Tied; class 1 value < class 0 value is Discordant.

Returns:

  • concordant, number of concordant comparisons
  • tied, number of tied comparisons
  • discordant, number of discordant comparisons
  • auroc, or C, is (Concordant + 0.5 × Tied) / Total comparisons; same as numeric integration of the ROC curve
  • gini, 2C - 1, also known as Somers' D, is (Concordant - Discordant) / Total comparisons

Note that Goodman-Kruskal Gamma is (Concordant - Discordant) / (Concordant + Discordant), and Kendall's Tau is (Concordant - Discordant) / (0.5 × Total count × (Total count - 1)).
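To show what a flexible tied region might look like, here is a hedged sketch with a relative tolerance of ±10%. It is not taken from the package: flexible_concordance and reltie are hypothetical names, class is assumed to be coded 0/1, and tie is assumed to be applied to each class 1 value, with the bounds themselves counted as tied.

```julia
# Hypothetical sketch: concordance with a caller-supplied tied region.
# tie(x) must return a (lower, upper) tuple.
function flexible_concordance(class, var, tie)
    v1 = [v for (c, v) in zip(class, var) if c == 1]
    v0 = [v for (c, v) in zip(class, var) if c == 0]
    concordant = tied = discordant = 0
    for a in v1
        lo, hi = tie(a)             # assumed: region is built around the class 1 value
        for b in v0
            if lo <= b <= hi
                tied += 1
            elseif b < lo
                concordant += 1     # class 1 value is clearly larger
            else
                discordant += 1     # class 1 value is clearly smaller
            end
        end
    end
    total = concordant + tied + discordant
    return (concordant = concordant, tied = tied, discordant = discordant,
            auroc = (concordant + 0.5 * tied) / total,
            gini = (concordant - discordant) / total,
            gamma = (concordant - discordant) / (concordant + discordant))
end

# Relative tolerance: values within 10% of x are treated as tied
reltie(x) = (0.9 * x, 1.1 * x)

class = [1, 1, 0, 0, 0, 0]
var = [10.0, 100.0, 9.5, 50.0, 105.0, 200.0]
res = flexible_concordance(class, var, reltie)
```

A relative region like this widens with the magnitude of the value, which a fixed-width tolerance cannot do. Gamma is included only to show that it ignores the tied comparisons in its denominator.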

DSUtils.cumliftcurve (Method)
cumliftcurve(x::BCDiag)

Returns a cumulative lift curve plot of x: cumulative actual and predicted vs. depth.

DSUtils.infovalue (Method)
infovalue(g::Vector{Integer}, b::Vector{Integer})

Calculates the information value of g and b, two vectors of binned frequency counts.

  • weight of evidence = log(density g / density b), 0 adjusted
  • infovalue = sum (density g - density b) * weight of evidence
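The two formulas above can be sketched in a few lines. This is an illustration rather than the package's code: iv_sketch is a hypothetical name, and the zero adjustment shown here (replacing a zero count with 0.5) is an assumption, since the text does not say how zeros are adjusted.

```julia
# Hypothetical sketch of the information value formula.
function iv_sketch(g, b; adj = 0.5)
    # Assumed zero adjustment: a zero count is replaced by adj so the log is finite
    g1 = [x == 0 ? adj : float(x) for x in g]
    b1 = [x == 0 ? adj : float(x) for x in b]
    dg = g1 ./ sum(g1)              # density of g over the bins
    db = b1 ./ sum(b1)              # density of b over the bins
    woe = log.(dg ./ db)            # weight of evidence per bin
    return sum((dg .- db) .* woe)
end

iv_same = iv_sketch([10, 20, 30], [10, 20, 30])   # identical distributions
iv_diff = iv_sketch([50, 50], [80, 20])           # shifted distribution
```

Each term (density g - density b) × log(density g / density b) is non-negative, so the information value is zero only when the two binned distributions coincide.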

Industry rule of thumb:

  • iv <= 0.1 no significant change
  • 0.1 < iv <= 0.25 minor change
  • iv > 0.25 major change

DSUtils.ksplot (Method)
ksplot(x::BCDiag)

Returns a KS plot of x: CDF1 (True Positive) and CDF0 (False Positive) versus depth.

DSUtils.kstest (Method)
kstest(class, var; rev = true)

Calculate the empirical two-sample Kolmogorov-Smirnov statistic and its location. class is a 2-level categorical variable; var is the distribution to analyze.

Returns:

  • n, total number of observations
  • n1, number of observations of class 1
  • n0, number of observations of class 0
  • baserate, incidence rate of class 1
  • ks, the maximum separation between the two cumulative distributions
  • ksarg, the value of var at which maximum separation is achieved
  • ksdep, depth of ksarg in the sorted values of var

rev = true counts depth from high value towards low value.
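The returned quantities can be illustrated with a short sketch in plain Julia. It is not the package's implementation: ks_sketch is a hypothetical name, class is assumed to be coded 0/1, and depth is reported here as a count of observations, which is an assumption because the text does not say whether ksdep is a count or a fraction.

```julia
# Hypothetical sketch of the two-sample KS statistic and its location.
function ks_sketch(class, var; rev = true)
    order = sortperm(var; rev = rev)        # rev = true walks from high to low
    n = length(class)
    n1 = count(==(1), class)
    n0 = n - n1
    c1 = c0 = 0                             # running counts of each class
    ks, ksarg, ksdep = 0.0, var[order[1]], 0
    for (depth, i) in enumerate(order)
        class[i] == 1 ? (c1 += 1) : (c0 += 1)
        # Compare the two CDFs only once a run of equal values is complete
        if depth == n || var[order[depth + 1]] != var[i]
            d = abs(c1 / n1 - c0 / n0)
            if d > ks
                ks, ksarg, ksdep = d, var[i], depth
            end
        end
    end
    return (n = n, n1 = n1, n0 = n0, baserate = n1 / n,
            ks = ks, ksarg = ksarg, ksdep = ksdep)
end

class = [1, 1, 1, 0, 1, 0, 0, 0]
var = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2]
res = ks_sketch(class, var)
```

In this example the separation first reaches its maximum of 0.75 after the third highest value, so ksarg is 0.7 and the depth is 3.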

DSUtils.liftcurve (Method)
liftcurve(x::BCDiag)

Returns a lift curve plot of x: actual and predicted versus depth.

DSUtils.onehot! (Method)
onehot!(df::AbstractDataFrame, vars::Vector{Symbol})

One-hot encode each variable listed in vars in df.

DSUtils.onehot! (Method)
onehot!(df::AbstractDataFrame, var::Symbol)

One-hot encode the unique values of var in df. Each new variable name is constructed as var_lvl. If df.var is a string column, it is standardized first via strstd before one-hot encoding.

DSUtils.onehot (Method)
onehot(lvl, x::Vector)

Create an indicator variable for x equal to lvl. Both lvl and x can contain missing values; ismissing is used where necessary to ensure a correct result.
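One way the missing handling could work is sketched below. The semantics are an assumption, not taken from the package: a missing lvl is assumed to flag the missing elements of x, and a missing element is assumed never to match a non-missing lvl. indicator_sketch is a hypothetical name.

```julia
# Hypothetical sketch of an indicator that is safe with missing values.
function indicator_sketch(lvl, x::Vector)
    if ismissing(lvl)
        return ismissing.(x)        # assumed: missing level flags missing elements
    end
    # A plain x .== lvl would yield missing for missing elements,
    # so each element is checked with ismissing first
    return [!ismissing(v) && v == lvl for v in x]
end

x = ["a", missing, "b", "a"]
ind_a = indicator_sketch("a", x)
ind_missing = indicator_sketch(missing, x)
```

The point of the guard is that the result is always a Bool vector, so it can be stored as a regular indicator column.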

DSUtils.ranks (Method)
ranks(x; groups = 10, rank = tiedrank, rev = false)

Return a variable that bins x into groups bins. The rank keyword selects the ranking method; use rev = true to reverse the sort so that a small bin number corresponds to large values of x. Missing values are assigned to group missing.

The defaults rank = tiedrank and rev = false give a grouping similar to SAS PROC RANK groups=n tied=mean.
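For reference, the SAS grouping rule assigns group floor(rank × groups / (n + 1)), with tied values sharing their mean rank. The sketch below reproduces that rule in plain Julia; it is not the ranks implementation, and meanrank and sasgroups are hypothetical names. It ignores missing values and the rev keyword, and its zero-based group numbers follow SAS, which may differ from what ranks returns.

```julia
# Mean rank for tied values, written without external packages
function meanrank(x)
    p = sortperm(x)
    r = zeros(length(x))
    i = 1
    while i <= length(x)
        j = i
        # Extend j over the run of values equal to x[p[i]]
        while j < length(x) && x[p[j + 1]] == x[p[i]]
            j += 1
        end
        r[p[i:j]] .= (i + j) / 2    # every tied value gets the average rank
        i = j + 1
    end
    return r
end

# SAS-style grouping: floor(rank * groups / (n + 1)), zero-based
function sasgroups(x; groups = 10)
    r = meanrank(x)
    return floor.(Int, r .* groups ./ (length(x) + 1))
end

x = [10, 20, 20, 30, 40]
r = meanrank(x)
g = sasgroups(x; groups = 2)
```

Because tied values share one rank, they always land in the same group, which is why bin sizes can be uneven when x has many ties.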

DSUtils.rocplot (Method)
rocplot(x::BCDiag)

Returns a ROC plot of x: CDF1 (True Positive) vs. CDF0 (False Positive).

DSUtils.strstd (Method)
strstd(s; empty = "")

Standardizes string s by stripping leading and trailing blanks and replacing embedded blanks with an underscore _. If s is missing or all blanks, the result is the value of empty.

By default empty is "", so missing values and all-blank strings become "".
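The described behavior can be sketched as follows. This is an illustration, not the package's code: strstd_sketch is a hypothetical name, and replacing each blank individually (rather than collapsing a run of blanks into one underscore) is an assumption.

```julia
# Hypothetical sketch of the standardization rule.
function strstd_sketch(s; empty = "")
    ismissing(s) && return empty
    t = strip(s)                    # drop leading and trailing blanks
    isempty(t) && return empty      # all-blank input
    return replace(t, ' ' => '_')   # assumed: every embedded blank becomes "_"
end

a = strstd_sketch("  New York  ")
b = strstd_sketch(missing)
c = strstd_sketch("   "; empty = "NA")
```

Mapping missing and blank strings to one value keeps them in a single level when the result feeds a one-hot encoding.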

DSUtils.sumxm (Method)
sumxm(x...)    # sum treating missings as 0

Returns the sum of x, treating missing values as 0. If x is a varying number of scalars, their sum is returned. If x is a varying number of vectors, they are summed element-wise across the vectors.
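Both cases can be covered by one broadcasted expression, as in this minimal sketch. It is not the package's implementation, and sumxm_sketch is a hypothetical name; it relies only on coalesce and broadcasting from Base.

```julia
# Hypothetical sketch: sum the arguments, treating missing as 0.
function sumxm_sketch(x...)
    total = 0
    for xi in x
        # coalesce.(xi, 0) replaces missing with 0, element-wise for vectors;
        # broadcasting makes the same line work for scalars
        total = total .+ coalesce.(xi, 0)
    end
    return total
end

s = sumxm_sketch(1, missing, 3)
v = sumxm_sketch([1, missing], [missing, 2], [3, 4])
```

A plain + would propagate missing, so a single missing argument would make the whole result missing; replacing it with 0 first avoids that.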
