6 Methods

This section provides details on the implementation of the Naive Bayes Classifier algorithm, definition of uncommon terms, and calculation of performance metrics.

6.1 Naive Bayes Classifier

The Naive Bayes Classifier (NBC) is a machine learning algorithm that uses training data containing cases of death to learn the probabilities of known causes of death given a set of symptoms. The resulting model can then use these learned probabilities to predict the cause of death for unseen testing cases with the same symptom set.

The nbc4va package implements the NBC algorithm for verbal autopsy data using code and methods built on Miasnikof et al. (2015).
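The learning and prediction steps described above can be sketched in Python. This is a generic illustration of a Naive Bayes classifier for binary symptoms, not the nbc4va implementation; the function names, data layout, and the use of Laplace smoothing are assumptions for the sketch:

```python
import math
from collections import Counter, defaultdict

def train_nbc(cases):
    """Learn P(cause) and P(symptom = 1 | cause) from training cases.
    Each case is (cause, [0/1 symptom vector]). Laplace smoothing
    avoids zero probabilities for unseen symptom/cause combinations."""
    cause_counts = Counter(cause for cause, _ in cases)
    n_symptoms = len(cases[0][1])
    present = defaultdict(lambda: [0] * n_symptoms)
    for cause, symptoms in cases:
        for i, s in enumerate(symptoms):
            present[cause][i] += s
    priors = {c: n / len(cases) for c, n in cause_counts.items()}
    likelihoods = {c: [(present[c][i] + 1) / (cause_counts[c] + 2)
                       for i in range(n_symptoms)] for c in cause_counts}
    return priors, likelihoods

def predict(priors, likelihoods, symptoms):
    """Return the cause with the highest posterior score for one case."""
    scores = {}
    for cause, prior in priors.items():
        score = math.log(prior)
        for i, s in enumerate(symptoms):
            p = likelihoods[cause][i]
            score += math.log(p if s else 1 - p)
        scores[cause] = score
    return max(scores, key=scores.get)

# Toy example: two causes ("A", "B") and three binary symptoms.
train = [("A", [1, 0, 1]), ("A", [1, 1, 1]),
         ("B", [0, 1, 0]), ("B", [0, 0, 0])]
priors, lik = train_nbc(train)
print(predict(priors, lik, [1, 0, 1]))  # → A
```

The log-probability sum is used instead of a raw product only to avoid numerical underflow when many symptoms are present; the predicted cause is the same either way.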

6.2 Terms for Data

Symptom: Refers to the features or independent variables, with binary values of 1 for presence and 0 for absence of a death-related condition.

Cause: Refers to the target or dependent variable containing discrete values of the causes of death.

Case: Refers to an individual death containing an identifier, a cause of death (if known), and several symptoms.

Training Data: Refers to a dataset of cases that the NBC algorithm learns probabilities from.

Testing Data: Refers to a dataset of cases used to evaluate the performance of a NBC model; these cases must have the same symptom set as the Training Data, but must be different cases.

6.3 Terms for Metrics

True Positives: The number of cases, given a cause, where the predicted cause is the cause and the actual observed cause is also the cause (Fawcett, 2005).

True Negatives: The number of cases, given a cause, where the predicted cause is not the cause and the actual observed cause is also not the cause (Fawcett, 2005).

False Positives: The number of cases, given a cause, where the predicted cause is the cause but the actual observed cause is not the cause (Fawcett, 2005).

False Negatives: The number of cases, given a cause, where the predicted cause is not the cause but the actual observed cause is the cause (Fawcett, 2005).
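The four counts above treat one cause at a time as the "positive" class. A minimal sketch in Python (the function name and list-based inputs are assumptions for illustration, not the package's interface):

```python
def confusion_counts(true_causes, pred_causes, cause):
    """TP, TN, FP, FN for one cause, treating that cause as the positive class."""
    pairs = list(zip(true_causes, pred_causes))
    tp = sum(t == cause and p == cause for t, p in pairs)
    tn = sum(t != cause and p != cause for t, p in pairs)
    fp = sum(t != cause and p == cause for t, p in pairs)
    fn = sum(t == cause and p != cause for t, p in pairs)
    return tp, tn, fp, fn

true = ["A", "A", "B", "C"]
pred = ["A", "B", "B", "A"]
print(confusion_counts(true, pred, "A"))  # → (1, 1, 1, 1)
```

Note that the four counts always sum to the total number of cases, since every case falls into exactly one category for a given cause.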

CSMF: Cause Specific Mortality Fraction; the fraction of deaths (predicted or observed) attributed to a particular cause.

6.4 Calculation of Metrics at the Individual Level

The following metrics measure the performance of a model by comparing its predicted causes individually to the matching true/observed causes.

Sensitivity: the proportion of actual positives that are correctly identified (Powers, 2011).

\[ Sensitivity = \frac{TP}{TP+FN} \]

where:

  • \(TP\) is the number of true positives
  • \(FN\) is the number of false negatives
  • This metric measures a model’s ability to correctly predict causes of death

PCCC: partial chance corrected concordance (Murray et al 2011).

\[ PCCC(k) = \frac{C-\frac{k}{N}}{1-\frac{k}{N}} \]

where:

  • \(C\) is the fraction of deaths where the true cause is in the top \(k\) causes assigned to that death
  • \(k\) is the number of top causes (constant of 1 in this package)
  • \(N\) is the number of causes in the study
  • This metric measures how much better a model is than random assignment
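The PCCC formula can be sketched as follows, assuming each death carries a list of its top-k predicted causes (the function name and input shapes are assumptions for illustration):

```python
def pccc(true_causes, top_k_preds, n_causes):
    """PCCC(k) = (C - k/N) / (1 - k/N), where C is the fraction of deaths
    whose true cause appears among the top k predicted causes and N is
    the number of causes in the study."""
    k = len(top_k_preds[0])
    c = sum(t in preds for t, preds in zip(true_causes, top_k_preds)) / len(true_causes)
    return (c - k / n_causes) / (1 - k / n_causes)

# k = 1 (as in this package), 4 deaths, 5 possible causes.
true = ["A", "B", "C", "D"]
preds = [["A"], ["B"], ["A"], ["A"]]
print(pccc(true, preds, 5))  # C = 0.5 → (0.5 - 0.2) / (1 - 0.2) = 0.375
```

A PCCC of 0 means the model does no better than assigning causes at random, and 1 means every true cause appears in the top k predictions.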

6.5 Calculation of Metrics at the Population Level

The following metrics measure the performance of a model by comparing its distribution of cause predictions to a distribution of true/observed causes for similar cases.

CSMFmaxError: cause specific mortality fraction maximum error (Murray et al 2011).

\[ CSMF\ Maximum\ Error = 2\left(1-\min_{j}\left(CSMF_{j}^{true}\right)\right) \]

where:

  • \(j\) is a true/observed cause
  • \(CSMF_{j}^{true}\) is the true/observed CSMF for cause \(j\)
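A sketch of the CSMF maximum error in Python, including a small helper that computes the CSMF distribution from a list of causes (the function names are illustrative assumptions):

```python
from collections import Counter

def csmf(causes):
    """Fraction of deaths per cause."""
    n = len(causes)
    return {c: k / n for c, k in Counter(causes).items()}

def csmf_max_error(true_causes):
    """2 * (1 - min_j CSMF_j^true): the largest possible total CSMF error."""
    return 2 * (1 - min(csmf(true_causes).values()))

true = ["A", "A", "A", "B"]   # CSMFs: A = 0.75, B = 0.25
print(csmf_max_error(true))   # → 2 * (1 - 0.25) = 1.5
```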

CSMFaccuracy: cause specific mortality fraction accuracy (Murray et al 2011).

\[ CSMFAccuracy = 1-\frac{\sum_{j=1}^{k} |CSMF_{j}^{true} - CSMF_{j}^{pred}|}{CSMF Maximum Error} \]

where:

  • \(j\) is a cause
  • \(CSMF_{j}^{true}\) is the true/observed CSMF for cause \(j\)
  • \(CSMF_{j}^{pred}\) is the predicted CSMF for cause \(j\)
  • Values range from 0 to 1, where 1 means no error in the predicted CSMFs and 0 means complete error in the predicted CSMFs
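Combining the two formulas above, CSMF accuracy can be sketched as follows (the function names and list-based inputs are assumptions for illustration):

```python
from collections import Counter

def csmf(causes):
    """Fraction of deaths per cause."""
    n = len(causes)
    return {c: k / n for c, k in Counter(causes).items()}

def csmf_accuracy(true_causes, pred_causes):
    """1 - sum_j |CSMF_j^true - CSMF_j^pred| / (2 * (1 - min_j CSMF_j^true))."""
    t, p = csmf(true_causes), csmf(pred_causes)
    causes = set(t) | set(p)
    abs_err = sum(abs(t.get(c, 0.0) - p.get(c, 0.0)) for c in causes)
    max_err = 2 * (1 - min(t.values()))
    return 1 - abs_err / max_err

true = ["A", "A", "A", "B"]
pred = ["A", "A", "B", "B"]   # total error: |0.75 - 0.5| + |0.25 - 0.5| = 0.5
print(round(csmf_accuracy(true, pred), 4))  # → 1 - 0.5 / 1.5 ≈ 0.6667
```

Because the error is normalized by the maximum possible error for the true distribution, a model that predicted every death as the single rarest true cause would score exactly 0.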

6.6 References for Methods

  • Fawcett T. An introduction to ROC analysis. Pattern Recognition Letters[Internet]. 2005 Dec 19[cited 2016 Apr 29];27(8):861-874. Available from: http://people.inf.elte.hu/kiss/13dwhdm/roc.pdf
  • Miasnikof P, Giannakeas V, Gomes M, Aleksandrowicz L, Shestopaloff AY, Alam D, Tollman S, Samarikhalaj, Jha P. Naive Bayes classifiers for verbal autopsies: comparison to physician-based classification for 21,000 child and adult deaths. BMC Medicine. 2015;13:286. 10.1186/s12916-015-0521-2.
  • Murray CJL, Lozano R, Flaxman AD, Vahdatpour A, Lopez AD. Robust metrics for assessing the performance of different verbal autopsy cause assignment methods in validation studies. Popul Health Metr. 2011;9:28. 10.1186/1478-7954-9-28.
  • Powers DMW. Evaluation: from precision, recall and F-measure to ROC, informedness, markedness & correlation. Journal of Machine Learning Technologies. 2011;2(1):37-63.