We illustrate how to use R package Mposterior and its C++ version for estimating a median in the space of probability measures that was proposed as M-posterior in Minsker et al., 2014. The practical motivation and theoretical justification of M-posterior are beyond the scope of these notes; we (briefly) illustrate how to use the Mposterior package and code under the assumption that samples from the subset posterior measures are available, which might be not be a good assumption in certain non-trivial high-dimensional cases. The tarball for Mposterior package is available here and the C++ code is available in color and in plain text. For using the R package, Rcpp and RcppArmadillo must be installed apriori; for using the C++ code, please install Armadillo linear algebra library.
Throughout these notes we assume that the number of subsets m = 5 and each subset posterior measure has n = 100 atoms; each subset posterior measure is simulated from a 2-dimensional Gaussian measure with an 2 by 2 identity covariance matrix and varying means. All code are enclosed in
and the outputs are enclosed in
We first simulate data for 5 subset posterior measures from five different bi-variate Gaussian densities; each subset posterior measure has 100 atoms and identity covariance matrix.
set.seed(12345) write.table(cbind(rnorm(100, mean = 1), rnorm(100, mean = 1)), file = "atom1.txt", sep = "\t", row.names = FALSE, col.names = FALSE) write.table(cbind(rnorm(100, mean = -1), rnorm(100, mean = -1)), file = "atom2.txt", sep = "\t", row.names = FALSE, col.names = FALSE) write.table(cbind(rnorm(100, mean = -1), rnorm(100, mean = 1)), file = "atom3.txt", sep = "\t", row.names = FALSE, col.names = FALSE) write.table(cbind(rnorm(100, mean = 1), rnorm(100, mean = -1)), file = "atom4.txt", sep = "\t", row.names = FALSE, col.names = FALSE) write.table(cbind(rnorm(100, mean = 2), rnorm(100, mean = 2)), file = "atom5.txt", sep = "\t", row.names = FALSE, col.names = FALSE)
The files atom[1-5].txt will be used by both R and C++ code of Mposterior.
We first create a list subAtomList of length 5 and store the simulated data atom[1-5].txt as its elements. Although we restrict equal number of atoms in each subset, the R package can handle any number of atoms and dimensions in the subset posterior measures.
subAtomList <- vector("list", 5) subAtomList[] <- as.matrix(read.table("atom1.txt", header = FALSE, sep = "\t")) subAtomList[] <- as.matrix(read.table("atom2.txt", header = FALSE, sep = "\t")) subAtomList[] <- as.matrix(read.table("atom3.txt", header = FALSE, sep = "\t")) subAtomList[] <- as.matrix(read.table("atom4.txt", header = FALSE, sep = "\t")) subAtomList[] <- as.matrix(read.table("atom5.txt", header = FALSE, sep = "\t"))
read.table reads data as a data.frame; as.matrix ensures that elements of subAtomList are matrices. This convention is assumed by the findWeiszfeldMedian in Mposterior package, which is the work-horse function for estimating M-posterior.
Now load the Mposterior package.
If the package was installed correctly, then the following output appears on the R terminal.
Loading required package: Rcpp
Loading required package: RcppArmadillo
Mposterior: R package for Robust and Scalable Bayes via a Median of Subset Posterior Measures
(Version 0.1.2, built: 2014-06-01)
Copyright (C) 2013-2014 Sanvesh Srivastava
The following simple command finds the median of subset posteriors (M-posterior) using findWeiszfeldMedian.
medPosterior <- findWeiszfeldMedian(subAtomList, sigma = 0.1, maxit = 100, tol = 1e-10)
The following output appears in the R terminal during the Weiszfeld iterations for estimating M-posterior.
Weiszfeld iteration: 10
Weiszfeld iteration: 20
Weiszfeld iteration: 30
Weiszfeld iteration: 40
Weiszfeld algorithm coverged at iteration: 48
which is a list with following named elements.
“natoms” “weiszfeldWts” “historyWeiszfeldWts” “medianAtoms” “sigma” “maxit”
the output of which is the same as
The previous two command yield the following output.
Subset Natoms WeiszfeldWts
1 100 0.34
2 100 0.14
3 100 0.16
4 100 0.19
5 100 0.17
M-posterior has 2 dimensions
RBF kernel with sigma = 0.1
Weiszfeld algorithm converged in 48 iterations
We can also plot the Mposterior object and check the progress of weiszfeldWts over the course of Weiszfeld algorithm using the generic plot function
The C++ code for Mposterior leaves the responsibility of correctly importing the data on the user. If the atoms of subset posterior measures are imported correctly, then the code is fairly general and easy to use. To modify the code for importing the data correctly, one only needs to tweak the code in main function. Currently, all the atoms in each subset posteriors are 100 and the code needs to be modified appropriately to reflect the changes in imported data.
If we assume that code has been modified to import the data correctly and Armadillo library has been installed, then following two commands respectively compile the Mposterior.cpp code and execute it.
g++ Mposterior.cpp -I/usr/local/Cellar/armadillo/3.900.1/include -O2 -framework Accelerate ./a.out
Few points to note in the former command are as follows. First, Armadillo is installed in the /usr/local/Cellar/armadillo/3.900.1/ directory; this path is not standard and needs to be included explicitly on each machine; different machines can have different set-ups. Second, -O2 is a gcc optimization flag. Third, because this code was run on a Mac, we use the Accelerate framework. The equivalent command on non-Mac and unix-like system resembles the following command.
g++ Mposterior.cpp -I/path/to/armadillo/include -O2 -llapack -lblas ./a.out
see also Armadillo FAQs. Successful execution of Mposterior code stores “natoms.txt”, “weiszfeldWts.txt”, “historyWeiszfeldWts.txt”, and “MposteriorAtoms.txt” files is the same directory as that of Mposterior.cpp. These files are the C++ equivalents of their R counterparts.
We compare the R and C++ results and find that they both agree using the following R commands.
cppWtsHist <- as.matrix(read.table("historyWeiszfeldWts.txt")) par(mfrow = c(1, 5)) plot(medPosterior$historyWeiszfeldWts[1, ], cppWtsHist[1, ], xlab = "R output", ylab = "Cpp output", main = "Comp. Weiszfeld Wts") abline(0, 1, col = 'red') plot(medPosterior$historyWeiszfeldWts[2, ], cppWtsHist[2, ], xlab = "R output", ylab = "Cpp output", main = "Comp. Weiszfeld Wts") abline(0, 1, col = 'red') plot(medPosterior$historyWeiszfeldWts[3, ], cppWtsHist[3, ], xlab = "R output", ylab = "Cpp output", main = "Comp. Weiszfeld Wts") abline(0, 1, col = 'red') plot(medPosterior$historyWeiszfeldWts[4, ], cppWtsHist[4, ], xlab = "R output", ylab = "Cpp output", main = "Comp. Weiszfeld Wts") abline(0, 1, col = 'red') plot(medPosterior$historyWeiszfeldWts[5, ], cppWtsHist[5, ], xlab = "R output", ylab = "Cpp output", main = "Comp. Weiszfeld Wts") abline(0, 1, col = 'red')
Both R package and C++ code can be improved in many directions. The main focus in the future updates of the packages to include the function for estimating M-posterior's close cousin “metric posterior”. We also plan to make the C++ code more user-friendly by improving the defaults for importing the data and by including a command line interface for specifying various parameters of Weiszfeld algorithm (as in R).
The R package Mposterior is based on the excellent suite of Rcpp packages by Eddelbuettel and coauthors. At the heart of the R package and C++ code lies the Armdadillo linear algebra library that facilitates Mposterior code to strike a “good balance between speed and ease of use” (your mileage may vary, however).
This work was partially supported by grants from
National Science Foundation (NSF) under Grant DMS-1127914 to SAMSI.
National Institute of Environmental Health Sciences (NIEHS) of the National Institutes of Health (NIH) through grant R01-ES-017436 to David B. Dunson and colleagues.
Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the NSF, NIEHS, or NIH.
These software are provided under GPL v3.0.