class: center, middle, title-slide

.title[
# Final Notes
]
.author[
### Luke Tierney
]
.institute[
### University of Iowa
]
.date[
### 2024-05-03
]

---

<link rel="stylesheet" href="stat4580.css" type="text/css" />

<style type="text/css">
.remark-code { font-size: 85%; }
</style>

## Some Other Topics

Some topics we did not have time to look at:

--

* Working with models ([Chapter 6 in Healy, 2018](https://socviz.co/modeling.html); [Tidy Modeling with R](https://www.tmwr.org/)).

--

* [Visualizing missing values](http://naniar.njtierney.com/articles/naniar-visualisation.html).

--

* Visualizing uncertainty ([Chapter 16 of Wilke, 2019](https://clauswilke.com/dataviz/visualizing-uncertainty.html) and [below](#visualizing-uncertainty)).

--

* Plot annotation, plot ensembles, and dashboards ([Part II of Wilke, 2019](https://clauswilke.com/dataviz/proportional-ink.html); [Chapter 5 of Healy, 2018](https://socviz.co/workgeoms.html); [below](#plot-annotation-plot-ensembles-and-dashboards)).

* Data Science Ethics ([below](#data-science-ethics)).

---
layout: true

## Visualizing Uncertainty: Hurricanes

---
name: visualizing-uncertainty

All estimates from data are associated with some degree of uncertainty.

--

Effectively communicating that uncertainty in visualizations is challenging and an active area of [research](http://space.ucmerced.edu/chapter).

--

The _cone of uncertainty_:

(From Cairo (2019); images from a [blog post](http://www.thefunctionalart.com/2020/01/all-graphics-from-how-charts-lie-freely.html) by the author.)

<!-- https://www.dropbox.com/sh/d1kb0jdrhkb43j9/AADTBfRvAh-mxmSxBRNZpLJja/5.CHAPTER5?dl=0&preview=PDF10.Tropicalstorm.pdf&subfolder_nav_tracking=1 -->

.pull-left[
<img src="../img/PDF10.Tropicalstorm.png" width="450" />
]

--

.pull-right[
The [NHC forecast cone](https://www.nhc.noaa.gov/aboutcone.shtml) is designed so that two-thirds of historical official forecast errors over a 5-year sample fall within the cone for a particular time point.
]

---

When published in the media these visualizations are routinely misinterpreted, something like this:

<!-- https://www.dropbox.com/sh/d1kb0jdrhkb43j9/AADTBfRvAh-mxmSxBRNZpLJja/5.CHAPTER5?dl=0&preview=PDF11.StormWRONGSize.pdf&subfolder_nav_tracking=1 -->

.pull-left.width-65[
<img src="../img/PDF11.StormWRONGSize.png" width="573" />
]

---

A more effective representation might be something like this, showing an _ensemble_ of possible tracks:

<!-- https://www.dropbox.com/sh/d1kb0jdrhkb43j9/AADTBfRvAh-mxmSxBRNZpLJja/5.CHAPTER5?dl=0&preview=PDF13.StormLines.pdf&subfolder_nav_tracking=1 -->

.pull-left.width-65[
<img src="../img/PDF13.StormLines.png" width="573" />
]

--

.pull-right.width-30[
An animated version may be more effective, if the presentation medium permits.

{{content}}
]

--

Developing better visualizations for hurricane forecasting, especially targeting the public, is an active area of research.

---
layout: true

## Visualizing Uncertainty: Chocolate Bars

---

.pull-left[
[Expert ratings](http://flavorsofcacao.com), on a scale from 0 to 5, for chocolate bars manufactured in several countries:
]

.pull-right[
![](wrapup_files/figure-html/unnamed-chunk-4-1.png)<!-- -->
]

---

.pull-left[
The standard deviations of the data distributions are comparable, but the lengths of confidence intervals for the mean vary because of the different sample sizes:
]

.pull-right[
![](wrapup_files/figure-html/unnamed-chunk-5-1.png)<!-- -->
]

---

.pull-left[
The same plot with a reduced horizontal range:
]

.pull-right[
![](wrapup_files/figure-html/unnamed-chunk-6-1.png)<!-- -->
]

---

.pull-left[
A more elaborate display with confidence intervals at several levels:
]

.pull-right[
![](wrapup_files/figure-html/unnamed-chunk-7-1.png)<!-- -->
]

---

.pull-left[
Confidence densities, or confidence distributions, as proposed in

> Adrian W. Bowman. Graphics for Uncertainty. J. R. Statist. Soc. A
> 182:1-16, 2018.
[Link](https://rss.onlinelibrary.wiley.com/doi/full/10.1111/rssa.12379)
]

.pull-right[
![](wrapup_files/figure-html/unnamed-chunk-8-1.png)<!-- -->
]

---

One drawback of all of these methods:

--

> The least precise measurement draws the most attention.

--

These examples from Wilke's book use the [`ungeviz` package](https://github.com/wilkelab/ungeviz) available on GitHub.

--

Another package providing some tools for uncertainty visualization is the [`ggdist` package](https://mjskay.github.io/ggdist/).

---
layout: true

## Visualizing Uncertainty: Old Cars

---

Using the very old `mtcars` data set to illustrate estimating a smooth relationship:

--

.pull-left[
![](wrapup_files/figure-html/unnamed-chunk-9-1.png)<!-- -->
]

--

.pull-right[
A default `geom_smooth` shows an estimate along with a point-wise confidence band.

{{content}}
]

--

This may not give the best sense of the joint uncertainty: if the curve is higher in some places it may need to be lower in others.

---

Showing an _ensemble_ of curves that are all plausible can be a better choice.

--

.pull-left[
![](wrapup_files/figure-html/unnamed-chunk-10-1.png)<!-- -->
]

--

.pull-right[
This approach was shown earlier for visualizing possible hurricane paths.

{{content}}
]

--

This ensemble is generated using a _case-based bootstrap_.

{{content}}

--

These plots are called _ensemble plots_ (also spaghetti plots, for obvious reasons).

---

If animation is available, an alternative is to show the curves one at a time in an animation.

--

.pull-left[
![](wrapup_files/figure-html/unnamed-chunk-11-1.gif)<!-- -->
]

--

.pull-right[
Again, a bootstrap is used to produce the estimates.

{{content}}
]

--

This is an example of a _hypothetical outcomes plot_, or _HOP_, as introduced in

> Hullman, Jessica, Paul Resnick, and Eytan Adar. "Hypothetical
> outcome plots outperform error bars and violin plots for inferences
> about reliability of variable ordering." PLOS ONE 10, no. 11 (2015).
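
---

A case-based bootstrap ensemble of the kind described above can be sketched in a few lines. This is only a minimal sketch, not the code used for the figures in these notes: the choice of `mpg` against `disp`, the loess smoother, the seed, and the use of 20 resamples are all assumptions made for illustration.

```r
library(ggplot2)

set.seed(123)   # assumed seed, only to make the resampling reproducible
n_boot <- 20    # assumed number of bootstrap curves

# Case-based bootstrap: resample whole rows (cars) with replacement,
# and tag each resample with the index of its draw.
boot_samples <- lapply(seq_len(n_boot), function(b) {
    idx <- sample(nrow(mtcars), replace = TRUE)
    cbind(mtcars[idx, c("disp", "mpg")], draw = b)
})
boot_data <- do.call(rbind, boot_samples)

# One smooth per resample, drawn over the original points.
p_ensemble <- ggplot(mtcars, aes(x = disp, y = mpg)) +
    geom_point() +
    geom_smooth(aes(group = draw), data = boot_data,
                method = "loess", formula = y ~ x, se = FALSE)
```

Printing `p_ensemble` draws the plot. Mapping `draw` to a frame variable in an animation package, instead of to `group`, would be one way to turn the same data into a hypothetical outcomes plot.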
---
layout: true

## Data Quality and Integrity

---

A visualization can accurately reflect data but still be misleading if the data are faulty.

--

A [NY Times article](https://www.nytimes.com/2021/05/03/health/covid-herd-immunity-vaccine.html) from May 2021 shows a choropleth map of the estimated share of adults who would "definitely" or "probably" get the COVID-19 vaccine.

--

.pull-left[
Cutoffs: 49 60 65 70 75 80 91 %

<img src="../img/map-1050.png" width="500" />
]

--

.pull-right[
The map may accurately reflect the estimates, but the estimates have obvious problems.
]

---

.pull-left[
The data used for the map are available [here](https://aspe.hhs.gov/reports/vaccine-hesitancy-covid-19-state-county-local-estimates).

{{content}}
]

--

Discussions on social media suggest that the state-level data may be more reasonable:

--

<!-- ## nolint start: line_length -->
.pull-right.width-50[
<!-- https://twitter.com/ct_bergstrom/status/1390509298388660231?s=11 -->
![](wrapup_files/figure-html/unnamed-chunk-13-1.png)<!-- -->
<!-- ## nolint end -->
<!-- not being able to center a long legend title seems to be a current ggplot bug -->
]

---
layout: false
name: data-science-ethics

## Data Science Ethics

Some issues:

--

* Data misrepresentation

--

* Data falsification

--

* Data privacy

--

* Data scraping and terms of use

--

* Algorithmic bias

--

Some references:

<!-- https://arxiv.org/abs/1908.06166 -->

* [Data science ethics](https://mdsr-book.github.io/mdsr2e/ch-ethics.html) chapter in: Benjamin S. Baumer, Daniel T. Kaplan, and Nicholas J. Horton (2021) [_Modern Data Science with R, 2nd edition_](https://mdsr-book.github.io/mdsr2e/).

--

* [Data science ethics](https://datasciencebox.org/02-ethics.html) section of the online book [Data Science in a Box](https://datasciencebox.org/index.html) by Mine Çetinkaya-Rundel.

--

* Alberto Cairo (2019) _How Charts Lie: Getting Smarter about Visual Information_, W. W. Norton & Company.
---
name: plot-annotation-plot-ensembles-and-dashboards

## Plot Annotation, Plot Ensembles, and Dashboards

Plot annotations can create popout and help focus the viewer's attention.

--

They may be increasingly important as images are shared online without context.

--

Here is an example for the `mpg` data:

<!-- ## nolint start: line_length -->
![](wrapup_files/figure-html/unnamed-chunk-14-1.png)<!-- -->
<!-- ## nolint end -->

---
layout: true

## Plot Ensembles: Coffee

---

It is often useful to use several graphics to present an analysis.

--

Collections of related graphs are sometimes called _ensemble graphics_.

--

Online presentations of analyses involving multiple visualizations and, typically, some interactive features are also called _dashboards_.

--

To aid the viewer it is usually best to design these visualizations together, with common axis choices and color mappings.

--

Fig 12.1 in Unwin (2015) provides a simple example:

---

<!-- ## nolint start: line_length -->
<img src="wrapup_files/figure-html/unnamed-chunk-15-1.png" alt="A dashboard with three plots. A bar chart shows there are about 4 times as many Arabica samples as Robusta samples. A scatterplot of Caffeine against Fat content shows clear separation of the two groups. A parallel coordinates plot shows the 12 values measured on each group." style="float:right; padding:10px" />
<!-- ## nolint end -->

Data on the chemical composition of coffee samples collected from around the world, comprising 43 samples from 29 countries. Each sample is either of the Arabica or Robusta variety. Twelve of the thirteen chemical constituents reported in the study are given. The omitted variable is total chlorogenic acid; it is generally the sum of the chlorogenic, neochlorogenic and isochlorogenic acid values.

Streuli, H. (1973). Der heutige stand der kaffeechemie. In _Association Scientifique International du Cafe, 6th International Colloquium on Coffee Chemistry_, Bogotá, Colombia, pp. 61-72.
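
---

The advice about designing panels together, with common color mappings, can be sketched as follows. This is a hedged illustration rather than the code behind the coffee figure: it uses the `mpg` data from `ggplot2` instead of the coffee data, and the palette, the choice of variables, and the optional use of the `patchwork` package for layout are all assumptions.

```r
library(ggplot2)

# One colour per drive type, defined once (hypothetical palette choice).
drv_colors <- c("4" = "#1b9e77", "f" = "#d95f02", "r" = "#7570b3")

# A single scale object covering both colour and fill, reused in every
# panel so that a given drive type always gets the same colour.
drv_scale <- scale_colour_manual(
    name = "Drive",
    values = drv_colors,
    aesthetics = c("colour", "fill")
)

# Panel 1: group sizes; its legend is dropped since panel 2 carries one.
p_bar <- ggplot(mpg, aes(x = drv, fill = drv)) +
    geom_bar() +
    drv_scale +
    theme(legend.position = "none")

# Panel 2: relationship between two variables, coloured by the same groups.
p_scatter <- ggplot(mpg, aes(x = displ, y = hwy, colour = drv)) +
    geom_point() +
    drv_scale

# Place the panels side by side if patchwork happens to be installed.
if (requireNamespace("patchwork", quietly = TRUE)) {
    print(patchwork::wrap_plots(p_bar, p_scatter, nrow = 1))
}
```

Defining the scale once and adding it to each plot keeps the mapping consistent even if the palette is changed later.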
---
layout: true

## Making a Point and Telling a Story

---

In a report, make sure each plot has a point and makes its point.

--

Make sure to think about:

--

* axis labels;

--

* titles and subtitles;

--

* captions;

--

* highlighting key features;
<!-- gghighlight -->

--

* accessibility (e.g. color choice; alt-text).

--

It is often good to make sure a figure can stand on its own without asking the reader to search the text for explanations.

---

Communicating with data is like telling a story, with a starting point, a journey, and an end.

--

Sometimes a single visualization can capture the full story.

--

More often, several visualizations will be needed.

--

Often it is good to:

--

* start with a high level overview;

--

* show how to look at some particular cases, e.g. with a single plot;

--

* build up to a more complete analysis, e.g. with a multi-panel plot.

--

With multiple visualizations it is good to make sure that:

--

* each one works well on its own;

--

* they work well together (e.g. use consistent styling, colors).

---

There is a [chapter of Wilke, 2019](https://clauswilke.com/dataviz/telling-a-story.html) with more advice on this.

--

A recent book length treatment is

> Deborah Nolan and Sara Stoudt (2021) _Communicating with Data_,
> Oxford University Press.

---
layout: true

## Wrapping Up

Some of the areas we covered:

---

--

### Visualization

--

Many different types of graphs.

* Strengths, weaknesses.
* Pitfalls.
* Scalability.
* Creating these graphs in R.

--

Perception

* Channels and mappings; relative effectiveness.
* Using these to assess and design visualizations.
* Effective use of color.

--

A little on interaction, animation.

--

Emphasis on techniques useful for exploration, scientific reporting.

---

### Data Technologies

--

Reading different data formats.

--

Scraping data from the web.

--

Cleaning data.

--

Rearranging data for analysis.

--

Merging data from several sources.

--

### Reproducible research tools

--

`rmarkdown` for integrating code and reporting.
--

Version control, `git`, `GitLab`.

---
layout: false

## Learning More

Class notes will remain available, in some form, at the class web site.

--

Some books to look at:

--

* Alberto Cairo (2019) _How Charts Lie: Getting Smarter about Visual Information_, W. W. Norton & Company.

--

* Claus O. Wilke (2019) [_Fundamentals of Data Visualization_](https://clauswilke.com/dataviz/), O’Reilly, Inc. ([Book source on GitHub](https://github.com/clauswilke/dataviz); [supporting materials on GitHub](https://github.com/clauswilke/dviz.supp))

--

* Kieran Healy (2018) [_Data Visualization: A practical introduction_](https://socviz.co/), Princeton University Press.

--

* Winston Chang (2018) [_R Graphics Cookbook_, 2nd edition](https://r-graphics.org/), O’Reilly. ([Book source on GitHub](https://github.com/wch/rgcookbook))

--

Some blogs to check out:

--

* [Junk Charts](https://junkcharts.typepad.com/)

--

* [The Functional Art Blog](http://www.thefunctionalart.com/)

--

* [Flowing Data](https://flowingdata.com/)

--

Keep a critical eye out for good (and not so good) uses of data visualization in the media.