{"id":444,"date":"2016-02-20T09:44:26","date_gmt":"2016-02-20T09:44:26","guid":{"rendered":"http:\/\/chem.sites.mtu.edu\/mazzoleni\/?p=444"},"modified":"2016-05-01T07:20:09","modified_gmt":"2016-05-01T07:20:09","slug":"looping-functions-with-ggplot2","status":"publish","type":"post","link":"http:\/\/chem.sites.mtu.edu\/mazzoleni\/index.php\/2016\/02\/20\/looping-functions-with-ggplot2\/","title":{"rendered":"Looping Functions with ggplot2"},"content":{"rendered":"<div>Working with a large number of samples and\u00a0many variables can be especially challenging. \u00a0In our research, we typically have a few thousand variables (identified molecular formulas from mass spectra) for just a few samples. \u00a0But recently, we stepped up our game and started working with a dataset that had 42 samples and close to 6000 variables.<\/div>\n<p><!--more--><\/p>\n<div><\/div>\n<p>To review the data for quality assurance, we use a combination of qualitative and quantitative measures. \u00a0The qualitative assessment is best done by plotting the data a few different ways and then reviewing them for inconsistencies. \u00a0Often the inconsistencies are incorrectly assigned molecular formulas. \u00a0So to proceed, we first filter the data to remove formulas that are chemically unreasonable (e.g, hydrogen-to-carbon ratios \u00a0&lt; 0.3 are highly unlikely). \u00a0However there are other molecular formulas that are within the chemical boundaries, but are quite dissimilar from the others. So, to find those, we have to manually review the data. \u00a0This requires preparing 3 different plots (a reconstructed mass spectra of the assigned molecular formula masses, an elemental ratio plot (aka a van Krevelen diagram), and a Kendrick plot. \u00a0This is a daunting task with 42 samples, so it&#8217;s at this point that you ask yourself, \u201cDo I really have to copy and paste the same lines of code over and over again editing them each time to make these plots?\u201d<\/p>\n<p>Luckily the answer is &#8220;No\u201d. \u00a0As I recently discovered a very good model solution by Kevin Davenport on the <a href=\"http:\/\/www.r-bloggers.com\/ggplot2-graphics-in-a-loop\/\" target=\"_blank\">r-bloggers website<\/a>. \u00a0I was immediately intrigued by this even though, I\u2019m still a newbie with <a href=\"https:\/\/www.r-project.org\/\" target=\"_blank\">R<\/a> and I\u2019m not crazy about writing loops. \u00a0Loops are often unnecessary in R, thanks in large part to the <a href=\"http:\/\/www.r-bloggers.com\/using-apply-sapply-lapply-in-r\/\" target=\"_blank\">Apply family<\/a> of functions. Regardless, the blog post by Kevin Davenport for automatically generating histograms was written sufficiently clear to be adapted here relatively easily. \u00a0Below I describe the procedure with specific details that apply to our ultrahigh resolution mass spectrometry application.<\/p>\n<div>The first step was to realize that the looping function requires input from a data frame that contains the sample data to be plotted without unnecessary columns. \u00a0To prepare my data frame, I used the <i>select()<\/i> function from <a href=\"https:\/\/rpubs.com\/justmarkham\/dplyr-tutorial\" target=\"_blank\">dplyr<\/a>. \u00a0My new data frame is called <i>ML_MS<\/i>, which stands for Master List for Mass Spectra. \u00a0This is a subset of <i>ML<\/i> (Master List with all of the aligned formulas and formula details). \u00a0To plot the mass spectra, I need the exact mass of the molecular formulas (<i>theor_mass<\/i>) for the x-axis and the relative abundance of the ions that were observed in each of the samples. \u00a0The relative abundance values are stored for each of the observations, which are labeled <i>S0625a<\/i> through <i>S0728a<\/i>. Note here that often molecular formulas are uniquely observed in a few observations, so each of the observations has many NA values.<\/div>\n<div>.<\/div>\n<pre><span style=\"color: #333333;\">ML_MS &lt;- select(ML, theor_mass, group, S0625a:S0728a)<\/span><\/pre>\n<div>.<\/div>\n<div>The next step is to write the looping function. \u00a0I decided to call my function <i>plotMassSpectra()<\/i>. \u00a0The function is defined to take in <i>X<\/i> data (where <i>X<\/i> represents the data frame to be assigned when applying the function in the last step) and ignore NA values. \u00a0Then a vector list <i>nm<\/i> is created with the names of the columns from the data frame <i>X<\/i>, using the <i>names()<\/i> function. \u00a0Then a <i>for loop<\/i> is used to iterate over all of the columns in the list <i>nm<\/i>, using the <i>seq_along()<\/i> function. \u00a0The next step is to write the ggplot instructions and assign them to a temporary object (called <i>plots<\/i>). \u00a0Notice that I used the <i>aes_string()<\/i> function rather than <i>aes()<\/i>. \u00a0This means that you must place your variable names in quotes (in this case, <i>\u201ctheor_mass\u201d<\/i> and <i>\u201cgroup\u201d<\/i>). \u00a0The whole key to making this work is using the list <i>nm<\/i> to iterate over each of the samples. \u00a0In this case, my y-axis information is the relative abundances stored in the individual columns of <i>ML_MS<\/i>. \u00a0Thus, the y-axis is assigned to <i>nm[i]<\/i>. \u00a0Note for plotting mass spectra, we need to assign the starting and end values (e.g., where <i>y = 0<\/i> and <i>yend = RA<\/i>). \u00a0Finally, you will want to write a <i>save()<\/i> function to save the plot to the directory. \u00a0One additional step that I found to be especially helpful was to insert <i>nm[i]<\/i> into the plot tittle and the plot file name. \u00a0In this way, there is a clear record of what data were plotted in each of the graphs.<\/div>\n<div>.<\/div>\n<pre>plotMassSpectra &lt;- function(X, na.rm = TRUE, ...) {<\/pre>\n<pre>\u00a0 nm &lt;- names(X)<\/pre>\n<pre>\u00a0 for (i in seq_along(nm)) {<\/pre>\n<pre>\u00a0 \u00a0 plots &lt;-ggplot(X, aes_string(x = \"theor_mass\", xend = \"theor_mass\", y = 0, yend = nm[i])) + geom_segment(aes_string(color = \"group\")) +<\/pre>\n<pre>\u00a0 \u00a0 \u00a0 labs(x = \"m\/z\", y = \"Relative Abundance\", title = paste(\"Reconstructed Mass Spectra\", nm[i], \"- QA1\", sep = \" \"))<\/pre>\n<pre>\u00a0 \u00a0 ggsave(plots, filename = paste(\"MS_\", nm[i], \".jpg\", sep=\"\"))<\/pre>\n<pre>\u00a0 }<\/pre>\n<pre>}<\/pre>\n<div>.<\/div>\n<div>Now to execute the function on the prepared data frame, call the new function.<\/div>\n<div>.<\/div>\n<pre>plotMassSpectra(ML_MS)<\/pre>\n<div>.<\/div>\n<div>\n<div id=\"attachment_445\" style=\"width: 870px\" class=\"wp-caption aligncenter\"><a href=\"http:\/\/chem.sites.mtu.edu\/mazzoleni\/wp-content\/uploads\/2016\/02\/MS_S0625a.jpg\"><img loading=\"lazy\" decoding=\"async\" aria-describedby=\"caption-attachment-445\" class=\"wp-image-445 size-large\" src=\"http:\/\/chem.sites.mtu.edu\/mazzoleni\/wp-content\/uploads\/2016\/02\/MS_S0625a-1024x731.jpg\" alt=\"MS_S0625a\" width=\"860\" height=\"614\" srcset=\"http:\/\/chem.sites.mtu.edu\/mazzoleni\/wp-content\/uploads\/2016\/02\/MS_S0625a-1024x731.jpg 1024w, http:\/\/chem.sites.mtu.edu\/mazzoleni\/wp-content\/uploads\/2016\/02\/MS_S0625a-300x214.jpg 300w\" sizes=\"auto, (max-width: 860px) 100vw, 860px\" \/><\/a><p id=\"caption-attachment-445\" class=\"wp-caption-text\">An example of the output from the plotMassSpectra function.<\/p><\/div>\n<\/div>\n<div>.<\/div>\n<div>In my first several attempts, I got various error messages. \u00a0I resolved them one by one and then finally saw the files being written to the directory. \u00a0After tweaking the looping function a few times. \u00a0For example, I added color to the mass spectra which are defined by <i>group<\/i> which is a factor. \u00a0Note that my <i>nm<\/i> list includes a few columns which are required for the plot, but are not suppose to be assigned to the y-axis, thus I get 2 additional plots that I manually delete later.<\/div>\n<div><\/div>\n<div>The same procedure was adapted for the next plot, which is used to examine the elemental ratios. \u00a0In this plot, the oxygen-to-carbon ratios (<i>O_C<\/i>) of the molecular formulas are plotted on the x-axis and the hydrogen-to-carbon ratios (<i>H_C<\/i>) are plotted on the y-axis. \u00a0To examine the specific points of interest and see their relative abundance, the symbols are scaled to the relative abundances for each of the observations (<i>S0625a<\/i> through <i>S0728a<\/i>). \u00a0Similar to above, a specific data frame was created which stores the necessary data. \u00a0It\u2019s called <i>ML_VK<\/i> for Master List for van Krevelen plots.<\/div>\n<div>.<\/div>\n<pre>ML_VK &lt;- select(ML, group, O_C, H_C, S0625a:S0728a)<\/pre>\n<div>.<\/div>\n<div>The looping function is similar to that given above. \u00a0Again, <i>X<\/i> is the place holder for the data of the data frame to be supplied when the function is executed. \u00a0Like above, the key to making this work is the list, defined as <i>nm<\/i>. \u00a0It\u2019s used in several points below.<\/div>\n<div>.<\/div>\n<pre>plotVK &lt;- function(X, na.rm = TRUE, ...) {<\/pre>\n<pre>\u00a0 nm &lt;- names(X)<\/pre>\n<pre>\u00a0 for (i in seq_along(nm)) {<\/pre>\n<pre>\u00a0 \u00a0 plots &lt;-ggplot(X, aes_string(x = \"O_C\", y = \"H_C\")) +<\/pre>\n<pre>\u00a0 \u00a0 \u00a0 geom_point(aes_string(color = \"group\", size = nm[i]), alpha = 1\/4) +<\/pre>\n<pre>\u00a0 \u00a0 \u00a0 coord_cartesian(xlim = c(0, 2.0), ylim = c(0.5, 2.5)) +<\/pre>\n<pre>\u00a0 \u00a0 \u00a0 labs(x = \"Oxygen-to-Carbon Ratio\", y = \"Hydrogen-to-Carbon Ratio\", title = paste(\"Van Krevelen Plot\", nm[i], \"- QA1\", sep = \" \"))<\/pre>\n<pre>\u00a0 \u00a0 ggsave(plots, filename = paste(\"VK_\", nm[i], \".jpg\", sep=\"\"))<\/pre>\n<pre>\u00a0 }<\/pre>\n<pre>}<\/pre>\n<p>.<\/p>\n<pre>plotVK(ML_VK)<\/pre>\n<div>.<\/div>\n<div>\n<div id=\"attachment_446\" style=\"width: 870px\" class=\"wp-caption aligncenter\"><a href=\"http:\/\/chem.sites.mtu.edu\/mazzoleni\/wp-content\/uploads\/2016\/02\/VK_S0625a.jpg\"><img loading=\"lazy\" decoding=\"async\" aria-describedby=\"caption-attachment-446\" class=\"wp-image-446 size-large\" src=\"http:\/\/chem.sites.mtu.edu\/mazzoleni\/wp-content\/uploads\/2016\/02\/VK_S0625a-1024x731.jpg\" alt=\"VK_S0625a\" width=\"860\" height=\"614\" srcset=\"http:\/\/chem.sites.mtu.edu\/mazzoleni\/wp-content\/uploads\/2016\/02\/VK_S0625a-1024x731.jpg 1024w, http:\/\/chem.sites.mtu.edu\/mazzoleni\/wp-content\/uploads\/2016\/02\/VK_S0625a-300x214.jpg 300w\" sizes=\"auto, (max-width: 860px) 100vw, 860px\" \/><\/a><p id=\"caption-attachment-446\" class=\"wp-caption-text\">An example from the plotVK function.<\/p><\/div>\n<\/div>\n<div>.<\/div>\n<div><\/div>\n<div>Similarly, you can adapted the second looping function to make the Kendrick plots. \u00a0After successfully executing these functions, it\u2019s time to start reviewing the plots. \u00a0I placed the files into a powerpoint file, to review them one by one. \u00a0However, these looping functions can\u00a0be altered to be used in an <a href=\"http:\/\/rmarkdown.rstudio.com\/\" target=\"_blank\">R Markdown<\/a> document. \u00a0Thus <span style=\"text-decoration: underline;\">saving even more time<\/span>! \u00a0Here&#8217;s an example (notice <em>ggsave()<\/em> was replaced with <em>print()<\/em>):<\/div>\n<div>\n<div>.<\/div>\n<pre>plotMassSpectra &lt;- function(X, na.rm = TRUE, ...) {<\/pre>\n<pre>\u00a0 nm &lt;- names(X)<\/pre>\n<pre>\u00a0 for (i in seq_along(nm)) {<\/pre>\n<pre>\u00a0 \u00a0 plots &lt;-ggplot(X, aes_string(x = \"theor_mass\", xend = \"theor_mass\", y = 0, yend = nm[i])) + geom_segment(aes_string(color = \"group\")) +<\/pre>\n<pre>\u00a0 \u00a0 \u00a0 labs(x = \"m\/z\", y = \"Relative Abundance\", title = paste(\"Reconstructed Mass Spectra\", nm[i], \"- QA1\", sep = \" \"))<\/pre>\n<pre>\u00a0 \u00a0 print(plots)<\/pre>\n<pre>\u00a0 }<\/pre>\n<pre>}<\/pre>\n<div>.<\/div>\n<\/div>\n","protected":false},"excerpt":{"rendered":"<p>Working with a large number of samples and\u00a0many variables can be especially challenging. \u00a0In our research, we typically have a few thousand variables (identified molecular formulas from mass spectra) for just a few samples. \u00a0But recently, we stepped up our game and started working with a dataset that had 42 samples and close to 6000 variables.<\/p>\n","protected":false},"author":2,"featured_media":0,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[53],"tags":[56,59,55,58,54,57,60,23],"class_list":["post-444","post","type-post","status-publish","format-standard","hentry","category-how-to","tag-dplyr","tag-functions","tag-ggplot2","tag-looping","tag-r","tag-ultrahigh-resolution-ms","tag-with","tag-zhao"],"aioseo_notices":[],"views":18404,"_links":{"self":[{"href":"http:\/\/chem.sites.mtu.edu\/mazzoleni\/index.php\/wp-json\/wp\/v2\/posts\/444","targetHints":{"allow":["GET"]}}],"collection":[{"href":"http:\/\/chem.sites.mtu.edu\/mazzoleni\/index.php\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"http:\/\/chem.sites.mtu.edu\/mazzoleni\/index.php\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"http:\/\/chem.sites.mtu.edu\/mazzoleni\/index.php\/wp-json\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"http:\/\/chem.sites.mtu.edu\/mazzoleni\/index.php\/wp-json\/wp\/v2\/comments?post=444"}],"version-history":[{"count":22,"href":"http:\/\/chem.sites.mtu.edu\/mazzoleni\/index.php\/wp-json\/wp\/v2\/posts\/444\/revisions"}],"predecessor-version":[{"id":471,"href":"http:\/\/chem.sites.mtu.edu\/mazzoleni\/index.php\/wp-json\/wp\/v2\/posts\/444\/revisions\/471"}],"wp:attachment":[{"href":"http:\/\/chem.sites.mtu.edu\/mazzoleni\/index.php\/wp-json\/wp\/v2\/media?parent=444"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"http:\/\/chem.sites.mtu.edu\/mazzoleni\/index.php\/wp-json\/wp\/v2\/categories?post=444"},{"taxonomy":"post_tag","embeddable":true,"href":"http:\/\/chem.sites.mtu.edu\/mazzoleni\/index.php\/wp-json\/wp\/v2\/tags?post=444"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}