An example of web scraping with R: Online Food Blogs

In this blog post I will discuss web scraping using R. As an example, I will consider scraping data from online food blogs to construct a data set of recipes. This data set contains ingredients, a short description, nutritional information and user ratings. Then, I will present a simple exploratory analysis that yields some interesting insights.

The code and notebooks (R markdown) for the analysis and web scraping are included in my repository. If you come across this blog and have some ideas, or independent projects, please let me know for a possible collaboration.

Web Scraping

With numerous food blogs and web sites offering lots of recipes, the web is a great resource for mining food and nutrition data. As a fun project, I took on this idea and created a simple repository containing the code for scraping food blog data. The functions that scrape the web data are in the script “utilities.R” and use the R packages rvest, jsonlite and the tidyverse set. The website I have chosen to extract data from is called Pinch of Yum, which contains many recipes accompanied by beautiful photos (this calls for another project idea using image recognition!).

The strategy I used was to first understand how recipes are organized on the website. Once you go to the main page and click recipes, you can see that there are 50 pages (at the time I obtained the data), each containing 15 recipe links. So, I skimmed through the html source of the main page (which you can obtain from your browser) and identified the locations of the hyperlinks to each recipe. Then, I wrote a simple function to locate these links automatically in all 50 pages.

Each hyperlink contains html tags that we need to remove for further processing. Below is a code snippet that does exactly that (a simplified version of the one in my repo):

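trim_ <- function(link){
   temp1 <- str_split(link, " ")[[1]][3] %>% # Third space-separated piece holds the href
      str_replace_all("\"", "") %>%          # Remove double quotes
      str_replace("href=", "") %>%           # Drop the attribute name
      str_replace(">", " ")                  # Separate the URL from the link text
   temp1 <- str_split(temp1, " ")[[1]][1]    # Keep only the URL
   temp1
}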

Given a link to a recipe obtained from the html source, this function strips the html tags and returns the plain URL of the recipe, which we can later connect to. This function is used in another function below, which locates the recipes in each of the 50 pages. Again, the function below is a simplified version of the one included in my repo.
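To make the trimming steps concrete, here is a small trace on a single anchor tag. The tag and its URL are made up for illustration (they are not taken from the site), and the sketch assumes the href is the third space-separated piece of the tag, as the function above does.

```r
library(stringr)

# Hypothetical anchor tag, shaped like the ones the trimming function expects
link <- '<a class="block" href="https://example.com/some-recipe">Some Recipe</a>'

# Step 1: split on spaces and keep the third piece, which holds the href
piece <- str_split(link, " ")[[1]][3]

# Step 2: remove the double quotes and the attribute name
piece <- str_replace_all(piece, "\"", "")
piece <- str_replace(piece, "href=", "")

# Step 3: turn the closing ">" into a space and keep what comes before it
piece <- str_replace(piece, ">", " ")
url <- str_split(piece, " ")[[1]][1]

print(url)
```

Note that this position-based approach depends on the exact layout of the tag: if the class attribute itself contained a space, the third piece would no longer be the href.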

get_recipe_links <- function(page_number){
   page <- read_html(paste0("http://pinchofyum.com/recipes?fwp_paged=", 
                     as.character(page_number)))
   links <- html_nodes(page, "a")

   # Get locations of recipe links
   loc <- which(str_detect(links, "<a class"))
   links <- links[loc]

   # Trim the text to get proper links
   all_recipe_links <- map_chr(links, trim_)

   # Return
   all_recipe_links
}

This function takes a page number (1 to 50) as input and uses the “read_html” function to get the html source code. Since each page contains 15 recipes, we need to locate the links to them. The variable “links” does that by selecting the “a” nodes, which hold the links, with the function “html_nodes”. Scanning through the html source, I realized that the recipe links contain the expression “<a class”, so I used it as a pattern to find them and stored their positions in the variable “loc”. After that, I selected the nodes which contain the recipes (next line). Finally, the trimming function above is applied to every selected node (using the map_chr function) to remove the html tags. As a result, this function returns a vector of links to the recipes on a given page.
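As an alternative to matching and trimming strings, rvest can read the href attribute directly. The sketch below is not the code from my repo: it runs on a made-up page fragment, and it assumes that "anchors with a class attribute" is an acceptable stand-in for the “<a class” pattern used above.

```r
library(rvest)

# Hypothetical page fragment: two recipe links carry a class, the nav link does not
page <- read_html('
  <html><body>
    <a href="https://example.com/about">About</a>
    <a class="block" href="https://example.com/recipe-one">Recipe one</a>
    <a class="block featured" href="https://example.com/recipe-two">Recipe two</a>
  </body></html>')

# "a[class]" is a CSS selector for anchors that have a class attribute
recipe_nodes <- html_nodes(page, "a[class]")

# html_attr() reads the href value directly, so no string trimming is needed
recipe_links <- html_attr(recipe_nodes, "href")
print(recipe_links)
```

Reading the attribute this way does not depend on the order of attributes or on spaces inside the class value, which is where position-based trimming can go wrong.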

Now, the next step is to connect to all links returned by “get_recipe_links” and then scrape the recipe data one by one. In this case I was very lucky, since the recipe data was stored in JSON format in the html source, which made the job very easy.

get_recipe_data <- function(link_to_recipe){
  page <- read_html(link_to_recipe)
  script_ <- html_nodes(page, "script")
  loc_json <- which(str_detect(script_, "application/ld"))
  recipe_data <- fromJSON(html_text(script_[loc_json]))
  ...
  ...

In this function “link_to_recipe” is a link returned from “get_recipe_links”. First, the page behind this link is downloaded, and then the JSON data is searched for among the “script” nodes. The node holding the recipe data contains the expression “application/ld”, which is used to pinpoint it. Then, the data is simply parsed by the “fromJSON” function. I left the rest of the code out, since it is rather long, although easy to understand. What happens next is that the features are extracted from the JSON and stored in a data frame, which this function returns. The only cumbersome part here was that the JSON data was not formatted uniformly across the whole site, so I had to insert many control statements to handle this issue, which you can see in the full code in the repo.
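To give a flavour of the part I left out, here is a minimal sketch of parsing an embedded JSON block and coping with a missing field. The page, the field names and the helper “get_or_na” are assumptions made for illustration; the full function in the repo handles many more cases.

```r
library(rvest)
library(jsonlite)

# Hypothetical recipe page with an embedded JSON block (nutrition is missing)
page <- read_html('
  <html><head>
    <script type="application/ld+json">
      {"@type": "Recipe", "name": "Toy Soup",
       "recipeIngredient": ["1 onion", "2 cups stock"],
       "aggregateRating": {"ratingValue": "4.8", "ratingCount": "12"}}
    </script>
  </head><body></body></html>')

# Select the script node by its type attribute and parse its text as JSON
json_node <- html_nodes(page, 'script[type="application/ld+json"]')
recipe <- fromJSON(html_text(json_node[[1]]))

# Hypothetical helper: return NA when a field is absent instead of failing
get_or_na <- function(x) if (is.null(x)) NA_character_ else as.character(x)

# Collect the features into a one-row data frame
recipe_row <- data.frame(
  name        = get_or_na(recipe$name),
  rating      = as.numeric(get_or_na(recipe$aggregateRating$ratingValue)),
  calories    = get_or_na(recipe$nutrition$calories),
  ingredients = paste(recipe$recipeIngredient, collapse = "; "),
  stringsAsFactors = FALSE
)
print(recipe_row)
```

Returning NA for absent fields keeps every recipe at one row with the same columns, so the rows can be stacked later even when the JSON differs between pages.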

Now that we have these functions, we can scrape the data. Using

all_links <- 1:50 %>% map(get_recipe_links) %>% unlist()

one can get all the links to the recipes. Then using the following inside a for loop (after initializing “all_recipes_df”)

all_recipes_df <- rbind(all_recipes_df, 
                        get_recipe_data(link_to_recipe))

one obtains all the recipes in a data frame. I specifically used a for loop instead of something like “map_df”, since I wanted the progress to be printed on the screen each time a recipe link is visited. All of this is done in the script “scrape.R” in my repo.
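The loop itself can be sketched as follows. To keep the example runnable without a network connection, it uses a stand-in function and made-up links instead of “get_recipe_data” and the real output of “get_recipe_links”; the progress message format is also just one possible choice.

```r
# Stand-in for get_recipe_data(): returns a one-row data frame without touching the network
get_recipe_data_stub <- function(link_to_recipe) {
  data.frame(link = link_to_recipe, n_chars = nchar(link_to_recipe),
             stringsAsFactors = FALSE)
}

# Hypothetical links; in the real run these come from get_recipe_links()
all_links <- c("https://example.com/recipe-one",
               "https://example.com/recipe-two",
               "https://example.com/recipe-three")

# Initialize an empty data frame to collect the rows
all_recipes_df <- data.frame()

for (i in seq_along(all_links)) {
  # Print progress so a long scrape shows which link is being processed
  message(sprintf("[%d/%d] %s", i, length(all_links), all_links[i]))
  all_recipes_df <- rbind(all_recipes_df, get_recipe_data_stub(all_links[i]))
}

print(all_recipes_df)
```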

At the end, all the recipes are stored in a data frame “all_recipes_df” which contains lots of interesting information. Below, I will discuss very briefly a simple analysis that can be done with this data.

Exploratory Analysis

I have written a detailed markdown document that performs the analysis, which can be found in my repo and is also available on RPubs. So, I will only discuss some of the results here.

Let’s look at the distribution of ratings in the website:

[Figure: distribution of user ratings]

Users can rate a recipe from 1 to 5 stars. As can be seen, pretty much all ratings are close to 5 stars (the median value is 4.8). This is not a surprise, since individual food blogs tend to have a relatively small following, and the followers are those who enjoy the recipes, so they rate them highly. This makes modelling ratings with features from the recipes rather hard, so instead, I looked at the distribution of words across all the recipes. I have made use of the tidytext and tokenizers packages, which have been really useful. After some data munging, I performed a principal components analysis of the words appearing in the recipes and obtained some interesting results. For example, the following plot shows the vectors of all the words projected on the first two principal components:

[Figure: word vectors projected on the first two principal components]

I find this plot rather fascinating since it captures some interesting effects:

  • Ingredient vectors used in baking tend to be close to each other (milk, sugar, butter, flour etc.)
  • Cheese, slice and shredded are close to each other (for obvious reasons)
  • Garlic, minced, cloves, olive, oil are close to each other
  • All of these groups of vectors point along different directions

This is in fact similar to what one would observe in word2vec models.
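For readers who want to try the idea on a small scale, here is a minimal sketch of the PCA step. It is not the analysis from my markdown document: the recipe texts are invented, and it uses base R instead of tidytext and tokenizers so that it runs without extra packages.

```r
# Hypothetical ingredient texts standing in for the scraped recipes
recipes <- c(
  "flour sugar butter milk",
  "flour sugar butter eggs",
  "garlic olive oil cloves",
  "garlic olive oil basil",
  "cheese shredded butter flour",
  "garlic cheese shredded oil"
)

# Tokenize on whitespace and build a recipe-by-word count matrix
tokens <- strsplit(recipes, " ")
vocab  <- sort(unique(unlist(tokens)))
counts <- t(sapply(tokens, function(words) table(factor(words, levels = vocab))))

# PCA on the centered counts; each row of the rotation matrix is a word's loading vector
pca <- prcomp(counts, center = TRUE, scale. = FALSE)
word_vectors <- pca$rotation[, 1:2]

# Coordinates of every word on the first two principal components
print(round(word_vectors, 2))
```

Plotting the two columns of “word_vectors” against each other, with the row names as labels, gives a picture of the same kind as the one above, although a toy data set this small should not be expected to reproduce its groupings.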

One problem with this data was the fact that more than half of the entries lacked nutritional information. One would expect a strong correlation between nutritional values and ingredients (unlike biased ratings) which could have led to a more interesting analysis.

Final Words

Online food blogs provide a great resource for data mining and exploration. What is outlined here only scratches the surface of what can be done with this data. I have several ideas which, in my opinion, could be quite interesting to explore:

  • Scraping more data from food blogs and combining with the current data set. I found that many sites use a similar JSON format for recipe data.
  • Using images contained in the blogs to perform image classification (e.g. high calorie food detection)
  • Using data from Food Network, AllRecipes etc. (One may need to ask for permission to use the data in these cases).

As mentioned in the beginning, if there are any other ideas, or existing personal projects, please feel free to contact me for collaboration!

 
