
Udemy: Mining course list with a crawler in Ruby

Udemy is one of the most popular online course platforms. Personally, I prefer it for its usability (both web and mobile), its lifetime access, and its really good paid courses that occasionally become available for free. I recently added a free courses page to my blog, with Udemy as the first source for the course list. In this post, I’m going to show how to implement the script that gathers the free course list.

In some previous posts, we saw similar applications in Ruby using Nokogiri and Mechanize. In this case, however, those libraries weren’t effective: the requests returned Bad Request errors (400). So I had to go deeper and analyze the HTTP request trace in the web browser’s console, inspecting cookies and headers to reproduce the same browsing flow. Even so, I still got Bad Request errors.

After some more tests, I managed to make the same request from the Linux console using curl (web browsers have a debug function that copies a request as a curl command). This time it worked: it returned a course list in JSON format. So I decided to use Ruby’s default HTTP library (Net::HTTP) to reproduce the request sent with curl (there’s an online service that converts a curl command to Ruby code):

THE SOURCE CODE
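
As a minimal sketch of what a curl command converted to Net::HTTP typically looks like, the snippet below builds a GET request with browser-like headers. The URL, the `category` and `page` query parameters, the header values, and the helper names `build_request` and `send_request` are all placeholders chosen for illustration, not the ones from the real request. The line numbers in the notes that follow refer to the original script, not to this sketch.

```ruby
require 'net/http'
require 'uri'

# Builds (but does not send) a GET request that mimics a browser call.
# The URL and header values used here are placeholders, not Udemy's real ones.
def build_request(url, page)
  uri = URI.parse(url)
  # Append the page number to whatever query string the URL already has.
  params = URI.decode_www_form(uri.query || '') << ['page', page.to_s]
  uri.query = URI.encode_www_form(params)

  request = Net::HTTP::Get.new(uri)
  request['Accept'] = 'application/json, text/plain, */*'
  request['User-Agent'] = 'Mozilla/5.0 (X11; Linux x86_64)'
  request['X-Requested-With'] = 'XMLHttpRequest'
  request
end

# Sends a prepared request and returns the Net::HTTPResponse.
def send_request(request)
  uri = request.uri
  Net::HTTP.start(uri.hostname, uri.port, use_ssl: uri.scheme == 'https') do |http|
    http.request(request)
  end
end

# Building the request needs no network access, so it is easy to inspect.
request = build_request('https://example.com/api/courses?category=it', 1)
puts request.uri
```

Keeping the request construction separate from the sending step makes it easier to compare the generated headers against the ones shown in the browser console when a 400 error appears.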

Explaining the most important points:

  • The URL passed as an argument to that function was obtained by inspecting the browser console (I highly recommend exploring the request there to understand the HTTP protocol better). The URL contains the path for the IT category (each category has its own) and the page to be retrieved from the list (initialized as zero for the first page).
  • The first lines of the code (up to line 27) were obtained by converting the inspected HTTP request to Ruby code, using the service I mentioned before.
  • Since the response is JSON, I used Ruby’s default JSON library to parse the data into a local variable. I used the instance variable “@courses” to keep all the results in a hash containing the data I needed (lines 34-50).
  • The responses are paginated, which means you can’t get the whole list at once and have to request it page by page. To make this easier, each response includes the link to the next page. The data-gathering function is called for as long as the response still has a next page to retrieve.
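
The parsing and pagination steps described above can be sketched roughly as follows. This is an illustration under assumptions, not the script itself: the JSON keys (`results`, `next`, `id`, `title`, `url`), the `collect_courses` name, and the `fetcher:` parameter are hypothetical, and a local hash stands in for the “@courses” instance variable.

```ruby
require 'json'
require 'net/http'
require 'uri'

# Default fetcher: performs a plain GET and returns the response body.
DEFAULT_FETCHER = ->(url) { Net::HTTP.get(URI.parse(url)) }

# Follows "next" links until none is left and collects the courses in a hash.
# The JSON keys used here are assumptions about the payload shape,
# not a documented API.
def collect_courses(first_url, fetcher: DEFAULT_FETCHER)
  courses = {}
  url = first_url
  while url
    data = JSON.parse(fetcher.call(url))
    data.fetch('results', []).each do |course|
      courses[course['id']] = { title: course['title'], url: course['url'] }
    end
    url = data['next'] # nil on the last page, which ends the loop
  end
  courses
end

# Usage with canned pages instead of real HTTP calls:
pages = {
  'page1' => { 'results' => [{ 'id' => 1, 'title' => 'Ruby Basics', 'url' => '/ruby' }],
               'next' => 'page2' }.to_json,
  'page2' => { 'results' => [{ 'id' => 2, 'title' => 'Linux 101', 'url' => '/linux' }],
               'next' => nil }.to_json
}
courses = collect_courses('page1', fetcher: ->(url) { pages.fetch(url) })
puts courses.size
```

Passing the fetcher in as a lambda keeps the loop testable without touching the network; in a real run it would be replaced by a call that sends the browser-like request.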

In this post I’ve tried to keep the script short and straightforward to make it easier to understand. In the real application (this blog’s page), there’s some additional work to persist the list in a database, from which it can be retrieved later. The inspection method used here to get these objects can be extended to any platform that allows searching/filtering elements. If you have any questions or suggestions, please use the comment area or contact me.
