Udemy: Mining course list with a crawler in Ruby

Posted on January 11th, 2018

Udemy is one of the most popular online course platforms. Personally, I like it more because of its usability (both web and mobile), lifelong access and it also has some really good paid courses that occasionally becomes available for free. Recently I added a free courses page to my blog, which has Udemy as first source for the course list. I’m gonna show in this post how to implement the script to gather the free course list.

In some previous posts, we saw some similar applications in Ruby using Nokogiri and Mechanize. In this case, however, those libraries weren’t effectives and the requests returned Bad Request errors (400). So I needed to go deeper and analyze HTTP request trace on web browser’s console, inspect cookies and headers to reproduce the same browsing flow. Even so, it still gave me Bad Request errors.

Doing some new tests, I managed to do the same request on Linux console using curl (web browsers already have a debug function to get the request as a curl command). This time, it worked: it returned a course list in JSON format. So I decided to use Ruby’s default HTTP library to reproduce the request sent with curl (there’s a online service that converts a curl command to Ruby code):

THE SOURCE CODE

#! /usr/bin/env ruby
require 'open-uri'
require 'net/http'
require 'uri'
require 'json'

@courses = []

def retrieve_courses(url="/api-2.0/channels/1646/courses?is_angular_app=true&lang=en&price=price-free&p=0")

  uri = URI.parse("https://www.udemy.com#{url}")
  request = Net::HTTP::Get.new(uri)
  request["Host"] = "www.udemy.com"
  request["User-Agent"] = "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:57.0) Gecko/20100101 Firefox/57.0"
  request["Accept"] = "application/json, text/plain, */*"
  request["Accept-Language"] = "en-US,en;q=0.5"
  request["Referer"] = "https://www.udemy.com/courses/it-and-software/all-courses/?price=price-free"
  request["X-Requested-With"] = "XMLHttpRequest"
  request["Dnt"] = "1"
  request["Connection"] = "keep-alive"

  req_options = {
    use_ssl: uri.scheme == "https"
  }

  response = Net::HTTP.start(uri.hostname, uri.port, req_options) do |http|
    http.request(request)
  end

  course_data = JSON.parse(response.body)

  courses = []

  course_data["results"].each do |course|
    @courses << {
      id: course["id"],
      title: course["title"],
      headline: course["headline"],
      num_lectures: course["num_published_lectures"],
      time: course["content_info"].split[0],
      url: course["url"],
      image_thumb: course["image_125_H"],
      image_full: course["image_480x270"],
      subscribers: course["num_subscribers"],
      rating: course["avg_rating_recent"],
      num_reviews: course["num_reviews"],
      captions: course["caption_languages"].join(","),
      language: course["locale"]["title"]
    }
  end

  return (course_data["pagination"]["next"] ? course_data["pagination"]["next"]["url"] : nil)

end

url = retrieve_courses
while url
  url = retrieve_courses(url)
end

puts @courses

#! /usr/bin/env ruby

require 'open-uri'

require 'net/http'

require 'uri'

require 'json'

@courses = []

def retrieve_courses(url="/api-2.0/channels/1646/courses?is_angular_app=true&lang=en&price=price-free&p=0")

uri = URI.parse("https://www.udemy.com#{url}")

request = Net::HTTP::Get.new(uri)

request["Host"] = "www.udemy.com"

request["User-Agent"] = "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:57.0) Gecko/20100101 Firefox/57.0"

request["Accept"] = "application/json, text/plain, */*"

request["Accept-Language"] = "en-US,en;q=0.5"

request["Referer"] = "https://www.udemy.com/courses/it-and-software/all-courses/?price=price-free"

request["X-Requested-With"] = "XMLHttpRequest"

request["Dnt"] = "1"

request["Connection"] = "keep-alive"

req_options = {

use_ssl: uri.scheme == "https"

}

response = Net::HTTP.start(uri.hostname, uri.port, req_options) do |http|

http.request(request)

end

course_data = JSON.parse(response.body)

courses = []

course_data["results"].each do |course|

@courses << {

id: course["id"],

title: course["title"],

headline: course["headline"],

num_lectures: course["num_published_lectures"],

time: course["content_info"].split[0],

url: course["url"],

image_thumb: course["image_125_H"],

image_full: course["image_480x270"],

subscribers: course["num_subscribers"],

rating: course["avg_rating_recent"],

num_reviews: course["num_reviews"],

captions: course["caption_languages"].join(","),

language: course["locale"]["title"]

}

end

return (course_data["pagination"]["next"] ? course_data["pagination"]["next"]["url"] : nil)

end

url = retrieve_courses

while url

url = retrieve_courses(url)

end

puts @courses

Explaining most important points:

The URL used as argument in that function was obtained inspecting the browser console (it’s highly recommended to explore along the request to understand the HTTP protocol better). That URL contains the path for IT category (each one has its own) and the page to be retrieved from the list (initialized as zero for the first page).
The first lines of the code (up to line 27) was obtained converting the HTTP request inspected to Ruby code library, using the service I mentioned before.
Since the response is a JSON, I used the default JSON library for Ruby to parse the data to a local variable. I used the instance variable “@courses” to keep all results in a hash containing the data I needed (lines 34-50)
The responses are paginated, which means you can’t get all the list at once and have to request page to page to get the whole data. Each response has the link to the next page to make it easier. The function for gathering data is called while the response still has a new page to be retrieved.

In this post I’ve tried to keep the script short and straight to make it easier to understand. Anyway, in the real application (this blog’s page), there’s some additional work to persist that list in a database where it can be retrieved lately. The inspection method used here to get these objects can be extended to any platform that allows searching/filtering elements. If you have any doubts or suggestions, please use the comment area or contact me.

Ronan Lopes

CTO at FIT Energia. Msc in Computer Science from the Federal University of São João del-Rei (UFSJ). Took a specialization course in Data Science & Big Data at PUC Minas . Linux enthusiast and supporter of open-source software. A buddhist who learns from Lama Padma Samten. In his spare time he draws something, solves magic cubes and enjoys some nice beers with his friends.

1 Comment

Sannytet

December 11th, 2018 at 23:40

Nice posts! 🙂
___
Sanny

Udemy: Mining course list with a crawler in Ruby

THE SOURCE CODE

Ronan Lopes

You Might Also Like

Tweepy: Tweet Collector in Python

Tweepy: Retrieving Trending Topics on Twitter for a specific location

Threads in Ruby: sending parallel HTTP requests

1 Comment

Sannytet

Leave a Reply Cancel Reply