Most computer programs run in a single thread (one sequential execution flow). But a program can be split into several threads that execute concurrently, which can improve performance (especially on multi-core CPUs, where those threads can really run simultaneously). In this post, I'm going to show how to parallelize a Ruby web crawler with threads.
In the standard Ruby interpreter (MRI), the threads fired by a program can only take advantage of one CPU core because of the Global Interpreter Lock (GIL), which ensures that only one thread runs Ruby code at a time. Even so, you can improve performance in cases where your code makes multiple I/O calls (like HTTP requests to an external API, for example), which block the program while it waits for a response before making a new request. When you use multiple threads, you can fire another request while waiting for a response, reducing the real execution time.
[Figure: Execution flow with the Global Interpreter Lock]
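To make the I/O argument concrete, here is a minimal sketch that compares sequential and threaded "requests". Note that `fake_request` is a hypothetical stand-in I made up for this illustration: it just sleeps, which releases the interpreter lock in much the same way as waiting on a socket does. The delay and the list of ids are arbitrary.

```ruby
require 'benchmark'

# Hypothetical stand-in for a blocking HTTP request (not a real API call).
# While a thread sleeps, other threads are free to run.
def fake_request(id, delay = 0.2)
  sleep(delay)
  "response #{id}"
end

# One request after another: total time is roughly the sum of the delays.
def fetch_sequentially(ids)
  ids.map { |id| fake_request(id) }
end

# One thread per request: the waits overlap.
# Thread#value waits for the thread and returns the result of its block.
def fetch_with_threads(ids)
  ids.map { |id| Thread.new { fake_request(id) } }.map(&:value)
end

ids = [1, 2, 3, 4]
sequential = Benchmark.realtime { fetch_sequentially(ids) }
threaded   = Benchmark.realtime { fetch_with_threads(ids) }

puts format("sequential: %.2fs", sequential)
puts format("threaded:   %.2fs", threaded)
```

On my understanding, the threaded version should take about as long as a single request, while the sequential one takes about four times that. For CPU-bound work you should not expect this kind of gain under the GIL.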
When you code in this style, you should be aware of the concurrent execution flow. You have no guarantee about the order in which threads execute, which may cause race conditions: situations where you get unexpected results depending on that order. For our web crawler, each request runs independently, so that's not a problem. The block of code that retrieves a blog post is the one that will be fired in each thread.
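As an illustration of what a race condition looks like, here is a small sketch that is not part of the crawler. Both methods and the counter are hypothetical names used only for this example. The unsafe version reads a shared value, lets other threads run, and then writes back a stale result; the safe version does the update inside a `Mutex`.

```ruby
# Unsafe: each thread reads the counter, pauses, then writes back.
# Threads that read the same value overwrite each other's updates.
def unsafe_count(n)
  counter = 0
  threads = Array.new(n) do
    Thread.new do
      current = counter
      sleep(0.001)          # simulated I/O; other threads run here
      counter = current + 1
    end
  end
  threads.each(&:join)
  counter
end

# Safe: the read-modify-write happens inside a lock,
# so only one thread updates the counter at a time.
def safe_count(n)
  counter = 0
  lock = Mutex.new
  threads = Array.new(n) do
    Thread.new do
      sleep(0.001)          # simulated I/O happens outside the lock
      lock.synchronize { counter += 1 }
    end
  end
  threads.each(&:join)
  counter
end

puts "unsafe: #{unsafe_count(20)}"  # often far less than 20
puts "safe:   #{safe_count(20)}"
```

The crawler only appends to a shared `posts` array, which in practice tends to behave under MRI, but if you ever do more than a single append on shared data, a lock like this (or a `Queue`) is the safer choice.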
THE SOURCE CODE
#!/usr/bin/env ruby

require 'nokogiri'
require 'open-uri'
require 'thread'
require 'thwait' # ThreadsWait; a separate gem since Ruby 2.7 (gem install thwait)

categorias_urls = [
  "http://www.acasadocogumelo.com/search/label/Mario",
  "http://www.acasadocogumelo.com/search/label/Pok%C3%A9mon",
  "http://www.acasadocogumelo.com/search/label/Donkey%20Kong",
  "http://www.acasadocogumelo.com/search/label/Zelda"
]

threads_posts = []
posts = []

categorias_urls.each do |url|
  # Fetch and parse the HTML document
  # (URI.open replaces the bare open(url), which no longer opens URLs on Ruby 3+)
  doc = Nokogiri::HTML(URI.open(url))

  # Fetch the post list (title and link) for this category
  posts_summary = doc.css(".post-summary").search("strong").search("a").map do |a|
    [a.attributes["title"].value, a.attributes["href"].value]
  end

  # Get each post's information, one thread per post
  posts_summary.each do |post|
    threads_posts << Thread.new {
      titulo_post = post[0]
      link_post = post[1]

      # Renamed from `post` so the block parameter is not overwritten
      post_doc = Nokogiri::HTML(URI.open(link_post))
      post_body = post_doc.css(".post-body.entry-content").first.inner_html

      posts << {titulo: titulo_post, link: link_post, conteudo: post_body, categoria: url}
    }
  end

  # Wait for all threads to finish before going on
  ThreadsWait.all_waits(*threads_posts)
end

puts posts
Only a few changes were made to the original source code. At the top, two libraries are required: thread and thwait (which lets you set a point where the program waits for all threads to finish before continuing). Every thread created is pushed into an array, which is then passed as the argument to the waiting function. In this case, I keep an array with all the collected posts to show that sometimes you need all threads to be finished before going on, to make sure the array is fully populated. If you were storing those posts in a database instead, for example, you wouldn't need the results in memory, although the script still has to stay alive until the threads finish, since Ruby kills any running threads when the main thread exits.
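If you would rather not depend on thwait, the same waiting point can be sketched with `Thread#join`, which is built into Ruby. This is an alternative to the code above, not what the crawler uses. `fetch_post` is a hypothetical stand-in for downloading and parsing one post, and the links are made up.

```ruby
# Hypothetical stand-in for downloading and parsing one post.
def fetch_post(link)
  sleep(0.05) # simulated network wait
  { link: link, conteudo: "body of #{link}" }
end

def fetch_all(links)
  threads = links.map { |link| Thread.new { fetch_post(link) } }
  threads.each(&:join)   # block here until every thread has finished
  threads.map(&:value)   # each thread's result, in the order of `links`
end

posts = fetch_all(["/post-1", "/post-2", "/post-3"])
puts posts.map { |post| post[:link] }.inspect
# prints ["/post-1", "/post-2", "/post-3"]
```

Because each thread returns its own result, nothing is appended to a shared array, and the results come back in input order no matter which thread finishes first.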
You could also, in this same example, parallelize the requests for the category pages, but I preferred to keep it simple. As the list grows, the performance improvement becomes worth it. There are other libraries/gems for working with threads and parallel computing that may be more suitable depending on your application. If you have any questions or suggestions, please use the comment area or contact me.
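For reference, one way the category requests could be parallelized is with a small pool of worker threads reading from a `Queue`, so the number of simultaneous requests stays bounded as the list grows. This is only a sketch under my own assumptions: `fetch_category` is a hypothetical stand-in that returns fake post links, and the `workers:` parameter is a name I chose for the example.

```ruby
# Hypothetical stand-in for fetching a category page and extracting its post links.
def fetch_category(url)
  sleep(0.05) # simulated network wait
  ["#{url}/post-a", "#{url}/post-b"]
end

def crawl_categories(urls, workers: 2)
  jobs = Queue.new
  urls.each { |url| jobs << url }
  jobs.close               # once closed and empty, pop returns nil

  results = Queue.new      # Queue is thread-safe, so no Mutex is needed

  pool = Array.new(workers) do
    Thread.new do
      # Each worker keeps taking URLs until the queue is drained
      while (url = jobs.pop)
        results << [url, fetch_category(url)]
      end
    end
  end
  pool.each(&:join)

  # Turn the [url, links] pairs into a hash keyed by category URL
  Array.new(results.size) { results.pop }.to_h
end

links_by_category = crawl_categories(["/label/Mario", "/label/Zelda", "/label/Metroid"])
links_by_category.each { |url, links| puts "#{url}: #{links.size} posts" }
```

The order in which categories finish is not guaranteed, which is why the results are collected into a hash rather than relied on positionally.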