percolated-twitter.rb |
|
|---|---|
Reversed or “Real Time” Search in ElasticSearch |
|
|
You may have come across the term “realtime search” lately (eg. here) and wondered what all the fuss is about. Well, the usual workflow with search engines goes like this:
In ElasticSearch, there’s a default delay of 1 second between indexing a document and being able to search it. You can, of course, force refresh the index. That’s enough “realtime” in my book, if you ask me. However, ElasticSearch comes with a much more powerful idea for “realtime seach”, called percolation. It reverses the usual workflow like this:
The search engine will match your documents against your queries, and return names of matching queries in the response. Yes, names. This allows you to do crazy stuff immediately, while your documents are being indexed. Think Google Alerts for your data. The Tire gem for ElasticSearch recently got percolation support). In this script, we’ll register couple of queries, and then fetch data from Twitter, receiving notifications when any status message matches one of our queries. You can download this script at https://gist.github.com/1025498. |
|
|
First of all, let’s install some gems:
Note that you need the 0.1.11 version of the |
require 'rubygems'
require 'time'
require 'uri'
require 'tire'
require 'ansi/code'
require 'yajl/http_stream'
include ANSI::Code |
|
Let’s define a class to hold our data in ElasticSearch. We’re using the persistence mode of Tire, so we don’t need a database, because the index is our database. |
class Status
include Tire::Model::Persistence
property :user
property :text
property :created_at |
|
Let’s define callback for percolation.
Whenewer a new document is saved in the index, this block will be executed,
and we will have access to matching queries in the In our case, we will just print the list of matching queries. |
on_percolate do
puts green { "'#{text}' from @#{user} matches queries: #{matches.inspect}" } unless matches.empty?
end
end |
|
Let’s register the queries for percolation now. First, let’s define the query_string queries. |
q = {}
q[:newspeak] = 'wow omg lol wtf fuu*'
q[:fail] = 'fail'
q[:memes] = '"why u no" "all your base" "i can has"' |
|
Second, let’s save those queries in ElasticSearch. |
Status.index.register_percolator_query('newspeak') { |query| query.string q[:newspeak] }
Status.index.register_percolator_query('fails') { |query| query.string q[:fail] }
Status.index.register_percolator_query('memes') { |query| query.string q[:memes] }
puts "", bold { "Testing percolation" }, '-'*80 |
|
Let’s check out the percolation on some “example data”. |
status = Status.new :text => 'OMG i can has #fail'
puts "'#{status.text}' matches queries #{status.percolate.inspect}" |
|
Now, let’s fetch some real data from the collective consciousness. |
url = URI.parse("http://api.twitter.com/1/statuses/public_timeline.json")
puts "", bold { "Fetching '#{url}'" }, '-'*80
5.times do |i| |
|
Get JSON data from Twitter. |
Yajl::HttpStream.get(url, :symbolize_keys => true) do |timeline|
timeline.each do |status| |
|
Create new document from each status message. Watch the output from the |
Status.create :id => status[:id],
:user => status[:user][:screen_name],
:text => status[:text],
:created_at => Time.parse(status[:created_at])
end
puts "Indexed #{timeline.size} tweets"
end
sleep 10 if i < 4
end
puts "", bold { "Check out your index" }, '-'*80 |
|
You can check out the the documents in your index with |
puts "curl 'http://localhost:9200/statuses/_search?q=*&sort=created_at:desc&size=5&pretty=true'", "" |