Monday, December 10, 2012

Big data @ Foursquare

Some time back I attended a meetup at Foursquare's office about how they use big data. If you want to know what technology they use, this post is for you.

The talk was given by Blake Shaw, a data scientist at Foursquare, and Joe Crobak, an engineer there.

Foursquare has 20M+ users and records about 1,500 actions per second. Blake talked about mining signals from check-ins and building a social recommendation engine on top of them (they call it Explore).

He talked about what a place means to Foursquare: not just a physical entity, but somewhere people meet and interact within a certain timeframe.

Blake showed a map created by millions of people checking in at one location: a map of Central Park drawn entirely from check-ins!

They collect, analyze, and report on time signatures for places, i.e., how a venue's check-in activity is distributed over the hours of the day and days of the week.
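
To make "time signature" concrete, here is a minimal sketch of how such a profile could be computed (my own illustration, not Foursquare's code): bucket a venue's check-in timestamps by hour of day and normalize into a distribution. The CheckIn case class, its fields, and the New York timezone are all assumptions.

```scala
import java.time.{Instant, ZoneId}

// Hypothetical check-in record: a venue id plus a timestamp.
case class CheckIn(venueId: String, time: Instant)

object TimeSignature {
  // Fraction of a venue's check-ins that fall into each hour of the day (0-23).
  def hourly(checkIns: Seq[CheckIn],
             zone: ZoneId = ZoneId.of("America/New_York")): Map[Int, Double] = {
    require(checkIns.nonEmpty, "need at least one check-in")
    val counts = checkIns
      .groupBy(c => c.time.atZone(zone).getHour)
      .map { case (hour, cs) => hour -> cs.size.toDouble }
    val total = counts.values.sum
    counts.map { case (hour, n) => hour -> n / total }
  }
}
```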

By mining the check-in data, they can discover correlations between external events and activity. For example, hot weather is correlated with a spike in check-ins near ice cream shops, so they can recommend places to go based on weather patterns.
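
As an illustration of that kind of analysis (again, not their pipeline, and the numbers are made up), the sketch below computes a Pearson correlation between daily temperature and daily check-in counts near ice cream shops.

```scala
object WeatherCorrelation {
  // Pearson correlation coefficient between two equal-length series.
  def pearson(xs: Seq[Double], ys: Seq[Double]): Double = {
    require(xs.size == ys.size && xs.nonEmpty, "series must be non-empty and equal length")
    val n  = xs.size
    val mx = xs.sum / n
    val my = ys.sum / n
    val cov = xs.zip(ys).map { case (x, y) => (x - mx) * (y - my) }.sum
    val sx  = math.sqrt(xs.map(x => (x - mx) * (x - mx)).sum)
    val sy  = math.sqrt(ys.map(y => (y - my) * (y - my)).sum)
    cov / (sx * sy)
  }

  def main(args: Array[String]): Unit = {
    // Made-up daily highs (F) and check-in counts near ice cream shops.
    val temps    = Seq(58.0, 65.0, 72.0, 80.0, 88.0, 91.0)
    val checkIns = Seq(110.0, 150.0, 210.0, 320.0, 450.0, 500.0)
    println(f"correlation = ${pearson(temps, checkIns)}%.2f")
  }
}
```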

They also analyze sentiment, which they call happiness on Foursquare: the comments people leave with their check-ins are classified as positive or negative.
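
The talk didn't describe the actual sentiment model, so the toy sketch below only shows the shape of the task: score each check-in comment against small positive and negative word lists. The word lists and comments are invented.

```scala
object HappinessScore {
  private val positive = Set("love", "great", "awesome", "delicious", "happy")
  private val negative = Set("hate", "terrible", "awful", "slow", "dirty")

  // +1 per positive word, -1 per negative word; the sign gives the label.
  def score(comment: String): Int = {
    val words = comment.toLowerCase.split("\\W+")
    words.count(positive.contains) - words.count(negative.contains)
  }

  def main(args: Array[String]): Unit = {
    Seq("Love the coffee here, awesome staff",
        "Terrible service and slow wifi")
      .foreach(c => println(s"$c -> ${score(c)}"))
  }
}
```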

Blake uses algorithms for finding similar items, which are critical for their recommendation engine. Large-scale sparse k-nearest neighbors is one of the main algorithms.

He also computes venue similarity for recommendation purposes.
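
The talk didn't spell out the exact formulation, so here is one plausible, purely illustrative version of the idea: represent each venue as a sparse vector of per-user check-in counts, compare venues with cosine similarity, and keep the top-k neighbors. A brute-force scan like this obviously would not scale to millions of venues; that is where the "large sparse" machinery and approximate methods come in.

```scala
object VenueSimilarity {
  // Sparse vector: user id -> number of check-ins at the venue.
  type SparseVec = Map[String, Double]

  def dot(a: SparseVec, b: SparseVec): Double =
    a.collect { case (k, v) if b.contains(k) => v * b(k) }.sum

  def norm(a: SparseVec): Double = math.sqrt(a.values.map(v => v * v).sum)

  def cosine(a: SparseVec, b: SparseVec): Double = {
    val d = norm(a) * norm(b)
    if (d == 0.0) 0.0 else dot(a, b) / d
  }

  // Brute-force k-nearest venues to a target vector by cosine similarity.
  def kNearest(target: SparseVec, venues: Map[String, SparseVec], k: Int): Seq[(String, Double)] =
    venues.toSeq
      .map { case (id, vec) => id -> cosine(target, vec) }
      .sortBy(-_._2)
      .take(k)
}
```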

Blake also showed an excellent visualization of historical check-ins on a world map at different points in time, which he created with MATLAB.

Joe Crobak talked about the Hadoop infrastructure in detail.

The infrastructure handles ~1.5B log events per week.

Infrastructure components:

Cloudera CDH3u3
12-node Hadoop cluster on EC2
Hive
Pig
Solr
MongoDB
Oozie
Scala
Google Spreadsheets - automated reporting
Scoobi - Hadoop MapReduce connector for Scala (see the sketch below)
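
I can't reproduce their actual Scoobi jobs, so here is a plain-Scala stand-in that shows the shape of the kind of log aggregation such a pipeline does: parse log lines, group by event type, and count. In their stack this would run as a distributed Scoobi/Hadoop job over HDFS rather than over an in-memory iterator, and the tab-separated log format here is invented.

```scala
object LogEventCounts {
  // Hypothetical log line format: "<timestamp>\t<eventType>\t<userId>"
  def eventType(line: String): Option[String] =
    line.split("\t") match {
      case Array(_, evt, _) => Some(evt)
      case _                => None // skip malformed lines
    }

  // "Map" phase: extract event types; "reduce" phase: count per type.
  def countByEvent(lines: Iterator[String]): Map[String, Long] =
    lines.flatMap(eventType).foldLeft(Map.empty[String, Long]) { (acc, evt) =>
      acc.updated(evt, acc.getOrElse(evt, 0L) + 1L)
    }

  def main(args: Array[String]): Unit = {
    val sample = Iterator(
      "2012-12-10T09:15:00\tcheckin\tu1",
      "2012-12-10T09:15:01\tsearch\tu2",
      "2012-12-10T09:15:02\tcheckin\tu3")
    countByEvent(sample).foreach { case (evt, n) => println(s"$evt\t$n") }
  }
}
```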

They started with Elastic MapReduce (EMR), but ran into a few limitations and moved to their own Hadoop cluster on AWS. One limitation was that people run Hive queries very frequently, so they needed a Hadoop cluster that is up all the time.

Overall, they have been successfully using big data to drive insights from petabytes of data.
