Cristian Guajardo Garcia


The Big Data platform


What is it:

The use of huge amounts of otherwise “unused” data (structured, unstructured and semi-structured) to feed BI (business intelligence); you know, the right decision made by the right people at the right time.

How is it done:

Hadoop is the name that comes to mind most naturally (probably for us all). However, I have learned that the whole is more than the sum of its parts. Hadoop is an essential part of the Big Data platform, but still, it is just a part.

So, how does it look? (and how does it work?)

So, the first part has to do with the input of data and the way it arrives (structured or not; at rest or in real time). Once this data is flowing, you will have to select built-for-purpose engines. BigInsights by IBM works using some sort of “app store” made of open source engines designed to solve specific business needs.

Let’s take Hadoop, for instance. It works by absorbing huge amounts of data, which it divides up using HDFS, its distributed “working mode” (for lack of a better word). When performing this task, Hadoop uses several computers, each handling a small bucket of the data. The distributed file system allows you to analyze the whole universe and not just a sample (which is how it has historically been done). Let’s think of Twitter: when you use the API you are collecting only 1% of the mentions, but when you use the firehose you get unstructured (or semi-structured) real-time data that takes your analysis to a whole new level, since you have access to all the mentions.
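To make that concrete, here is a minimal, self-contained sketch in Python of the split/map/reduce idea (the classic word count). The tweets and the two pretend “nodes” are my own invention; a real Hadoop job would run the same logic across actual machines.

    # A toy version of the MapReduce idea behind Hadoop: split the input
    # into buckets, map each bucket independently (as if on its own
    # machine), then shuffle and reduce. The tweets are made up.
    from collections import defaultdict

    tweets = [
        "big data is more than hadoop",
        "hadoop splits big files into blocks",
        "blocks are processed in parallel",
    ]

    # 1. Split: HDFS would spread these lines across several machines.
    buckets = [tweets[0::2], tweets[1::2]]   # pretend we have 2 nodes

    # 2. Map: each node emits (word, 1) pairs for its own bucket only.
    mapped = [(word, 1) for bucket in buckets
              for line in bucket for word in line.split()]

    # 3. Shuffle + reduce: pairs with the same word are summed together.
    counts = defaultdict(int)
    for word, n in mapped:
        counts[word] += n

    print(sorted(counts.items(), key=lambda kv: -kv[1])[:3])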

And then, it’s time for governance and information integration. You have to validate the veracity of the data (which, along with velocity, variety and volume, makes up the four V’s of Big Data) and analyze the results (this is when Watson kicks our human butts).
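As a toy illustration of that veracity check, a governance step can be as simple as dropping malformed records before they reach the analysis stage. The records and the required fields below are invented for the example:

    # Toy veracity check: keep only records that carry the fields we need.
    records = [
        {"name": "cristian", "country": "chile", "age": 35},
        {"name": "anna", "country": None, "age": 28},   # missing country
        {"country": "uk", "age": "unknown"},            # no name, bad age
    ]

    def is_valid(record):
        return (bool(record.get("name"))
                and bool(record.get("country"))
                and isinstance(record.get("age"), int))

    clean = [r for r in records if is_valid(r)]
    print(clean)   # only the first record survives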

What is so interesting is the way all of this works. Let’s take SQL, “the Excel file” as we know it. And at the same time, let’s say Excel files are no longer enough.

Let's pretend you have an Excel file named “people” (that is why you are not an artist) and there you have the data structured as follows: “name”, “surname”, “nationality”, “age”, “profession” and so on. Let’s suppose you have 5,000 rows of structured data. But what happens when that file has to live alongside other files whose structure has nothing to do with it? Then you have NoSQL: non-related files will start mixing up. Let’s suppose now you have another Excel file with 10,000 rows of info about cars, which goes something like: “brand”, “model”, “color”, “year”, “motor” and so on. And a third one! With consuming habits: “supermarket”, “retail”, “online”, “cash”, “debit card” and some other semi-structured data.
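For the structured side of that picture, here is a minimal sketch using SQLite from Python; the table, the columns and the one row are just the made-up “people” file from above:

    # The structured, SQL side: the "people" file as a table where every
    # row has the same fixed columns. The row itself is invented.
    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE people "
                 "(name TEXT, surname TEXT, nationality TEXT, age INTEGER)")
    conn.execute("INSERT INTO people VALUES "
                 "('cristian', 'guajardo', 'chile', 35)")

    for row in conn.execute("SELECT name, nationality FROM people"):
        print(row)   # ('cristian', 'chile')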

What do you do with it? You ask JSON for help.

So, you have the “people” (and every other) file, which will go something like this:

    people = {"name": "cristian", "surname": "guajardo", "country": "chile"}
    cars   = {"brand": "toyota", "model": "prius", "engine": "hybrid"}
    buying = {"supermarket": "penny", "creditcard": "visa", "preferred": "cash"}

And then, we could eventually mix them all into a single document: {"people": people, "cars": cars, "buying": buying}

So eventually, you will have documents inside of documents, with so much information ready to be processed. At the end, you could segment men with blue Jeeps and credit card buying habits. Nice, huh?
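Here is a minimal sketch of that segmentation, assuming a small list of merged people/cars/buying documents that I made up for illustration:

    # Toy segmentation over merged JSON-style documents. The records are
    # invented; the point is that nested, mixed data can still be queried.
    customers = [
        {"people": {"name": "cristian", "gender": "male"},
         "cars":   {"brand": "jeep", "color": "blue"},
         "buying": {"preferred": "creditcard"}},
        {"people": {"name": "anna", "gender": "female"},
         "cars":   {"brand": "toyota", "color": "red"},
         "buying": {"preferred": "cash"}},
    ]

    segment = [c["people"]["name"] for c in customers
               if c["people"]["gender"] == "male"
               and c["cars"]["brand"] == "jeep"
               and c["cars"]["color"] == "blue"
               and c["buying"]["preferred"] == "creditcard"]

    print(segment)   # ['cristian']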

Ok, if I made a mistake, before calling all the neighbors and running here to burn me alive, check this video, which puts the whole definition quite nicely in a very simple way.

That would be using “unrelated” and unused information. That would be putting some brains into your data.

Other tools to increase the power of your big data efforts? Here you go.

  • Eclipse: popular integrated development environment (IDE).
  • Lucene: text search engine library written in Java.
  • HBase: the Hadoop database.
  • Hive: provides data warehousing tools to extract, transform and load data, and then query this data stored in Hadoop files (see the sketch after this list).
  • Pig: high-level language that generates MapReduce code to analyze large data sets.
  • Jaql: query language for JavaScript Object Notation (JSON).
  • ZooKeeper: centralized configuration service and naming registry for large distributed systems.
  • Avro: data serialization system.
  • UIMA: architecture for the development, discovery, composition and deployment of components for the analysis of unstructured data.
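Since Hive is the one most people meet first, here is a minimal sketch of querying it from Python. It assumes the PyHive package, a Hive server running on localhost:10000, and the made-up “people” table from earlier, so treat it as a shape, not a recipe:

    # Querying a Hive table from Python with PyHive. Assumes a Hive
    # server at localhost:10000 and the invented "people" table; adjust
    # host, port and table for a real cluster.
    from pyhive import hive

    conn = hive.connect(host="localhost", port=10000)
    cursor = conn.cursor()
    cursor.execute(
        "SELECT nationality, COUNT(*) AS n FROM people GROUP BY nationality"
    )
    for nationality, n in cursor.fetchall():
        print(nationality, n)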

Hope the whole thing is clearer. I will come back soon with another post.

Cristian Guajardo Garcia (cc) by-nc-sa | Made in London, UK |  2005 - 2017