Big Data Analytics using Hadoop

Big Data

Big data is a term used to refer to data that is so large and complex that it cannot be stored or processed using traditional database software.

Big Data Analytics

Big data analytics is the process of analyzing big data to discover useful information.

Problems with Big Data

  • Storing exponentially growing large volumes of data (E.g., Social media data)
  • Processing data having complex structure (unstructured data)
  • Processing data faster

Apache Hadoop

Apache Hadoop is an open-source framework that allows us to store and process big data in a parallel and distributed fashion on computer clusters.

Hadoop has three main components.

  • Hadoop Distributed File System (HDFS – storage)

    It allows us to store any kind of data across the cluster (see the example commands after this list).

  • Map / Reduce (Processing)

    It allows parallel processing of the data stored in HDFS.

  • YARN

    It is the resource management and job scheduling technology for distributed processing.
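
As a quick illustration of the HDFS component, files in HDFS are managed with shell-style commands. The paths below are hypothetical; the same commands appear in the demos that follow.

dsu@master:~$ hadoop fs -ls /                      # list the HDFS root directory
dsu@master:~$ hadoop fs -mkdir /demo               # create a directory in HDFS
dsu@master:~$ hadoop fs -put localfile.txt /demo   # copy a local file into HDFS
dsu@master:~$ hadoop fs -cat /demo/localfile.txt   # print the file content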

Running Map / Reduce Programs in Java

Hadoop was originally written to support Map / Reduce programming in Java, but it also allows writing Map / Reduce programs in any language through a utility called Hadoop Streaming. Let’s test some Hadoop Map / Reduce programs on the Data Science Unit (DSU) cluster.

Word Count Program (Java)

It reads a text file and counts how often words occur.
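
For instance, if sample_text.txt contained the following two lines (a hypothetical input),

hello hadoop
hello big data

the word count output would pair each distinct word with its count:

big	1
data	1
hadoop	1
hello	2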

Let’s run a word count program (WordCount.java) on the DSU Hadoop cluster.


Steps to run the program are given below:

  • Get a text file (sample_text.txt) and save it on the Desktop

  • Put the file in the HDFS.

    dsu@master:~$ hadoop fs -put /home/dsu/Desktop/sample_text.txt /user/WordCount

    Here, /user/WordCount is the directory in HDFS where your file will be stored.

  • Create a jar file for WordCount.java and save it on the Desktop

    There are several ways to create a jar file (e.g., using Eclipse); you may also find details online. A command-line sketch is given after these steps.

  • Run the program

    dsu@master:~$ hadoop jar Desktop/wordcount.jar /user/WordCount/sample_text.txt /user/WordCount/wordcount_java

    The output will be stored inside the “/user/WordCount/wordcount_java” directory.

  • View the output

    dsu@master:~$ hadoop fs -cat /user/WordCount/wordcount_java/part-r-00000

    By default, Hadoop names the output file as part-r-00000.
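
As referenced in the steps above, a command-line sketch for preparing the HDFS directory and building the jar is shown below. This is only one hedged way of doing it; directory and file names follow the examples above and should be adjusted to your setup (Eclipse works just as well for building the jar).

dsu@master:~$ hadoop fs -mkdir -p /user/WordCount          # create the HDFS directory if it does not exist
dsu@master:~$ hadoop fs -ls /user/WordCount                # verify that sample_text.txt was uploaded
dsu@master:~$ mkdir wordcount_classes
dsu@master:~$ javac -classpath "$(hadoop classpath)" -d wordcount_classes WordCount.java
dsu@master:~$ jar -cvfe Desktop/wordcount.jar wordcount.WordCount -C wordcount_classes .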


The word count program in Java is given below.


package wordcount;
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.fs.Path;

public class WordCount {
    public static class Map extends Mapper<LongWritable, Text, Text, IntWritable> {
   	 
   	 // Emits (word, 1) for every token in the input line
   	 public void map(LongWritable key, Text value, Context context) throws
   	 IOException, InterruptedException
   	 {
   		 String line = value.toString();
   		 StringTokenizer tokenizer = new StringTokenizer(line);
   		 while (tokenizer.hasMoreTokens())
   		 {
   			 value.set(tokenizer.nextToken());
   			 context.write(value, new IntWritable(1));
   		 }
   	 }
    }
    
    public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable>{
   	 
   	 // Sums the counts emitted by the mappers for each word
   	 public void reduce (Text key, Iterable<IntWritable> values, Context context)
   	 throws IOException, InterruptedException {
   		 int sum=0;
   		 for (IntWritable x: values)
   		 {
   			 sum += x.get();
   		 }
   		 context.write(key, new IntWritable(sum));
   	 }
    }
    
    public static void main (String [] args) throws Exception {
   	 
   	 //Reads the default configuration of cluster from the configuration xml files
   	 Configuration conf = new Configuration();
   	 
   	 //Initializing the job with the default configuration of the cluster
   	 Job job = Job.getInstance(conf, "WordCount");
   	 
   	 //Assigning the driver class name
   	 job.setJarByClass(WordCount.class);
   	 
   	 
   	 //Defining the mapper class name
   	 job.setMapperClass(Map.class);
   	 
   	 //Defining the reducer class name
   	 job.setReducerClass(Reduce.class);
   	 
   	 //Key type coming out of mapper
   	 job.setMapOutputKeyClass(Text.class);
   	 
   	 //value type coming out of mapper
   	 job.setMapOutputValueClass(IntWritable.class);
   	 
   	 //Defining input Format class which is responsible to parse the dataset into a key value pair
   	 job.setInputFormatClass(TextInputFormat.class);
   	 
   	 //Defining output Format class which is responsible to parse the dataset into a key value pair
   	 job.setOutputFormatClass(TextOutputFormat.class);
   	 
   	 //Setting the second argument as a path in a path variable
   	 Path OutputPath = new Path(args[1]);
   	 
   	 //Configuring the input path from the file system into the job
   	 FileInputFormat.addInputPath(job,  new Path(args[0]));
   	 
   	 //Configuring the output path from the file system into the job
   	 FileOutputFormat.setOutputPath(job,  new Path(args[1]));
   	 
   	 //Deleting the context path automatically from hdfs so that we don't have to delete it explicitly
   	 OutputPath.getFileSystem(conf).delete(OutputPath, true);
   	 
   	 //Exiting the job only if the flag value becomes false
   	 System.exit(job.waitForCompletion(true) ? 0:1);
   	
    }
}

Hadoop Streaming

Hadoop Streaming is a utility that comes with the Hadoop distribution. It allows us to write Map / Reduce programs in any language.
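
In general, a streaming job is submitted by pointing the streaming jar at a mapper command, a reducer command, and HDFS input/output paths. Roughly (the angle-bracket values are placeholders, and the jar location depends on the installation):

hadoop jar <path-to>/hadoop-streaming-<version>.jar \
    -input <HDFS input path> \
    -output <HDFS output directory> \
    -mapper <mapper command> \
    -reducer <reducer command>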

Running Map / Reduce Programs in Python using Hadoop Streaming

Let’s code the word count program in Python and run it using Hadoop Streaming.


Mapper.py

#!/usr/bin/env python
import sys

# read input lines supplied by Hadoop Streaming on standard input
for line in sys.stdin:
    line = line.strip()
    words = line.split()

    # emit "word<TAB>1" for every word
    for word in words:
        print('%s\t%s' % (word, 1))

Reducer.py

#!/usr/bin/env python
import sys

current_word = None
current_count = 0
word = None

# Hadoop Streaming sorts the mapper output by key, so all counts
# for the same word arrive on consecutive lines.
for line in sys.stdin:
    line = line.strip()
    word, count = line.split('\t', 1)

    try:
        count = int(count)
    except ValueError:
        # ignore lines where the count is not a number
        continue

    if current_word == word:
        current_count += count
    else:
        if current_word:
            print('%s\t%s' % (current_word, current_count))
        current_count = count
        current_word = word

# emit the count for the last word
if current_word == word:
    print('%s\t%s' % (current_word, current_count))

In the previous example, we already uploaded the sample text file to HDFS. Save mapper.py and reducer.py in a folder (e.g., /home/dsu/Desktop/mapper.py and /home/dsu/Desktop/reducer.py).
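
Before submitting the job, it is worth making the scripts executable and testing them locally with a plain shell pipe (a quick sanity check; the sort step mimics the shuffle phase):

dsu@master:~$ chmod +x /home/dsu/Desktop/mapper.py /home/dsu/Desktop/reducer.py
dsu@master:~$ cat /home/dsu/Desktop/sample_text.txt | /home/dsu/Desktop/mapper.py | sort | /home/dsu/Desktop/reducer.py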

Run the program as follows.

dsu@master:~$ hadoop jar /home/dsu/hadoop/share/hadoop/tools/lib/hadoop-streaming-3.1.0.jar -file /home/dsu/Desktop/mapper.py -mapper mapper.py -file /home/dsu/Desktop/reducer.py -reducer reducer.py -input /user/WordCount/sample_text.txt -output /user/WordCount/wordcount_py

The output will be stored as part-00000 by default inside the /user/WordCount/wordcount_py directory in HDFS.

You can view the output as follows.

dsu@master:~$ hadoop fs -cat /user/WordCount/wordcount_py/part-00000

Even though Hadoop Streaming is the standard way to code map-reduce programs in languages other than Java, libraries such as mrjob, Hadoopy and Pydoop also support Python map-reduce programming.
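
As an illustration of one such library, a word count in mrjob might look roughly like the sketch below (this assumes mrjob is installed and has not been tested on the DSU cluster):

from mrjob.job import MRJob

class MRWordCount(MRJob):

    # mapper: emit (word, 1) for every word in the input line
    def mapper(self, _, line):
        for word in line.split():
            yield word, 1

    # reducer: sum the counts for each word
    def reducer(self, word, counts):
        yield word, sum(counts)

if __name__ == '__main__':
    MRWordCount.run()

Saved, say, as wordcount_mrjob.py, it can be run locally with "python wordcount_mrjob.py sample_text.txt", or against the cluster by adding the -r hadoop option (subject to mrjob's Hadoop configuration).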

Running Map / Reduce Programs in R using Hadoop Streaming

In this section, we will code the above “word count” program in R using Hadoop Streaming. You can also use an API called RHadoop to code map-reduce programs in R.


Mapper.R

#! /usr/bin/env Rscript
# mapper.R - Wordcount program in R
# script for Mapper (R-Hadoop integration)
 
trimWhiteSpace <- function(line) gsub("(^ +)|( +$)", "", line)
splitIntoWords <- function(line) unlist(strsplit(line, "[[:space:]]+"))
 
con <- file("stdin", open = "r")
 
while (length(line <- readLines(con, n = 1, warn = FALSE)) > 0) {
    line <- trimWhiteSpace(line)
    words <- splitIntoWords(line)
 
    for (w in words)
        cat(w, "\t1\n", sep="")
}
close(con)

Reducer.R

#! /usr/bin/env Rscript
# reducer.R - Wordcount program in R
# script for Reducer (R-Hadoop integration)
 
trimWhiteSpace <- function(line) gsub("(^ +)|( +$)", "", line)
 
splitLine <- function(line) {
    val <- unlist(strsplit(line, "\t"))
    list(word = val[1], count = as.integer(val[2]))
}
 
env <- new.env(hash = TRUE)
con <- file("stdin", open = "r")
 
while (length(line <- readLines(con, n = 1, warn = FALSE)) > 0) {
    line <- trimWhiteSpace(line)
    split <- splitLine(line)
    word <- split$word
    count <- split$count
 
    if (exists(word, envir = env, inherits = FALSE)) {
        oldcount <- get(word, envir = env)
        assign(word, oldcount + count, envir = env)
    }
    else assign(word, count, envir = env)
}
close(con)
 
for (w in ls(env, all = TRUE))
    cat(w, "\t", get(w, envir = env), "\n", sep = "")

Run the code as follows.

dsu@master:~$ hadoop jar /home/dsu/hadoop/share/hadoop/tools/lib/hadoop-streaming-3.1.0.jar -file /home/dsu/Desktop/mapper.R -mapper mapper.R -file /home/dsu/Desktop/reducer.R -reducer reducer.R -input /user/WordCount/sample_text.txt -output /user/WordCount/wordcount_R

You can view the output as we explained earlier.

Hadoop Ecosystem

The Hadoop ecosystem includes other tools to address particular needs. As of now, the DSU cluster accommodates Apache Pig, Apache Hive and Apache Flume.

Apache Pig

Apache Pig provides an alternative, simpler way to write map-reduce programs using a language called Pig Latin. Non-programmers can easily learn this language and write map-reduce programs in it; roughly ten lines of Pig Latin can replace a couple of hundred lines of Java code. Pig includes built-in operations such as join, group, filter and sort, so we do not have to write code for those operations. Pig sits on top of Hadoop and uses the Pig Engine to convert Pig Latin scripts into map-reduce jobs before execution.

Let’s code the “word count” program in Pig (word_count.pig).

lines = LOAD '/input_file.txt' AS (line:chararray);
words = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) as word;
grouped = GROUP words BY word;
wordcount = FOREACH grouped GENERATE group, COUNT(words);
DUMP wordcount;

Command to execute it.

dsu@master:~$pig word_count.pig

The output will be displayed on the terminal.
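
If you prefer to keep the result in HDFS rather than dumping it to the terminal, the DUMP line could be replaced with a STORE statement (the output path below is just an example):

STORE wordcount INTO '/user/WordCount/wordcount_pig' USING PigStorage('\t');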

Apache Hive

Apache Hive is a data warehousing package built on top of Hadoop for data analysis. Hive is targeted towards users comfortable with Structured Query Language (SQL). Hive uses an SQL-like language called HiveQL. Hive abstracts the complexity of Hadoop, so users do not need to write map-reduce programs; Hive queries are automatically converted into map-reduce jobs.

SQL is designed for tables that reside on a single machine, whereas in HDFS table data is distributed across multiple machines. HiveQL is used to analyze data in this distributed storage.

Hive is read-oriented and therefore not suitable for transaction processing, which typically involves a high percentage of write operations (e.g., ATM transactions). It is also not suitable for applications that need very fast response times, as Hadoop is intended for long sequential scans.

The following Demos use Pig and Hive.


Demo 1: Apache Hive

Load data from Local File System to HDFS and HDFS to Hive Table.


Step 1: Creating a table in Hive

Step 2: Copying the data from local file system to HDFS

Step 3: Loading the data from HDFS to Hive table


These steps are demonstrated in detail.

Step 1: Creating a table in Hive

(Inside Hive Shell)

hive> create database hive_demo;


List down all the available databases.

hive> show databases;


Use the newly created database

hive> use hive_demo;


Create a table inside the newly created database (hive_demo).

hive> create table user_details (
    > user_id int,
    > name string,
    > age int,
    > country string,
    > gender string
    > ) row format delimited fields terminated by ','
    > stored as textfile;

Check whether the table is created using the following commands.

hive> show tables;

hive> desc user_details;


Step 2: Copying the data from local file system to HDFS

hadoop fs -put /home/dsu/Desktop/dsu_review/userdata /


Step 3: Loading the data from HDFS to Hive table

hive> load data inpath '/userdata' into table user_details;


Now we can analyse the data using Hive Query Language.

hive> select * from user_details;

hive> select name from user_details where age >= 18;
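
A couple of further illustrative queries on the same table (hypothetical examples, just to show the SQL-like syntax):

hive> select country, count(*) from user_details group by country;

hive> select gender, avg(age) from user_details group by gender;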


Demo 2: Clickstream Analysis using Apache Hive and Apache Pig

Our task is to list down the top 3 websites that are viewed by people under 16.


User Details (table name = user_details):

1,Dilan,20,SL,M
2,Nisha,15,SL,F
3,Vani,14,SL,F
4,Steno,12,SL,F
5,Ajith,17,SL,M

Websites Visited (table name = clickstream):

1,www.bbc.com
1,www.facebook.com
1,www.gmail.com
2,www.cnn.com
2,www.facebook.com
2,www.gmail.com
2,www.stackoverflow.com
2,www.bbc.com
3,www.facebook.com
3,www.stackoverflow.com
3,www.gmail.com
3,www.cnn.com
4,www.facebook.com
4,www.abc.com
4,www.stackoverflow.com
5,www.gmail.com
5,www.stackoverflow.com
5,www.facebook.com

Here, there are two tables (user_details and clickstream).

The clickstream table lists down the websites viewed by each user.

The user_details table was created in the previous demo, so we only have to create the clickstream table.

Please follow the steps below.

hive> create table clickstream (userid int, url string)
    > row format delimited fields terminated by ','
    > stored as textfile;

hive> show tables;

hive> desc clickstream;

$hadoop fs -put /home/dsu/Desktop/dsu_review/clickstream /

hive> load data inpath '/clickstream' into table clickstream;

hive> select * from clickstream;

Task 1: Analyze the data using Hive

hive> select c.url, count(c.url) as cnt from user_details u JOIN clickstream c ON (u.user_id=c.userid) where u.age<16 group by c.url order by cnt DESC limit 3;

Answer:

www.stackoverflow.com	3
www.facebook.com	3
www.gmail.com	2

Task 2: Analyze the data using Apache Pig

Load the user_details and clickstream files to HDFS again (the earlier LOAD DATA INPATH commands moved them from these paths into the Hive warehouse).

$hadoop fs -put /home/dsu/Desktop/dsu_review/userdata /

$hadoop fs -put /home/dsu/Desktop/dsu_review/clickstream /


Program (pig_demo.pig)

users = load '/userdata' using PigStorage(',') as (user_id, name, age:int, country, gender);
filtered = filter users by age < 16;
pages = load '/clickstream' using PigStorage(',') as (userid, url);
joined = join filtered by user_id, pages by userid;
grouped = group joined by url;
summed = foreach grouped generate group, COUNT(joined) as clicks;
sorted = order summed by clicks desc;
top3 = limit sorted 3;
dump top3;

Command to execute it.

$cd /home/dsu/Desktop/dsu_review

$pig pig_demo.pig


Answer:

(www.facebook.com,3)
(www.stackoverflow.com,3)
(www.gmail.com,2)

Apache Flume

Apache Flume is an ecosystem tool used for streaming log files from applications into HDFS, e.g., downloading tweets and storing them in HDFS.


Demo 3: Downloading Tweets from Twitter using Apache Flume

Steps in brief:

Step 1: Create a Twitter application from apps.twitter.com.


Step 2: Get the following credentials.

  • Consumer key
  • Consumer key (secret)
  • Access Token
  • Access Token Secret


Step 3: Prepare a configuration file to download tweets

Provide all the necessary details and the type of tweets that you want to download; a sketch of such a configuration file is given below.

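A minimal sketch of what twitter.conf might contain follows. The exact source class and property names depend on the Flume build installed on the cluster, and the credential values are placeholders for the keys obtained in Step 2.

# twitter.conf - Flume agent for downloading tweets (sketch; values are placeholders)
TwitterAgent.sources = Twitter
TwitterAgent.channels = MemChannel
TwitterAgent.sinks = HDFS

# Twitter source: fill in the credentials from Step 2
TwitterAgent.sources.Twitter.type = com.cloudera.flume.source.TwitterSource
TwitterAgent.sources.Twitter.channels = MemChannel
TwitterAgent.sources.Twitter.consumerKey = <consumer key>
TwitterAgent.sources.Twitter.consumerSecret = <consumer secret>
TwitterAgent.sources.Twitter.accessToken = <access token>
TwitterAgent.sources.Twitter.accessTokenSecret = <access token secret>
TwitterAgent.sources.Twitter.keywords = hadoop, big data

# HDFS sink: where the downloaded tweets are written
TwitterAgent.sinks.HDFS.type = hdfs
TwitterAgent.sinks.HDFS.channel = MemChannel
TwitterAgent.sinks.HDFS.hdfs.path = /user/flume/tweets
TwitterAgent.sinks.HDFS.hdfs.fileType = DataStream
TwitterAgent.sinks.HDFS.hdfs.rollSize = 0
TwitterAgent.sinks.HDFS.hdfs.rollCount = 10000

# In-memory channel connecting the source and the sink
TwitterAgent.channels.MemChannel.type = memory
TwitterAgent.channels.MemChannel.capacity = 10000
TwitterAgent.channels.MemChannel.transactionCapacity = 100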


Step 4: Run the configuration file

flume/bin$ ./flume-ng agent -n TwitterAgent -c conf -f ../conf/twitter.conf -Dtwitter4j.http.proxyHost=cachex.pdn.ac.lk -Dtwitter4j.http.proxyPort=3128

It will download tweets and store them in HDFS.


Apache HBase


SQL vs NoSQL Databases

SQL - Structured Query Language

NOSQL - Not Only SQL


Relational Database Management Systems (RDBMS) use SQL. Simple SQL queries (e.g., select * from [table name]) are used to retrieve data. RDBMS do not incorporate distributed storage (storing data across multiple computers).

NoSQL databases address distributed storage. They do not have a SQL interface.


What is HBase?

HBase is a NoSQL database built on top of the Hadoop Distributed File System (HDFS). Data is stored as key-value pairs, and both keys and values are stored as byte arrays.


Difference between HBase and RDBMS

 

  • HBase is column-oriented; an RDBMS is (mostly) row-oriented.
  • HBase has a flexible schema, and columns can be added dynamically; an RDBMS has a fixed schema.
  • HBase handles sparse tables well; an RDBMS is not optimized for sparse tables (too many null values).
  • In HBase, joins are done using MapReduce and are not optimized; an RDBMS is optimized for joins.
  • HBase leverages batch processing with MapReduce distributed processing; an RDBMS does not.
  • HBase is good for semi-structured as well as structured data; an RDBMS is good for structured data.
  • HBase scales linearly and automatically with new nodes; an RDBMS usually scales vertically by adding more hardware resources.

When to use HBase?
  • Processing huge data (in Terabytes / Petabytes usually)
  • Column-oriented data
  • Unstructured data
  • High throughput (millions of inputs/transactions per second)
  • Variable columns
  • Versioned data – stores multiple versions of data using timestamps
  • Need random reads and writes
  • Generating data from Mapreduce workflow

Difference between HBase and Hive

 

  • HBase is well suited for CRUD operations (Create, Read, Update, Delete); Hive is not.
  • HBase maintains versions of data; Hive does not support versioning.
  • HBase has limited support for aggregations (e.g., finding the max/min/avg of a column); Hive has good support for aggregations.
  • HBase does not support table joins; Hive supports table joins.
  • HBase look-ups, and therefore reads, are very fast; Hive reads are not fast.

Two ways to use HBase
  • Using a command line interface called the HBase shell (create tables, update data, etc.)
  • The common way: interacting with HBase through Java programs

HBase has a Java API, which is the only first-class citizen. There are other programmatic interfaces (e.g., a REST API) as well.

HBase does not have an inbuilt SQL interface. There are non-native SQL interfaces available for HBase (e.g. Apache Phoenix, Impala, Presto, & Hive).


HBase Demo using DSU Cluster


We are going to demonstrate how to create a table, store, update, retrieve and delete data, and drop a table using both of the above-mentioned ways.

The following table will be used for the demonstration.

Employee (table name)

Row key: eid
personal (column family): name, gender
professional (column family): experience, salary

'eid01': name = 'Rahul', gender = 'male', experience = 5, salary = 80000
'eid02': name = 'Priya', gender = 'female', experience = 2, salary = 50000

A column family is used to group a set of columns together. Column families are not available in an RDBMS.


CRUD using HBase Shell

You need to connect to the DSU master node and type ‘hbase shell’ in a terminal. It will open the hbase shell.

dsu@master:~$ hbase shell

hbase(main)>


Create a table

When creating a table, we only need to provide the table name and the column families. Column names are not required.


hbase(main)> create 'table-name', 'column-family-1', 'column-family-2', ..., 'column-family-n'

hbase(main)> create 'employee', 'personal', 'professional'

To check whether the table was created, use the 'list' command. The 'list' command displays all the tables present in HBase.

hbase(main)> list


Insert Values

The 'put' command is used to insert data.

Column names are specified when values are inserted.

hbase(main)> put 'table-name', 'row-id', 'column-family:column-name', 'value'


hbase(main)> put 'employee', 'eid01', 'personal:name', 'Rahul'

hbase(main)> put 'employee', 'eid01', 'personal:gender', 'male'

hbase(main)> put 'employee', 'eid01', 'professional:experience', '5'

hbase(main)> put 'employee', 'eid01', 'professional:salary', '80000'

hbase(main)> put 'employee', 'eid02', 'personal:name', 'Priya'

hbase(main)> put 'employee', 'eid02', 'personal:gender', 'female'

hbase(main)> put 'employee', 'eid02', 'professional:experience', '2'

hbase(main)> put 'employee', 'eid02', 'professional:salary', '50000'


Scan all the records

hbase(main)> scan 'table-name'

hbase(main)> scan 'employee'


Get Values

Get all the records of ‘eid01’

hbase(main)> get 'employee', 'eid01'


Get the personal details of ‘eid01’

hbase(main)> get 'employee', 'eid01', 'personal'


Get the name of ‘eid01’

hbase(main)> get 'employee', 'eid01', 'personal:name'


Update a value

The 'put' command performs an upsert (update or insert).

If row_id exists, content will be updated.

If the row_id does not exist, it will be inserted.


Let’s change ‘Rahul’ to ‘Rajiv’ and run the command again. The value will be updated.

hbase(main)> put 'employee', 'eid01', 'personal:name', 'Rajiv'


Delete value

Delete a particular column from a column family.

hbase(main)> delete 'employee', 'eid01', 'personal:gender'


Drop a Table

Disable the table first and then delete it.

hbase(main)> disable 'employee'

hbase(main)> drop 'employee'


Now, we will do the above operations using Java API.


CRUD using Java API


The demonstration is illustrated using Eclipse, an integrated development environment for Java.

Eclipse is installed in DSU/HBase master node.

Please follow these steps carefully.


Step 1: Create a new Java project in Eclipse, name it as you wish and click ‘Finish’.

Step 2: Add the External Jars from /dsu/home/hbase-1.48/lib.

To do that, right-click the project you created → ‘Build Path’ → ‘Configure Build Path’ → ‘Libraries’ → ‘Add External JARs’ → select all → ‘Apply and Close’.

These JARs contain the HBase classes.

Some classes and their functionalities are listed below.

  • HBaseAdmin: creates a table, checks whether a table exists, disables a table, drops a table
  • HTableDescriptor: responsible for handling tables (table schema)
  • HColumnDescriptor: handles column families
  • HBaseConfiguration: identifies the HBase configuration
  • HTable: responsible for interacting with an HBase table (DML operations); it has several methods such as get, put and delete

Now you are ready to write HBase programs.


Create a table

Create a Java class ‘HBaseDDL.java’ inside src. You can use whatever class name you want.


import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.client.HBaseAdmin;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseDDL {
	
	public static void main(String[] args) throws IOException{
		// create a configuration object
		// it will store all the default configuration of hbase
		Configuration conf = HBaseConfiguration.create();
		
		HBaseAdmin admin = new HBaseAdmin(conf);
		
		HTableDescriptor des = new HTableDescriptor(Bytes.toBytes("employee"));
		
		des.addFamily(new HColumnDescriptor("personal"));
		des.addFamily(new HColumnDescriptor("professional"));
		
		if(admin.tableExists("employee")) {
			System.out.println("Table Already exists!");
			admin.disableTable("employee");
			admin.deleteTable("employee");
			System.out.println("Table: employee deleted");
		}
		
		admin.createTable(des);
		System.out.println("Table: employee successfully created");
	}

}

Insert Values

 

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class PutDemo {
	public static void main(String[] args)
	{
		Configuration conf = HBaseConfiguration.create();
		
		try {
			// connect to the 'employee' table
			HTable table = new HTable(conf, "employee");
			
			// insert a row with row key 'eid01'
			Put put = new Put(Bytes.toBytes("eid01"));
			put.add(Bytes.toBytes("personal"), Bytes.toBytes("name"), Bytes.toBytes("Rahul"));
			put.add(Bytes.toBytes("professional"), Bytes.toBytes("exp"), Bytes.toBytes("4"));
			table.put(put);
			System.out.println("inserted record eid01 to table employee ok.");
			
			put = new Put(Bytes.toBytes("eid02"));
			put.add(Bytes.toBytes("personal"), Bytes.toBytes("name"), Bytes.toBytes("Suraj"));
			put.add(Bytes.toBytes("professional"), Bytes.toBytes("exp"), Bytes.toBytes("2"));
			table.put(put);
			System.out.println("inserted record eid02 to table employee ok.");
			
		}
		
		catch (IOException e)
		{
			e.printStackTrace();
		}
		
	}

}

Update a Value

In the above PutDemo class, change the name of ‘eid01’ from ‘Rahul’ to ‘Rajiv’ and rerun it. The value will be updated.


Get Values

 

import java.io.IOException;


import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.KeyValue;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

public class GetDemo {
	public static void main(String [] args) throws IOException {
		Configuration conf = HBaseConfiguration.create();
		HTable table = new HTable(conf, "employee");
		// fetch the whole row with row key 'eid01'
		Get get = new Get(Bytes.toBytes("eid01"));
		Result rs = table.get(get);
		
		// print row key, family:qualifier, timestamp and value for every cell
		for(KeyValue kv: rs.raw()) {
			System.out.print(new String(kv.getRow()) + " ");
			System.out.print(new String(kv.getFamily()) + ":");
			System.out.print(new String(kv.getQualifier()) + " ");
			System.out.print(kv.getTimestamp() + " ");
			System.out.println(new String(kv.getValue()));
		}
	}

}

Delete a Row

 

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Delete;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.util.Bytes;

public class DeleteDemo {
	public static void main(String[] args) throws IOException {
		Configuration conf = HBaseConfiguration.create();
		HTable table = new HTable(conf, "employee");
		// delete the entire row with row key 'eid01'
		Delete del = new Delete(Bytes.toBytes("eid01"));
		table.delete(del);
		System.out.println("Row eid01 deleted");
	}

}

Scan all the records

 

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.KeyValue;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;

public class ScanDemo {
	public static void main(String[] args) throws IOException {
		Configuration conf = HBaseConfiguration.create();
		HTable table = new HTable(conf, "employee");
		// full table scan over the 'employee' table
		Scan sc = new Scan();
		ResultScanner rs = table.getScanner(sc);
		System.out.println("Get all records\n");
		
		for(Result r:rs) {
			for(KeyValue kv: r.raw()) {
				System.out.print(new String(kv.getRow()) + " ");
				System.out.print(new String(kv.getFamily()) + ":");
				System.out.print(new String(kv.getQualifier()) + " ");
				System.out.print(kv.getTimestamp() + " ");
				System.out.println(new String(kv.getValue()));
				
			}
		}
	}

}

DSU is working on adding more tools from the Hadoop ecosystem in the near future. Please contact us for further details.