
ImportTSV Data from HDFS into HBase

Inserting data into HBase by hand is tedious: every record needs its own, nearly identical put command, even when the same data is already present in HDFS.
What if a few lines were enough to copy that data into HBase? In this blog, you will see a utility that saves us from writing a script for every record. HBase ships with a number of utilities to make our work easier, and the one we are about to see is ImportTsv.
ImportTsv is a utility that loads data in TSV (tab-separated values) format from HDFS into HBase. By default it writes the data via Puts.
Find below the syntax used to load data via Puts (i.e., non-bulk loading):
$ bin/hbase org.apache.hadoop.hbase.mapreduce.ImportTsv -Dimporttsv.columns=a,b,c <tablename> <hdfs-inputdir>
In this blog, we will use a small sample dataset to practice loading data from HDFS into HBase.
Steps to Practical Execution
Before practicing the TSV import, all the Hadoop and HBase daemons must be running.
If Hadoop is not running, start it with Hadoop-X/sbin/start-all.sh
and then start the job history server with Hadoop-X/sbin/mr-jobhistory-daemon.sh start historyserver.
If HMaster is not running, start HBase with Hbase/bin/start-hbase.sh.

Now our system is ready.

Step1:

Inside the HBase shell, give the following command to create a table with two column families.
create 'bulktable', 'cf1', 'cf2'

Step2:

Exit the HBase shell to return to the terminal and make a directory for the HBase data on the local drive. If you prefer a path of your own, use it instead.
mkdir -p hbase
Now move to the directory where we will keep our data.
cd hbase

Step3:

Create a file named bulk_data.tsv inside the hbase directory, with tab-separated data, using the command below in the terminal.
vi bulk_data.tsv
Put the following data in it, separating the fields with a single tab character:
1	Amit	4
2	Girija	3
3	Jatin	5
4	Swati	3
Once done, save the file with Esc, then :wq, then Enter.
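If typing literal tabs in an editor is awkward, the same file can be generated with a short script. This is a minimal sketch, assuming Python 3 is available on the machine; it is an alternative to the vi step and not part of the original walkthrough.

```python
import csv

# Sample rows from this post: row key, name, experience.
rows = [
    ("1", "Amit", "4"),
    ("2", "Girija", "3"),
    ("3", "Jatin", "5"),
    ("4", "Swati", "3"),
]

# delimiter="\t" guarantees exactly one tab character between fields.
with open("bulk_data.tsv", "w", newline="") as handle:
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    writer.writerows(rows)

# Read the file back to confirm each line has three tab-separated fields.
with open("bulk_data.tsv") as handle:
    lines = handle.read().splitlines()

for line in lines:
    print(line.split("\t"))
```

Run it from inside the hbase directory so that bulk_data.tsv lands where the next step expects it.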

Step4:

The data must be present in HDFS when we run the import into HBase.
In real-time projects, the data will already be present in HDFS. Here, for learning purposes, we copy the data into HDFS using the commands below in the terminal.
Command:
hadoop fs -mkdir -p /user/root/hbase
Command:
hadoop fs -put bulk_data.tsv /user/root/hbase/
Command:
hadoop fs -cat /user/root/hbase/bulk_data.tsv

Step5:

Now that the data is present in HDFS, we give the following command in the terminal, along with the arguments <tablename> and <path of data in HDFS>.
Command:
hbase org.apache.hadoop.hbase.mapreduce.ImportTsv \
-Dimporttsv.columns=HBASE_ROW_KEY,cf1:name,cf2:exp bulktable \
/user/root/hbase/bulk_data.tsv
Observe that the map reaches 100% even though an error is reported afterward.
For now, ignore that error message, since our task is only to map the data into the HBase table.
Now let us check whether the data actually arrived in HBase by running the command below in the HBase shell.
scan 'bulktable'
All the rows are present in the table, confirming that the mapping of the tab-separated values succeeded.
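To make the role of the column specification concrete, here is a small illustration in plain Python of how one TSV line lines up with HBASE_ROW_KEY,cf1:name,cf2:exp. It is only a sketch of the mapping idea, not ImportTsv's actual implementation, and the function name tsv_line_to_cells is hypothetical.

```python
def tsv_line_to_cells(line, columns):
    """Pair each tab-separated field with its column name (illustrative only)."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != len(columns):
        # Every field in the input needs a matching column name.
        raise ValueError("expected %d fields, got %d" % (len(columns), len(fields)))
    row_key = None
    cells = {}
    for name, value in zip(columns, fields):
        if name == "HBASE_ROW_KEY":
            row_key = value          # this field becomes the row key
        else:
            cells[name] = value      # family:qualifier -> value
    return row_key, cells


columns = "HBASE_ROW_KEY,cf1:name,cf2:exp".split(",")
row_key, cells = tsv_line_to_cells("1\tAmit\t4", columns)
print(row_key, cells)
```

For the first sample row, the row key is 1, cf1:name holds Amit, and cf2:exp holds 4.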

Running ImportTsv with no arguments prints brief usage information:

Usage: importtsv -Dimporttsv.columns=a,b,c <tablename> <inputdir>
Imports the given input directory of TSV data into the specified table.
The column names of the TSV data must be specified using the -Dimporttsv.columns
option. This option takes the form of comma-separated column names, where each
column name is either a simple column family or a columnfamily:qualifier. The special
column name HBASE_ROW_KEY is used to designate that this column should be used
as the row key for each imported record. You must specify exactly one column
to be the row key, and you must specify a column name for every column that
exists in the input data.
By default importtsv will load data directly into HBase. To instead generate
HFiles of data to prepare for a bulk data load, pass the option:

-Dimporttsv.bulk.output=/path/for/output
 Note: the target table will be created with default column family descriptors if it does not already exist.
Other options that may be specified with -D include:
  -Dimporttsv.skip.bad.lines=false - fail if encountering an invalid line
  '-Dimporttsv.separator=|' - eg separate on pipes instead of tabs
  -Dimporttsv.timestamp=currentTimeAsLong - use the specified timestamp for the import
  -Dimporttsv.mapper.class=my.Mapper - A user-defined Mapper to use instead of org.apache.hadoop.hbase.mapreduce.TsvImporterMapper
We hope this post helped you import tab-separated values data. For any queries, feel free to comment below.
 

Python Tutorial – Data Types in Python

In our previous post, we looked at the introduction to Python and its installation.
In every programming language programmers work with data, so data types are one of the crucial ingredients of any programming language.
In Python, all data types are represented as objects: either built-in objects that Python provides, or objects that the user creates using Python classes.
Below are some of the core datatypes provided by Python.
Object type     Example literals/creation
Strings         'Python', "Programming", b'a\x01c', u'sp\xc4m'
Lists           [1, ['two', 'eight'], 4.5], list(range(10))
Dictionaries    {'planet': 'star', 'asteroid': 'yum'}, dict(hours=10)
Tuples          (1, 'star', 4, 'U'), tuple('star'), namedtuple
In programming languages like Java and C++, we need to declare variable types explicitly. In Python we don't have to declare them.
Python has specific syntax for creating each kind of object.
Example:
When you run the code below, a sequence of characters enclosed in single quotes, Python creates a string object.
>>> 'Python'
Similarly, an expression wrapped in square brackets makes a list, one in curly braces makes a dictionary, and so on. Even though there are no type declarations in Python, the syntax of the expressions you run determines the types of objects you create and use.
Now, let’s look at each datatype briefly.

Numbers:

Python's core numeric types fall into two main groups:
  1. Integers, which have no fractional part,
  2. Floating-point numbers, which have a fractional part.
Python also provides complex numbers with imaginary parts, decimals with fixed precision, and so on.
Numbers are immutable in Python, i.e. changing the value of a number results in a new object.
Example:
val1 = 27
The above statement creates a variable named 'val1' and assigns the integer value 27 to it.
Note: In Python we don't have to specify the data type. Python decides the type dynamically, based on the syntax of the value we assign.
To store a floating-point value, just assign a number with a decimal point to the variable.
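The points above about dynamic typing and immutability can be sketched as follows. This is a minimal example assuming Python 3; the variable names are illustrative.

```python
val1 = 27      # an integer literal creates an int object
val2 = 27.5    # a literal with a decimal point creates a float object

print(type(val1))
print(type(val2))

# Numbers are immutable: "changing" val1 binds the name to a new object
# instead of modifying the old one.
before = id(val1)
val1 = val1 + 1
after = id(val1)
print(before != after)
```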


Operations on Numbers:

We can perform all the basic mathematical operations on Python number objects, such as addition, subtraction, multiplication, and division.
Along with these expressions, Python ships with some useful math modules. We just need to import them to use their functionality.
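The basic operators mentioned above look like this in practice. A short sketch assuming Python 3, where the / operator always returns a float:

```python
a = 7
b = 2

total = a + b            # addition
difference = a - b       # subtraction
product = a * b          # multiplication
quotient = a / b         # true division returns a float in Python 3
floor_quotient = a // b  # floor division drops the fractional part
remainder = a % b        # modulo
power = a ** b           # exponentiation

print(total, difference, product, quotient, floor_quotient, remainder, power)
```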

Math Module:

This module provides the constant pi and basic functions such as sqrt, ceil, and floor.
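A quick sketch of the math module members named above, assuming Python 3:

```python
import math

print(math.pi)            # the constant pi

root = math.sqrt(16)      # square root, returned as a float
up = math.ceil(4.2)       # smallest integer >= 4.2
down = math.floor(4.8)    # largest integer <= 4.8

print(root, up, down)
```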

Random Module:

If we want the computer to pick a random number in a given range, pick an item from a list, draw a random card from a deck, or flip a coin, we can use the random module.
Example:
Generating a random integer between 1 and 8.
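A minimal sketch of these uses of the random module, assuming Python 3. The card and coin lists are illustrative, and the output differs on every run.

```python
import random

# randint includes both end points, so the result is between 1 and 8.
number = random.randint(1, 8)

# choice picks one item from a non-empty sequence.
card = random.choice(["ace", "king", "queen", "jack"])
coin = random.choice(["heads", "tails"])

print(number, card, coin)
```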

Strings:

Strings are immutable collections of characters. In Python, strings can store textual characters or arbitrary collections of bytes (such as the contents of an image file). String objects are sequences.
Sequences maintain a left-to-right order among their items, which are stored and fetched by their relative positions.
It's very easy to create string objects in Python: any characters enclosed in quotes become a string (the quotes can be single ' or double ").
Example:
Creating a sample string S1
Strings are objects in Python, and we can perform basic operations on them, such as finding the length of the string or accessing the character at a given position.
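Creating the sample string and the basic operations just described can be sketched like this, assuming Python 3. The value assigned to S1 is illustrative.

```python
S1 = "Python"

length = len(S1)   # number of characters in the string
first = S1[0]      # positions start at zero
last = S1[-1]      # negative positions count from the end
middle = S1[1:4]   # slice from position 1 up to, but not including, 4

print(length, first, last, middle)
```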

Strings are immutable, meaning we cannot change their contents in place. If we try to, we get an error.
To change a string, we build a new string with an expression and assign the new object to the variable.
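The behaviour described above can be seen in a short sketch, assuming Python 3:

```python
S1 = "Python"

# Assigning to a position fails because strings are immutable.
try:
    S1[0] = "J"
except TypeError as error:
    message = str(error)
    print(message)

# Build a new string instead and rebind the variable to it.
S1 = "J" + S1[1:]
print(S1)
```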

String Operations:

We will perform the following operations:
  • Converting a string from upper case to lower case and vice versa
  • Finding the offset of a character in the string
  • Concatenation: with the '+' operator, we can concatenate two strings
  • Splitting the string on a specific delimiter
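The operations listed above can be sketched as follows, assuming Python 3. The sample strings are illustrative.

```python
text = "Hello Python"

upper = text.upper()          # all characters in upper case
lower = text.lower()          # all characters in lower case

offset = text.find("P")       # offset of the first match
missing = text.find("z")      # find returns -1 when there is no match

joined = "data" + " " + "types"   # concatenation with +
parts = "a,b,c".split(",")        # split on a delimiter, returns a list

print(upper, lower, offset, missing, joined, parts)
```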

Lists:

Lists are collections of objects of the same or different types.
For example, a list can contain the details of customers, the students of a class, a list of movie titles, etc.
Now, we will create a list of customer names.
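A list of customer names might look like the sketch below, assuming Python 3. The names are illustrative.

```python
# A list literal is written in square brackets, items separated by commas.
customers = ["Amit", "Girija", "Jatin", "Swati"]

print(customers)
print(len(customers))
```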
A list of customer names is a collection of string objects. We can also create a list with different types of objects.
When you create a list in Python, the interpreter creates an array-like data structure in memory to hold your data, with your data items stacked from the bottom up. The first slot in the stack is numbered 0, the second is numbered 1, the third is numbered 2, and so on.
To access an item in the list, we use its index position, counting from zero.
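Mixed-type lists and zero-based indexing can be sketched like this, assuming Python 3. The values are illustrative.

```python
# A list can hold objects of different types, including another list.
mixed = ["Amit", 27, 4.5, ["nested", "list"]]

first = mixed[0]      # index 0 is the first slot
last = mixed[-1]      # -1 refers to the last slot
inner = mixed[3][0]   # index into the nested list

print(first, last, inner)
```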

Operations on list:

  • Appending an element to the list
  • Deleting an element from the list
  • Sorting the elements in the list
  • Reversing the order of the elements in the list
  • Iterating over the collection
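The list operations named above can be sketched as follows, assuming Python 3. The movie titles are illustrative.

```python
movies = ["Jaws", "Casablanca", "Alien"]

movies.append("Vertigo")   # add an element at the end
del movies[0]              # delete the element at index 0
movies.sort()              # sort in place, alphabetically
movies.reverse()           # reverse the order in place

# Iterate over the list, collecting the length of each title.
lengths = []
for title in movies:
    lengths.append(len(title))

print(movies, lengths)
```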

Dictionary:

The Python dictionary lets you organize your data by associating it with names (keys), i.e. it maps keys to values.
Mappings are not accessed by position; they simply map keys to associated values, and they can grow and shrink on demand. (Since Python 3.7, dictionaries preserve insertion order, but items are still fetched by key rather than by offset.)

Creating and Accessing Dictionary:

We can access a value by its associated key.
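Creating a dictionary and reading values by key can be sketched like this, assuming Python 3. The keys and values are illustrative.

```python
# A dictionary literal is written in curly braces as key: value pairs.
planet = {"name": "Earth", "moons": 1}

name = planet["name"]           # look up a value by its key
moons = planet["moons"]
radius = planet.get("radius")   # get returns None when the key is absent

print(name, moons, radius)
```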

Operations on Dictionary:

The len function is used to find the number of entries in a dictionary.
We can do a key membership test, i.e. test whether the specified key is present in the dictionary or not.

Converting the keys of a dictionary to a list.

Adding and deleting entries from a dictionary.

The del statement is used to remove item(s) from a dictionary.
Getting the values and keys from a dictionary as lists.
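The dictionary operations described above can be sketched as follows, assuming Python 3. The dictionary contents are illustrative.

```python
hours = {"mon": 8, "tue": 6}

count = len(hours)          # number of entries
has_mon = "mon" in hours    # key membership test
has_sun = "sun" in hours

hours["wed"] = 7            # add a new entry by assigning to a new key
del hours["tue"]            # remove an entry with the del statement

keys = list(hours.keys())       # keys as a list
values = list(hours.values())   # values as a list

print(count, has_mon, has_sun, keys, values)
```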

Few Dictionary Examples:
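One common use of a dictionary is counting how often each item occurs. The sketch below is an illustrative example chosen for this section, assuming Python 3; the sentence is made up.

```python
sentence = "the quick brown fox jumps over the lazy dog the end"

# Map each word to the number of times it appears.
counts = {}
for word in sentence.split():
    counts[word] = counts.get(word, 0) + 1

print(counts["the"])
print(counts["fox"])
```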

Tuples:

The tuple object is roughly like a list that cannot be changed. Tuples are used to represent fixed collections of items: the components of a specific calendar date, for instance.

Creating tuples and a few basic operations on them.
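Creating a tuple and a few basic operations can be sketched like this, assuming Python 3. The date values are illustrative.

```python
# A tuple literal is written in parentheses.
date = (2016, 7, 15)

year = date[0]               # indexing works as with lists
size = len(date)             # number of items
combined = date + (10, 30)   # concatenation builds a new tuple
position = date.index(7)     # offset of the first matching item

# Unpack the tuple into separate variables.
y, m, d = date

print(year, size, combined, position, y, m, d)
```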

Why use Tuples when we have lists?

Tuples are similar to lists, but they support fewer operations than lists.
The main reason we use tuples is that they are immutable.
A collection of objects passed around your program as a list can be changed anywhere, but if you use a tuple, it can't be changed because of its immutability.
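The difference shows up when a collection is handed to a function that tries to modify it. A minimal sketch assuming Python 3; the function name add_marker is hypothetical.

```python
def add_marker(items):
    """Try to modify the collection that was passed in."""
    items.append("changed")


shared_list = ["a", "b"]
add_marker(shared_list)     # the caller's list is modified in place
print(shared_list)

shared_tuple = ("a", "b")
try:
    add_marker(shared_tuple)
    tuple_changed = True
except AttributeError:
    # Tuples have no append method, so the data cannot be altered.
    tuple_changed = False
print(shared_tuple, tuple_changed)
```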
We hope this post is helpful in understanding the basics of Python datatypes. In our next post, we will be discussing Python statements.