HBase importtsv tool
Overview
HBase ships with a command-line tool named 'importtsv' that loads data from HDFS into HBase tables efficiently.
When large volumes of data must be loaded into an HBase table, this tool helps optimize performance and resource consumption.
Semarchy xDI Templates can load data from any database into HBase using this tool, with little configuration required in the Metadata and the Mapping.
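For reference, the command run by the Templates is the standard HBase ImportTsv job, executed on the remote Hadoop server (see the SSH section below). The sketch below is only an illustration: the table name, column mapping, separator, and HDFS path are hypothetical, and the actual command is generated from your Metadata and Mapping.

    # Illustrative invocation; table, columns, separator, and HDFS path are hypothetical
    hbase org.apache.hadoop.hbase.mapreduce.ImportTsv \
      -Dimporttsv.separator=',' \
      -Dimporttsv.columns=HBASE_ROW_KEY,cf:first_name,cf:city \
      my_namespace:customers \
      /user/xdi/tmp/importtsv/customers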
Prerequisites
The prerequisites for using the importtsv tool with Semarchy xDI are listed below:
- An HBase Metadata
- An HDFS temporary folder to store temporary data
- An SSH Metadata, used to run the importtsv command on the remote Hadoop server
- (Optional) The Kerberos keytab path on the remote Hadoop server, if the cluster is secured with Kerberos
The next sections of this page assume that these resources are already available.
Refer to Getting started with the Hadoop component for further information about HBase Metadata.
Configure HBase metadata for importtsv
HDFS temporary folder
As the purpose of the importtsv tool is to load data from HDFS into HBase, a temporary HDFS folder is needed to store the source data before it is loaded into the target table.
Simply drag and drop the HDFS folder Metadata Link you want to use as the temporary folder into the HBase Metadata.
Then, rename it to 'HDFS':
Refer to Getting started with the Hadoop component for further information about HDFS Metadata configuration.
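If you need to create or inspect this temporary folder manually, the standard HDFS shell can be used, as sketched below. The path is purely an example; use the folder configured in your HDFS Metadata.

    # Hypothetical temporary folder; adjust to match your HDFS Metadata
    hdfs dfs -mkdir -p /user/xdi/tmp/importtsv
    # List its contents, for instance to check leftover data after a failed run
    hdfs dfs -ls /user/xdi/tmp/importtsv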
Sqoop utility (optional)
The default behavior is to send the temporary data to HDFS using the HDFS APIs.
Optionally, you can instead configure the data to be sent to HDFS through the Sqoop Hadoop utility.
To do so, drag and drop a Sqoop Metadata Link into the Metadata of the HDFS temporary folder defined in the previous section.
Then, rename it to 'SQOOP':
Refer to Getting started with the Hadoop component for further information about Sqoop Metadata configuration.
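With this option, the transfer to the HDFS temporary folder is performed by a Sqoop import job instead of the HDFS APIs. The sketch below shows what such an import can look like; the JDBC URL, credentials, source table, and target directory are hypothetical, and the actual command is built by the Templates from the Metadata.

    # Hypothetical Sqoop import from a source database to the HDFS temporary folder
    sqoop import \
      --connect jdbc:postgresql://dbhost:5432/sales \
      --username sales_ro -P \
      --table CUSTOMERS \
      --target-dir /user/xdi/tmp/importtsv/customers \
      --fields-terminated-by '\t' \
      --as-textfile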
Specify the remote server information
Specify an SSH connection
The command is executed through SSH on the remote Hadoop server.
The HBase Metadata therefore requires the connection information for this server.
Simply drag and drop an SSH Metadata Link containing the SSH connection information into the HBase Metadata.
Then, rename it to 'SSH':
Templates currently only support executing the command through SSH. They are being updated to offer an alternative that executes the command locally on the Runtime, without requiring an SSH connection.
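Conceptually, the Runtime uses this SSH connection to start the importtsv command on the remote server, along the lines of the sketch below. The host, user, table, and paths are placeholders.

    # Placeholder host, user, table, and HDFS path
    ssh xdi_user@hadoop-edge01 \
      "hbase org.apache.hadoop.hbase.mapreduce.ImportTsv \
         -Dimporttsv.columns=HBASE_ROW_KEY,cf:first_name,cf:city \
         my_namespace:customers /user/xdi/tmp/importtsv/customers"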
Specify the Kerberos keytab path (optional)
If the Hadoop cluster is secured with Kerberos, authentication must be performed on the server before the command is executed.
As the command is started through SSH, you must indicate where the keytab used to authenticate on the remote server is located.
To do so, simply specify the 'Kerberos Remote Keytab File Path' in the Kerberos Principal used by HBase.
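On the remote server, this keytab is typically used with kinit to obtain a Kerberos ticket before running the command, as in the sketch below. The keytab path and principal are examples only; use the values defined in your Kerberos Metadata.

    # Example keytab path and principal
    kinit -kt /etc/security/keytabs/xdi.keytab xdi_user@EXAMPLE.COM
    # Verify that a ticket was obtained
    klist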
Refer to Getting started with Kerberos for more information.