RCFile, or Record Columnar File, is a flat-file data placement structure consisting of binary key/value pairs. Record Columnar means that the columns of a table are stored in a record columnar format: whereas in a relational table each row consists of a single record with a value for each column, in an RCFile the values of a single column are stored together as a single record in a row. RCFile partitions a relational table first horizontally into row groups (if the table has a large number of rows) and subsequently vertically into columns, so a column in a relational table becomes a row in an RCFile. RCFile has performance benefits in data loading, querying, and storage space utilization. For example, if column n of a relational table stores sales data, in RCFile format all the sales data is placed in the same row, which makes it easier to compute statistics on the sales data.
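As a simplified illustration (not the exact on-disk binary layout), consider a small table with columns id, product and sales. A row-oriented text file stores one record per line, whereas an RCFile row group stores each column's values together, one column per row:

Row-oriented layout:
1,Widget,100
2,Gadget,250
3,Gizmo,75

RCFile row group (one row per column):
id:      1;2;3
product: Widget;Gadget;Gizmo
sales:   100;250;75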
 
A Hive table's data may be stored as RCFile. In this tutorial we shall discuss how to load RCFile data from a Hive table into Oracle Database using Oracle Loader for Hadoop 3.0.0. The tutorial has the following sections.
 

Setting the Environment

 
We have used Oracle Linux 6.5 installed on Oracle VirtualBox 4.5. The following software is installed on Oracle Linux 6.5.
 
  1. Oracle Database 11g
  2. Hadoop 2.0.0
  3. Hive 0.10.0
  4. Oracle Loader for Hadoop 3.0.0
  5. Java 7
 
First, create a directory to install the software and set its permissions.
 
mkdir /rcfile
chmod -R 777 /rcfile
cd /rcfile
 
Download Java 7 from http://www.oracle.com/technetwork/java/javase/downloads/jdk7-downloads-1880260.html and extract the file using the following command.
 
tar zxvf jdk-7u55-linux-i586.tar.gz
 
 
Download CDH 4.6 Hadoop 2.0.0 and extract the tar file.
 
wget http://archive.cloudera.com/cdh4/cdh/4/hadoop-2.0.0-cdh4.6.0.tar.gz
tar -xvf hadoop-2.0.0-cdh4.6.0.tar.gz
 
Create symlinks for the bin and conf directories.
 
ln -s /rcfile/hadoop-2.0.0-cdh4.6.0/bin /rcfile/hadoop-2.0.0-cdh4.6.0/share/hadoop/mapreduce2/bin
ln -s /rcfile/hadoop-2.0.0-cdh4.6.0/etc/hadoop /rcfile/hadoop-2.0.0-cdh4.6.0/share/hadoop/mapreduce2/conf
 
Download Oracle Loader for Hadoop 3.0.0 from http://www.oracle.com/technetwork/database/database-technologies/bdc/big-data-connectors/downloads/index.html and extract the zip file.
 
unzip oraloader-3.0.0-h2.x86_64.zip
 
Download CDH 4.6 Hive 0.10.0 and extract the tar file.
 
wget http://archive.cloudera.com/cdh4/cdh/4/hive-0.10.0-cdh4.6.0.tar.gz
tar -xvf hive-0.10.0-cdh4.6.0.tar.gz
 
Configure the Hadoop properties for the NameNode URI (fs.defaultFS) and temporary directory (hadoop.tmp.dir) in the  /rcfile/hadoop-2.0.0-cdh4.6.0/etc/hadoop/core-site.xml configuration file.
 
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
 
<!-- Put site-specific property overrides in this file. -->
 
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://10.0.2.15:8020</value>
  </property>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>file:///var/lib/hadoop-0.20/cache</value>
  </property>
</configuration>
 
Remove any previously created tmp directory, then create a new tmp directory and set its permissions.
 
rm -rf /var/lib/hadoop-0.20/cache
mkdir -p /var/lib/hadoop-0.20/cache
chmod -R 777  /var/lib/hadoop-0.20/cache
 
Set the HDFS properties in the /rcfile/hadoop-2.0.0-cdh4.6.0/etc/hadoop/hdfs-site.xml configuration file.
 
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
 
<configuration>
 <property>
   <name>dfs.permissions.superusergroup</name>
   <value>hadoop</value>
  </property>
  <property>
    <name>dfs.namenode.name.dir</name>
    <value>file:///data/1/dfs/nn</value>
  </property>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
   </property>
   <property>
     <name>dfs.permissions</name>
     <value>false</value>
   </property>
</configuration>
 
Remove any previously created NameNode storage directory, then create a new directory and set its permissions.
 
rm -rf /data/1/dfs/nn
mkdir -p /data/1/dfs/nn
chmod -R 777 /data/1/dfs/nn
 
Create a hive-site.xml in the conf directory of the Hive installation by copying the hive-default.xml.template file.
 
cp /rcfile/hive-0.10.0-cdh4.6.0/conf/hive-default.xml.template /rcfile/hive-0.10.0-cdh4.6.0/conf/hive-site.xml
 
Set the Hive configuration properties hive.metastore.warehouse.dir, the directory in which Hive-managed data is stored, and hive.metastore.uris, the URI(s) through which a Thrift client accesses the Hive metastore. The Hive properties are configured in the /rcfile/hive-0.10.0-cdh4.6.0/conf/hive-site.xml file.
 
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
<property>
  <name>hive.metastore.warehouse.dir</name>
  <value>hdfs://10.0.2.15:8020/user/hive/warehouse</value>
</property>
<property>
  <name>hive.metastore.uris</name>
  <value>thrift://localhost:10000</value>
</property>
</configuration>
 
Set the environment variables for Oracle Database, Hive, Hadoop, Java and Oracle Loader for Hadoop in the bash shell.
 
vi ~/.bashrc
export HADOOP_PREFIX=/rcfile/hadoop-2.0.0-cdh4.6.0
export HADOOP_CONF=$HADOOP_PREFIX/etc/hadoop
export HIVE_HOME=/rcfile/hive-0.10.0-cdh4.6.0
export HIVE_CONF=$HIVE_HOME/conf
export OLH_HOME=/rcfile/oraloader-3.0.0-h2
export JAVA_HOME=/rcfile/jdk1.7.0_55
export ORACLE_HOME=/home/oracle/app/oracle/product/11.2.0/dbhome_1
export ORACLE_SID=ORCL
export HADOOP_MAPRED_HOME=/rcfile/hadoop-2.0.0-cdh4.6.0/bin
export HADOOP_HOME=/rcfile/hadoop-2.0.0-cdh4.6.0/share/hadoop/mapreduce2
export HADOOP_CLASSPATH=$HADOOP_HOME/*:$HADOOP_HOME/lib/*:$HIVE_HOME/lib/*:$OLH_HOME/jlib/*:$HIVE_CONF
export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_MAPRED_HOME:$ORACLE_HOME/bin:$HIVE_HOME/bin
export CLASSPATH=$HADOOP_CLASSPATH
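Apply the settings to the current shell session:

source ~/.bashrc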
 
Format the NameNode and start HDFS (NameNode and DataNode).
 
hadoop namenode -format
hadoop namenode
hadoop datanode
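Optionally, the jps utility included with the JDK may be used in another terminal to confirm that the NameNode and DataNode processes are running:

jps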
 
 
Create the Hive storage directory in HDFS and set its permissions.
 
hadoop dfs -mkdir  hdfs://10.0.2.15:8020/user/hive/warehouse
hadoop dfs -chmod -R g+w hdfs://10.0.2.15:8020/user/hive/warehouse
 
We also need to add the Oracle Loader for Hadoop (OLH) and Hive JAR files to HDFS so that they are available on the runtime classpath of OLH. Create a /rcfile directory in HDFS and set its permissions.
 
hdfs dfs -mkdir hdfs://localhost:8020/rcfile
hadoop dfs -chmod -R g+w hdfs://localhost:8020/rcfile
 
Put the OLH directory into the HDFS /rcfile directory.
 
hdfs dfs -put /rcfile/oraloader-3.0.0-h2 hdfs://localhost:8020/rcfile
 
Create a directory structure for Hive’s lib directory in HDFS and set its permissions.
 
hadoop dfs -mkdir hdfs://localhost:8020/rcfile/hive-0.10.0-cdh4.6.0
hadoop dfs -mkdir hdfs://localhost:8020/rcfile/hive-0.10.0-cdh4.6.0/lib
hadoop dfs -chmod -R g+w hdfs://localhost:8020/rcfile/hive-0.10.0-cdh4.6.0/lib
 
Put the Hive lib directory jar files into HDFS.
 
hadoop dfs -put /rcfile/hive-0.10.0-cdh4.6.0/lib/* hdfs://localhost:8020/rcfile/hive-0.10.0-cdh4.6.0/lib
 
We also need to create an Oracle Database table into which to load data from the Hive table stored as RCFile. The Oracle Database table should have the same structure as the Hive table.
 
CREATE TABLE OE.wlslog (time_stamp VARCHAR2(255),
    category VARCHAR2(255),
    type VARCHAR2(255),
    servername VARCHAR2(255),
    code VARCHAR2(255),
    msg VARCHAR2(255));
 
Run the preceding SQL command in SQL*Plus to create the Oracle Database table.
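For example, assuming the OE schema account is unlocked and uses the password OE (the credentials configured later in OraLoadConf.xml), connect in SQL*Plus and verify the table:

sqlplus OE/OE
SQL> DESCRIBE wlslog;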
 
 

Creating a Hive Table Stored as RCFile

 
We shall store the following data file, wlslog.txt, as a text file in HDFS and subsequently as RCFile data for a Hive table.
 
Apr-8-2014-7:06:16-PM-PDT,Notice,WebLogicServer,AdminServer,BEA-000365,Server state changed to STANDBY
Apr-8-2014-7:06:17-PM-PDT,Notice,WebLogicServer,AdminServer,BEA-000365,Server state changed to STARTING
Apr-8-2014-7:06:18-PM-PDT,Notice,WebLogicServer,AdminServer,BEA-000365,Server state changed to ADMIN
Apr-8-2014-7:06:19-PM-PDT,Notice,WebLogicServer,AdminServer,BEA-000365,Server state changed to RESUMING
Apr-8-2014-7:06:20-PM-PDT,Notice,WebLogicServer,AdminServer,BEA-000331,Started WebLogic AdminServer
Apr-8-2014-7:06:21-PM-PDT,Notice,WebLogicServer,AdminServer,BEA-000365,Server state changed to RUNNING
Apr-8-2014-7:06:22-PM-PDT,Notice,WebLogicServer,AdminServer,BEA-000360,Server started in RUNNING mode
 
Create a directory in HDFS for the wlslog.txt text file and set its permissions.
 
hdfs dfs -mkdir hdfs://localhost:8020/wlslog
hadoop dfs -chmod -R g+w hdfs://localhost:8020/wlslog
 
Put the wlslog.txt file into the HDFS /wlslog directory.
 
hdfs dfs -put wlslog.txt hdfs://localhost:8020/wlslog/wlslog.txt
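Optionally, verify that the file was copied to HDFS:

hdfs dfs -ls hdfs://localhost:8020/wlslog
hdfs dfs -cat hdfs://localhost:8020/wlslog/wlslog.txt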
 
Start the Hive server.
 
 
 
>hive --service hiveserver
 
Start the Hive shell.
 
>hive
 
Create a Hive external table weblogic_log stored as a text file at the HDFS location of wlslog.txt. The Hive table has the same columns as the Oracle Database table into which the data is to be loaded.
 
hive>CREATE EXTERNAL TABLE weblogic_log(time_stamp STRING,
    category STRING,
    type STRING,
    servername STRING,
    code STRING,
    msg STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' LINES TERMINATED BY '\n' STORED AS TEXTFILE LOCATION 'hdfs://localhost:8020/wlslog';
 
Run a SELECT query on the Hive table weblogic_log to list the Hive table data.
 
hive>SELECT * from weblogic_log;
 
Create another Hive table wlslog_RCFile stored as RCFile using the STORED AS RCFILE clause, with the same columns as the first Hive table.
 
CREATE TABLE wlslog_RCFile(time_stamp STRING,
    category STRING,
    type STRING,
    servername STRING,
    code STRING,
    msg STRING) STORED AS RCFILE;
 
The Hive tables get created, as indicated by the output in the Hive shell.
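The storage format of the wlslog_RCFile table can be verified with Hive's DESCRIBE FORMATTED command, which should report org.apache.hadoop.hive.ql.io.RCFileInputFormat as the table's InputFormat:

hive>DESCRIBE FORMATTED wlslog_RCFile;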
 
 
Load the data from the Hive external table weblogic_log (backed by the wlslog.txt file) into the Hive table wlslog_RCFile stored as RCFile using the following command in Hive.
 
INSERT OVERWRITE table wlslog_RCFile SELECT * from weblogic_log;
 
As the output indicates, the wlslog.txt file data gets loaded into the Hive table stored as RCFile.
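A SELECT query on the wlslog_RCFile table should return the same seven records that were loaded from the wlslog.txt file:

hive>SELECT * from wlslog_RCFile;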
 
 
 

RCFile Structure

 
We earlier introduced the RCFile data format as being record columnar. In this section we shall illustrate the difference using the text file wlslog.txt and the RCFile data file for the Hive table, both stored in HDFS.
 
After creating the Hive table stored as RCFile, log in to the NameNode web UI at the URL http://localhost:50070. Browse the file system and navigate to the wlslog.txt file in HDFS.
 
 
Click on the wlslog.txt file link to display the data stored in the file. The data format is row-based, with each record on its own row.
 
 
For comparison, navigate to the wlslog_rcfile directory in the Hive warehouse directory /user/hive/warehouse.
 
 
Click on the data file under wlslog_rcfile to display the data for the Hive table stored as RCFile. The data in an RCFile is stored in record columnar format: values that would belong to the same column are stored adjacently in the same row. For example, all the time_stamp values are stored adjacently, and all the msg values are stored adjacently. Each column's values are stored adjacently in the same row, and the metadata stores the offset of each column's data.
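The same data file may also be inspected from the command line; the column data and metadata are partly binary, so only fragments of the text are readable. The file name 000000_0 shown here is the one generated by the INSERT in this run and may differ:

hadoop dfs -cat hdfs://10.0.2.15:8020/user/hive/warehouse/wlslog_rcfile/000000_0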
 
 
 
However, when a SELECT query is run in the Hive shell, the data is not returned in record columnar format; it is listed with each record on its own row.
 
 
 

Loading Hive Table Data into Oracle Database

 
Having loaded the text file data into the RCFile-based Hive table, next we shall load data from that Hive table into Oracle Database using Oracle Loader for Hadoop 3.0.0. Create a configuration file OraLoadConf.xml to specify the input and output formats for OLH, the Hive database and table to load from, the target Oracle Database table, and the connection parameters to access Oracle Database. The OraLoadConf.xml file is listed.
 
<?xml version="1.0" encoding="UTF-8" ?>
<configuration>
 
<!--                          Input settings                             -->
 
 <property>
   <name>mapreduce.inputformat.class</name>
   <value>oracle.hadoop.loader.lib.input.HiveToAvroInputFormat</value>
 </property>
 
 <property>
   <name>oracle.hadoop.loader.input.hive.databaseName</name>
   <value>default</value>
 </property>
 
 <property>
   <name>oracle.hadoop.loader.input.hive.tableName</name>
   <value>wlslog_rcfile</value>
 </property>
 
<!--                          Output settings                    -->         
 <property>
   <name>mapreduce.job.outputformat.class</name>
   <value>oracle.hadoop.loader.lib.output.JDBCOutputFormat</value>
 </property>
  
 <property>
   <name>mapreduce.output.fileoutputformat.outputdir</name>
   <value>oraloadout</value>
 </property>
 
<!--                          Table information     -->                      
 
 <property>
   <name>oracle.hadoop.loader.loaderMap.targetTable</name>
   <value>OE.WLSLOG</value>
 </property>     
 
<!--                          Connection information        -->             
 
<property>
  <name>oracle.hadoop.loader.connection.url</name>
  <value>jdbc:oracle:thin:@${HOST}:${TCPPORT}:${SID}</value>
</property>
 
<property>
  <name>TCPPORT</name>
  <value>1521</value>
</property>
 
<property>
  <name>HOST</name>
  <value>localhost</value>
</property>
 
<property>
  <name>SID</name>
  <value>ORCL</value>
</property>
 
<property>
  <name>oracle.hadoop.loader.connection.user</name>
  <value>OE</value>
</property>
 
<property>
  <name>oracle.hadoop.loader.connection.password</name>
  <value>OE</value>
</property>
 
</configuration>
 
Run Oracle Loader for Hadoop with the following hadoop command.
 
hadoop jar $OLH_HOME/jlib/oraloader.jar   oracle.hadoop.loader.OraLoader   -conf OraLoadConf.xml -libjars $OLH_HOME/jlib/oraloader.jar 
 
The MapReduce job runs to load the Hive table data into an Oracle Database table.
 
 
More detailed output from the hadoop command that runs OLH is listed below:
 
[root@localhost rcfile]# hadoop jar $OLH_HOME/jlib/oraloader.jar   oracle.hadoop.loader.OraLoader   -conf OraLoadConf.xml -libjars $OLH_HOME/jlib/oraloader.jar
Oracle Loader for Hadoop Release 3.0.0 - Production
 
Copyright (c) 2011, 2014, Oracle and/or its affiliates. All rights reserved.
 
 
14/09/13 18:33:34 INFO loader.OraLoader: Oracle Loader for Hadoop Release 3.0.0 - Production
 
Copyright (c) 2011, 2014, Oracle and/or its affiliates. All rights reserved.
 
 
14/09/13 18:33:56 INFO loader.OraLoader: oracle.hadoop.loader.loadByPartition is disabled because table: WLSLOG is not partitioned
14/09/13 18:33:56 INFO loader.OraLoader: oracle.hadoop.loader.enableSorting disabled, no sorting key provided
14/09/13 18:33:56 INFO loader.OraLoader: Reduce tasks set to 0 because of no partitioning or sorting. Loading will be done in the map phase.
14/09/13 18:33:56 INFO output.DBOutputFormat: Setting map tasks speculative execution to false for : oracle.hadoop.loader.lib.output.JDBCOutputFormat
14/09/13 18:34:04 INFO loader.OraLoader: Sampling time=0D:0h:0m:0s:686ms (686 ms)
14/09/13 18:34:04 INFO loader.OraLoader: Submitting OraLoader job OraLoader
14/09/13 18:34:08 INFO jvm.JvmMetrics: Initializing JVM Metrics with processName=JobTracker, sessionId=
14/09/13 18:34:20 INFO hive.metastore: Trying to connect to metastore with URI thrift://localhost:10000
14/09/13 18:34:20 INFO hive.metastore: Connected to metastore.
14/09/13 18:34:23 INFO mapred.FileInputFormat: Total input paths to process : 1
14/09/13 18:34:23 INFO mapreduce.JobSubmitter: number of splits:1
14/09/13 18:34:24 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_local823391793_0001
14/09/13 18:34:49 INFO mapreduce.Job: The url to track the job: http://localhost:8080/
14/09/13 18:34:49 INFO mapred.LocalJobRunner: OutputCommitter set in config null
14/09/13 18:34:51 INFO mapred.LocalJobRunner: OutputCommitter is oracle.hadoop.loader.lib.output.DBOutputCommitter
14/09/13 18:34:51 INFO mapred.LocalJobRunner: Waiting for map tasks
14/09/13 18:34:52 INFO mapred.LocalJobRunner: Starting task: attempt_local823391793_0001_m_000000_0
14/09/13 18:34:52 INFO loader.OraLoader: map 0% reduce 0%
14/09/13 18:34:54 INFO mapred.Task:  Using ResourceCalculatorProcessTree : [ ]
14/09/13 18:34:54 INFO mapred.MapTask: Processing split: hdfs://10.0.2.15:8020/user/hive/warehouse/wlslog_rcfile/000000_0:0+783
14/09/13 18:34:59 INFO output.DBOutputFormat: conf prop: defaultExecuteBatch: 100
14/09/13 18:35:00 INFO output.DBOutputFormat: conf prop: loadByPartition: false
14/09/13 18:35:04 INFO output.DBOutputFormat: Insert statement: INSERT INTO "OE"."WLSLOG" ("TIME_STAMP", "CATEGORY", "TYPE", "SERVERNAME", "CODE", "MSG") VALUES (?, ?, ?, ?, ?, ?)
14/09/13 18:35:04 INFO mapred.LocalJobRunner:
14/09/13 18:35:06 INFO mapred.Task: Task:attempt_local823391793_0001_m_000000_0 is done. And is in the process of committing
14/09/13 18:35:06 INFO mapred.LocalJobRunner:
14/09/13 18:35:06 INFO mapred.Task: Task attempt_local823391793_0001_m_000000_0 is allowed to commit now
14/09/13 18:35:07 INFO output.JDBCOutputFormat: Committed work for task attempt attempt_local823391793_0001_m_000000_0
14/09/13 18:35:07 INFO output.FileOutputCommitter: Saved output of task 'attempt_local823391793_0001_m_000000_0' to hdfs://10.0.2.15:8020/user/root/oraloadout/_temporary/0/task_local823391793_0001_m_000000
14/09/13 18:35:07 INFO mapred.LocalJobRunner: map
14/09/13 18:35:07 INFO mapred.Task: Task 'attempt_local823391793_0001_m_000000_0' done.
14/09/13 18:35:07 INFO mapred.LocalJobRunner: Finishing task: attempt_local823391793_0001_m_000000_0
14/09/13 18:35:07 INFO mapred.LocalJobRunner: Map task executor complete.
14/09/13 18:35:07 INFO loader.OraLoader: map 100% reduce 0%
14/09/13 18:35:07 INFO loader.OraLoader: Job complete: OraLoader (job_local823391793_0001)
14/09/13 18:35:07 INFO loader.OraLoader: Counters: 23
     File System Counters
          FILE: Number of bytes read=10412642
          FILE: Number of bytes written=11375302
          FILE: Number of read operations=0
          FILE: Number of large read operations=0
          FILE: Number of write operations=0
          HDFS: Number of bytes read=10423979
          HDFS: Number of bytes written=9769986
          HDFS: Number of read operations=239
          HDFS: Number of large read operations=0
          HDFS: Number of write operations=36
     Map-Reduce Framework
          Map input records=7
          Map output records=7
          Input split bytes=1133
          Spilled Records=0
          Failed Shuffles=0
          Merged Map outputs=0
          GC time elapsed (ms)=200
          CPU time spent (ms)=0
          Physical memory (bytes) snapshot=0
          Virtual memory (bytes) snapshot=0
          Total committed heap usage (bytes)=19501056
     File Input Format Counters
          Bytes Read=0
     File Output Format Counters
          Bytes Written=1624
[root@localhost rcfile]#
 
Subsequently, run a SELECT statement in SQL*Plus to list the loaded data.
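For example, the following query run as the OE user should return the seven rows loaded by OLH:

SQL> SELECT * FROM OE.WLSLOG;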
 
 
 
The ORC (Optimized Row Columnar) file format is a more efficient columnar format than RCFile; it requires less storage space and less access time than RCFile. Support for storing a Hive table as ORC was added in Hive 0.11. Replace STORED AS RCFILE with STORED AS ORC to store a Hive table in ORC format.
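For example, with Hive 0.11 or later (not the Hive 0.10.0 used in this tutorial), a hypothetical wlslog_orc table with the same columns could be defined as follows:

CREATE TABLE wlslog_orc(time_stamp STRING,
    category STRING,
    type STRING,
    servername STRING,
    code STRING,
    msg STRING) STORED AS ORC;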
 
In this tutorial we loaded data from a Hive table stored as RCFile into an Oracle Database table using Oracle Loader for Hadoop 3.0.0.