Loading RCFile Format Data into Oracle Database
RCFile or Record Columnar File format is a flat file data placement structure consisting of binary key/value pairs. Record Columnar implies that columns of a table are stored in a record columnar format. For comparison, in a relational table each row consists a single record with column values for each of the columns. In a RCFile the column values for a single column are stored as a single record in a row. RCFile partitions a relational table first horizontally into row groups (if the table has a large number of rows) and subsequently vertically into columns to store data in columns. A column in a relational table becomes a row in a RCFile. RCFile has performance benefits in data loading and querying and storage space utilization. For example if column n in a relational table stored the sales data, in RCFile format all the sales data is placed in the same row, which makes it easier to compute statistics on the sales data.
A Hive table data may be stored as RCFile. In this tutorial we shall discuss how to load RCFile data from a Hive table into Oracle Database using Oracle Loader for Hadoop 3.0.0. The tutorial has the following sections.
Setting the Environment
We have used Oracle Linux 6.5 installed on Oracle VirtualBox 4.5. The following software is installed on Oracle Linux 6.5.
- Oracle Database 11g
- Hadoop 2.0.0
- Hive 0.10.0
- Oracle Loader for Hadoop 3.0.0
- Java 7
First, create a directory to install the software and set its permissions.
mkdir /rcfile
chmod -R 777 /rcfile
cd /rcfile
Download Java 7 from http://www.oracle.com/technetwork/java/javase/downloads/jdk7-downloads-1880260.html and extract the file using the following command.
>tar zxvf jdk-7u55-linux-i586.gz
Download CDH 4.6 Hadoop 2.0.0 and extract the tar file.
wget http://archive.cloudera.com/cdh4/cdh/4/hadoop-2.0.0-cdh4.6.0.tar.gz
tar -xvf hadoop-2.0.0-cdh4.6.0.tar.gz
Create symlinks for the bin and conf directories.
ln -s /rcfile/hadoop-2.0.0-cdh4.6.0/bin /rcfile/hadoop-2.0.0-cdh4.6.0/share/hadoop/mapreduce2/bin
ln -s /rcfile/hadoop-2.0.0-cdh4.6.0/etc/hadoop /rcfile/hadoop-2.0.0-cdh4.6.0/share/hadoop/mapreduce2/conf
Download Oracle Loader for Hadoop 3.0.0 from http://www.oracle.com/technetwork/database/database-technologies/bdc/big-data-connectors/downloads/index.html and extract the zip file.
>unzip oraloader-3.0.0-h2.x86_64.zip
Download CDH 4.6 Hive 0.10.0 and extract the tar file.
wget http://archive.cloudera.com/cdh4/cdh/4/hive-0.10.0-cdh4.6.0.tar.gz
tar -xvf hive-0.10.0-cdh4.6.0.tar.gz
Configure the Hadoop properties for the NameNode URI (fs.defaultFS) and temporary directory (hadoop.tmp.dir) in the /rcfile/hadoop-2.0.0-cdh4.6.0/etc/hadoop/core-site.xml configuration file.
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
<property>
<name>fs.defaultFS</name>
<value>hdfs://10.0.2.15:8020</value>
</property>
<property>
<name>hadoop.tmp.dir</name>
<value>file:///var/lib/hadoop-0.20/cache</value>
</property>
</configuration>
Remove any previously created tmp directory and create a new tmp directory and set its permissions.
rm -rf /var/lib/hadoop-0.20/cache
mkdir -p /var/lib/hadoop-0.20/cache
chmod -R 777 /var/lib/hadoop-0.20/cache
Set the HDFS properties in the /rcfile/hadoop-2.0.0-cdh4.6.0/etc/hadoop/hdfs-site.xml configuration file.
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
<property>
<name>dfs.permissions.superusergroup</name>
<value>hadoop</value>
</property>
<property>
<name>dfs.namenode.name.dir</name>
<value>file:///data/1/dfs/nn</value>
</property>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
<property>
<name>dfs.permissions</name>
<value>false</value>
</property>
</configuration>
Remove any previously created NameNode storage directory and create a new directory and set its permissions.
rm -rf /data/1/dfs/nn
mkdir -p /data/1/dfs/nn
chmod -R 777 /data/1/dfs/nn
Create a hive-site.xml in the conf directory of the Hive installation by copying the hive-default.xml.template file.
cp /rcfile/hive-0.10.0-cdh4.6.0/conf/hive-default.xml.template /rcfile/hive-0.10.0-cdh4.6.0/conf/hive-site.xml
Set the Hive configuration properties hive.metastore.warehouse.dir, for the directory in which the Hive managed data is stored, and hive.metastore.uris, for the URI/s for a thrift client to access Hive. The Hive properties are configured in the /rcfile/hive-0.10.0-cdh4.6.0/conf/hive-site.xml file.
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
<property>
<name>hive.metastore.warehouse.dir</name>
<value>hdfs://10.0.2.15:8020/user/hive/warehouse</value>
</property>
<property>
<name>hive.metastore.uris</name>
<value>thrift://localhost:10000</value>
</property>
</configuration>
Set the environment variables for Oracle Database, Hive, Hadoop, Java and Oracle Loader for Hadoop in the bash shell.
vi ~/.bashrc
export HADOOP_PREFIX=/rcfile/hadoop-2.0.0-cdh4.6.0
export HADOOP_CONF=$HADOOP_PREFIX/etc/hadoop
export HIVE_HOME=/rcfile/hive-0.10.0-cdh4.6.0
export HIVE_CONF=$HIVE_HOME/conf
export OLH_HOME=/rcfile/oraloader-3.0.0-h2
export JAVA_HOME=/rcfile/jdk1.7.0_55
export ORACLE_HOME=/home/oracle/app/oracle/product/11.2.0/dbhome_1
export ORACLE_SID=ORCL
export HADOOP_MAPRED_HOME=/rcfile/hadoop-2.0.0-cdh4.6.0/bin
export HADOOP_HOME=/rcfile/hadoop-2.0.0-cdh4.6.0/share/hadoop/mapreduce2
export HADOOP_CLASSPATH=$HADOOP_HOME/*:$HADOOP_HOME/lib/*:$HIVE_HOME/lib/*:$OLH_HOME/jlib/*:$HIVE_CONF
export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_MAPRED_HOME:$ORACLE_HOME/bin:$HIVE_HOME/bin
export CLASSPATH=$HADOOP_CLASSPATH
Configure the NameNode and start the HDFS (NameNode and DataNode).
hadoop namenode -format
hadoop namenode
hadoop datanode
Creat the Hive storage directory in HDFS and set its permissions.
hadoop dfs -mkdir hdfs://10.0.2.15:8020/user/hive/warehouse
hadoop dfs -chmod -R g+w hdfs://10.0.2.15:8020/user/hive/warehouse
We also need to add Oracle Loader for Hadoop (OLH) and Hive Jar files to HDFS to be made available in the runtime classpath of OLH. Create a directory /rcfile in HDFS and set its permissions.
hdfs dfs -mkdir hdfs://localhost:8020/rcfile
hadoop dfs -chmod -R g+w hdfs://localhost:8020/rcfile
Put the OLH directory into the HDFS /rcfile directory.
hdfs dfs -put /rcfile/oraloader-3.0.0-h2 hdfs://localhost:8020/rcfile
Create a directory structure for Hive’s lib directory in HDFS and set its permissions.
hadoop dfs -mkdir hdfs://localhost:8020/rcfile/hive-0.10.0-cdh4.6.0
hadoop dfs -mkdir hdfs://localhost:8020/rcfile/hive-0.10.0-cdh4.6.0/lib
hadoop dfs -chmod -R g+w hdfs://localhost:8020/rcfile/hive-0.10.0-cdh4.6.0/lib
Put the Hive lib directory jar files into HDFS.
hadoop dfs -put /rcfile/hive-0.10.0-cdh4.6.0/lib/* hdfs://localhost:8020/rcfile/hive-0.10.0-cdh4.6.0/lib
We also need to create a Oracle Database table to load data into from Hive table stored as a RCFile. The Oracle Database table should have the same structure as the Hive table structure.
CREATE TABLE OE.wlslog (time_stamp VARCHAR2(255), category VARCHAR2(255), type VARCHAR2(255), servername VARCHAR2(255), code VARCHAR2(255), msg VARCHAR2(255));
Run the preceding SQL command in SQL*Plus to create the Oracle Database table.
Creating a Hive Table Stored as RCFile
We shall be storing the following data file wlslog.txt as text in HDFS and subsequently as RCFile data for a Hive table.
Apr-8-2014-7:06:16-PM-PDT,Notice,WebLogicServer,AdminServer,BEA-000365,Server state changed to STANDBY
Apr-8-2014-7:06:17-PM-PDT,Notice,WebLogicServer,AdminServer,BEA-000365,Server state changed to STARTING
Apr-8-2014-7:06:18-PM-PDT,Notice,WebLogicServer,AdminServer,BEA-000365,Server state changed to ADMIN
Apr-8-2014-7:06:19-PM-PDT,Notice,WebLogicServer,AdminServer,BEA-000365,Server state changed to RESUMING
Apr-8-2014-7:06:20-PM-PDT,Notice,WebLogicServer,AdminServer,BEA-000331,Started WebLogic AdminServer
Apr-8-2014-7:06:21-PM-PDT,Notice,WebLogicServer,AdminServer,BEA-000365,Server state changed to RUNNING
Apr-8-2014-7:06:22-PM-PDT,Notice,WebLogicServer,AdminServer,BEA-000360,Server started in RUNNING mode
Create a directory in HDFS to put the wlslog.txt text file and set its permissions.
hdfs dfs -mkdir hdfs://localhost:8020/wlslog
hadoop dfs -chmod -R g+w hdfs://localhost:8020/wlslog
Put the wlslog.txt file into the HDFS /wlslog directory.
hdfs dfs -put wlslog.txt hdfs://localhost:8020/wlslog/wlslog.txt
Start the Hive server.
>hive --service hiveserver
Start the Hive shell.
>hive
Create a Hive external table weblogic_log stored as the text file wlslog.txt. The Hive table has the same columns as the Oracle Database table into which data is to be loaded.
hive>CREATE EXTERNAL TABLE weblogic_log(time_stamp STRING,
category STRING,
type STRING,
servername STRING,
code STRING,
msg STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' LINES TERMINATED BY '\n' STORED AS TEXTFILE LOCATION 'hdfs://localhost:8020/wlslog';
Run a SELECT query on the Hive table weblogic_log to select-list the Hive tabled data.
hive>SELECT * from weblogic_log;
Create another Hive table wlslog_RCFile stored as RCFile using the STORED AS RCFILE clause with same columns as the first Hive table.
CREATE TABLE wlslog_RCFile(time_stamp STRING,
category STRING,
type STRING,
servername STRING,
code STRING,
msg STRING) STORED AS RCFILE;
The Hive tables get created as indicated by the output in Hive shell.
Load the Hive table data from the Hive external table weblogic_log for the wlslog.txt file to the Hive table wlslog_RCFile stored as RCFile using the following command in Hive.
INSERT OVERWRITE table wlslog_RCFile SELECT * from weblogic_log;
As the output indicates the wlslog.txt file data gets loaded into the Hive table stored as RCFile.
RCFile Structure
We earlier introduced the RCFile data format being record columnar. In this section we shall illustrate the difference using the text file wlslog.txt and the RCFile data format file for the Hive table, both stored in HDFS.
After creating the Hive table stored as RCFile login to the NameNode web UI with the URL http://localhost:50070. Browse the file system and navigate to the wlslog.txt file in HDFS.
Click on the wlslog.txt file link to display the data stored in the file. The data format is row based with each record in a row.
For comparison, navigate to the wlslog_rcfile in the Hive storage directory /user/hive/warehouse.
Click on the wlslog_rcfile file to display the data file for the Hive table stored as RCFile. The data in RCFile is stored in record columnar format. In a record columnar format the values that would be stored in the same column are stored adjacently in the same row. For example, all the time_stamp values are stored adjacently and the msg values are stored adjacently. All the column values are stored adjacently in the same row and the metadata stores information about the offset for each column data.
But, when a SELECT query is run in Hive shell the data is not selected in a record columnar format. The data is select-listed with each record in a row.
Loading Hive Table Data into Oracle Database
Having loaded the text file data into RCFile based Hive table next we shall load data from the Hive table stored as RCFile into Oracle Database using Oracle Loader for Hadoop 3.0.0. Create a configuration file OraLoadConf.xml to specify the input format and the output format for OLH, the Hive database and table to load from, and the connection parameters to access Oracle Database including the table name. The OraLoadConf.xml file is listed.
<?xml version="1.0" encoding="UTF-8" ?>
<configuration>
<!-- Input settings -->
<property>
<name>mapreduce.inputformat.class</name>
<value>oracle.hadoop.loader.lib.input.HiveToAvroInputFormat</value>
</property>
<property>
<name>oracle.hadoop.loader.input.hive.databaseName</name>
<value>default</value>
</property>
<property>
<name>oracle.hadoop.loader.input.hive.tableName</name>
<value>wlslog_rcfile</value>
</property>
<!-- Output settings -->
<property>
<name>mapreduce.job.outputformat.class</name>
<value>oracle.hadoop.loader.lib.output.JDBCOutputFormat</value>
</property>
<property>
<name>mapreduce.output.fileoutputformat.outputdir</name>
<value>oraloadout</value>
</property>
<!-- Table information -->
<property>
<name>oracle.hadoop.loader.loaderMap.targetTable</name>
<value>OE.WLSLOG</value>
</property>
<!-- Connection information -->
<property>
<name>oracle.hadoop.loader.connection.url</name>
<value>jdbc:oracle:thin:@${HOST}:${TCPPORT}:${SID}</value>
</property>
<property>
<name>TCPPORT</name>
<value>1521</value>
</property>
<property>
<name>HOST</name>
<value>localhost</value>
</property>
<property>
<name>SID</name>
<value>ORCL</value>
</property>
<property>
<name>oracle.hadoop.loader.connection.user</name>
<value>OE</value>
</property>
<property>
<name>oracle.hadoop.loader.connection.password</name>
<value>OE</value>
</property>
</configuration>
Run the following hadoop command to run the Oracle Loader for Hadoop.
hadoop jar $OLH_HOME/jlib/oraloader.jar oracle.hadoop.loader.OraLoader -conf OraLoadConf.xml -libjars $OLH_HOME/jlib/oraloader.jar
The MapReduce job runs to load the Hive table data into a Oracle Database table.
A more detailed output from the hadoop command to run OLH is listed:
[root@localhost rcfile]# hadoop jar $OLH_HOME/jlib/oraloader.jar oracle.hadoop.loader.OraLoader -conf OraLoadConf.xml -libjars $OLH_HOME/jlib/oraloader.jar
Oracle Loader for Hadoop Release 3.0.0 - Production
Copyright (c) 2011, 2014, Oracle and/or its affiliates. All rights reserved.
14/09/13 18:33:34 INFO loader.OraLoader: Oracle Loader for Hadoop Release 3.0.0 - Production
Copyright (c) 2011, 2014, Oracle and/or its affiliates. All rights reserved.
14/09/13 18:33:56 INFO loader.OraLoader: oracle.hadoop.loader.loadByPartition is disabled because table: WLSLOG is not partitioned
14/09/13 18:33:56 INFO loader.OraLoader: oracle.hadoop.loader.enableSorting disabled, no sorting key provided
14/09/13 18:33:56 INFO loader.OraLoader: Reduce tasks set to 0 because of no partitioning or sorting. Loading will be done in the map phase.
14/09/13 18:33:56 INFO output.DBOutputFormat: Setting map tasks speculative execution to false for : oracle.hadoop.loader.lib.output.JDBCOutputFormat
14/09/13 18:34:04 INFO loader.OraLoader: Sampling time=0D:0h:0m:0s:686ms (686 ms)
14/09/13 18:34:04 INFO loader.OraLoader: Submitting OraLoader job OraLoader
14/09/13 18:34:08 INFO jvm.JvmMetrics: Initializing JVM Metrics with processName=JobTracker, sessionId=
14/09/13 18:34:20 INFO hive.metastore: Trying to connect to metastore with URI thrift://localhost:10000
14/09/13 18:34:20 INFO hive.metastore: Connected to metastore.
14/09/13 18:34:23 INFO mapred.FileInputFormat: Total input paths to process : 1
14/09/13 18:34:23 INFO mapreduce.JobSubmitter: number of splits:1
14/09/13 18:34:24 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_local823391793_0001
14/09/13 18:34:49 INFO mapreduce.Job: The url to track the job: http://localhost:8080/
14/09/13 18:34:49 INFO mapred.LocalJobRunner: OutputCommitter set in config null
14/09/13 18:34:51 INFO mapred.LocalJobRunner: OutputCommitter is oracle.hadoop.loader.lib.output.DBOutputCommitter
14/09/13 18:34:51 INFO mapred.LocalJobRunner: Waiting for map tasks
14/09/13 18:34:52 INFO mapred.LocalJobRunner: Starting task: attempt_local823391793_0001_m_000000_0
14/09/13 18:34:52 INFO loader.OraLoader: map 0% reduce 0%
14/09/13 18:34:54 INFO mapred.Task: Using ResourceCalculatorProcessTree : [ ]
14/09/13 18:34:54 INFO mapred.MapTask: Processing split: hdfs://10.0.2.15:8020/user/hive/warehouse/wlslog_rcfile/000000_0:0+783
14/09/13 18:34:59 INFO output.DBOutputFormat: conf prop: defaultExecuteBatch: 100
14/09/13 18:35:00 INFO output.DBOutputFormat: conf prop: loadByPartition: false
14/09/13 18:35:04 INFO output.DBOutputFormat: Insert statement: INSERT INTO "OE"."WLSLOG" ("TIME_STAMP", "CATEGORY", "TYPE", "SERVERNAME", "CODE", "MSG") VALUES (?, ?, ?, ?, ?, ?)
14/09/13 18:35:04 INFO mapred.LocalJobRunner:
14/09/13 18:35:06 INFO mapred.Task: Task:attempt_local823391793_0001_m_000000_0 is done. And is in the process of committing
14/09/13 18:35:06 INFO mapred.LocalJobRunner:
14/09/13 18:35:06 INFO mapred.Task: Task attempt_local823391793_0001_m_000000_0 is allowed to commit now
14/09/13 18:35:07 INFO output.JDBCOutputFormat: Committed work for task attempt attempt_local823391793_0001_m_000000_0
14/09/13 18:35:07 INFO output.FileOutputCommitter: Saved output of task 'attempt_local823391793_0001_m_000000_0' to hdfs://10.0.2.15:8020/user/root/oraloadout/_temporary/0/task_local823391793_0001_m_000000
14/09/13 18:35:07 INFO mapred.LocalJobRunner: map
14/09/13 18:35:07 INFO mapred.Task: Task 'attempt_local823391793_0001_m_000000_0' done.
14/09/13 18:35:07 INFO mapred.LocalJobRunner: Finishing task: attempt_local823391793_0001_m_000000_0
14/09/13 18:35:07 INFO mapred.LocalJobRunner: Map task executor complete.
14/09/13 18:35:07 INFO loader.OraLoader: map 100% reduce 0%
14/09/13 18:35:07 INFO loader.OraLoader: Job complete: OraLoader (job_local823391793_0001)
14/09/13 18:35:07 INFO loader.OraLoader: Counters: 23
File System Counters
FILE: Number of bytes read=10412642
FILE: Number of bytes written=11375302
FILE: Number of read operations=0
FILE: Number of large read operations=0
FILE: Number of write operations=0
HDFS: Number of bytes read=10423979
HDFS: Number of bytes written=9769986
HDFS: Number of read operations=239
HDFS: Number of large read operations=0
HDFS: Number of write operations=36
Map-Reduce Framework
Map input records=7
Map output records=7
Input split bytes=1133
Spilled Records=0
Failed Shuffles=0
Merged Map outputs=0
GC time elapsed (ms)=200
CPU time spent (ms)=0
Physical memory (bytes) snapshot=0
Virtual memory (bytes) snapshot=0
Total committed heap usage (bytes)=19501056
File Input Format Counters
Bytes Read=0
File Output Format Counters
Bytes Written=1624
[root@localhost rcfile]#
Subsequently run a SELECT statement in SQL*Plus to select-list the loaded data.
ORC (Optimized Row Columnar) File format is a more efficient row columnar format than the RCFile format; takes less storage space and access time than RCFile. Support to store a Hive table as ORC was added in Hive 0.11. Replace STORED AS RCFile with STORED AS ORC to store a Hive table as ORC format.
In this tutorial we loaded data from a Hive table stored as RCFile into Oracle Database table using Oracle Loader for Hadoop 3.0.0.
대용량으로 밀어 넣는 경우 외에 쓸일이 있을까 싶긴하다.
'Biz > Etc' 카테고리의 다른 글
a loading into HDFS - Part1 (0) | 2016.05.18 |
---|---|
배치로 Hive 로 보내기 (0) | 2016.04.03 |
Bringing ORC Support into Apache Spark (0) | 2016.03.25 |
A Lap Around Apache Spark on HDP (0) | 2016.03.25 |
Learning the Ropes of the Hortonworks Sandbox (0) | 2016.03.25 |
댓글