
Using Spark from RStudio

by 조병희 2016. 3. 26.

To run Spark in local mode, first grab Spark itself:

http://spark.apache.org

or build it fresh from source:
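The exact Maven command depends on which profiles you enable; as a rough sketch (illustrative flags, not necessarily the ones behind the log below), something like this from the Spark source root:

mvn -DskipTests clean package

which eventually finishes with a reactor summary: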

[INFO] Reactor Summary:
[INFO]
[INFO] Spark Project Parent POM ........................... SUCCESS [ 13.980 s]
[INFO] Spark Project Test Tags ............................ SUCCESS [01:04 min]
[INFO] Spark Project Sketch ............................... SUCCESS [ 20.141 s]
[INFO] Spark Project Networking ........................... SUCCESS [ 27.022 s]
[INFO] Spark Project Shuffle Streaming Service ............ SUCCESS [ 21.725 s]
[INFO] Spark Project Unsafe ............................... SUCCESS [ 35.521 s]
[INFO] Spark Project Launcher ............................. SUCCESS [ 26.792 s]
[INFO] Spark Project Core ................................. SUCCESS [08:13 min]
[INFO] Spark Project GraphX ............................... SUCCESS [01:25 min]
[INFO] Spark Project Streaming ............................ SUCCESS [02:52 min]
[INFO] Spark Project Catalyst ............................. SUCCESS [06:12 min]
[INFO] Spark Project SQL .................................. SUCCESS [06:28 min]
[INFO] Spark Project ML Library ........................... SUCCESS [06:23 min]
[INFO] Spark Project Tools ................................ SUCCESS [ 21.218 s]
[INFO] Spark Project Hive ................................. SUCCESS [04:34 min]
[INFO] Spark Project Docker Integration Tests ............. SUCCESS [ 25.573 s]
[INFO] Spark Project REPL ................................. SUCCESS [ 37.997 s]
[INFO] Spark Project Assembly ............................. SUCCESS [ 43.001 s]
[INFO] Spark Project External Flume Sink .................. SUCCESS [ 34.540 s]
[INFO] Spark Project External Flume ....................... SUCCESS [ 51.423 s]
[INFO] Spark Project External Flume Assembly .............. SUCCESS [ 5.614 s]
[INFO] Spark Project External Kafka ....................... SUCCESS [01:07 min]
[INFO] Spark Project Examples ............................. SUCCESS [02:21 min]
[INFO] Spark Project External Kafka Assembly .............. SUCCESS [ 7.941 s]
[INFO] ------------------------------------------------------------------------
[INFO] BUILD SUCCESS
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 46:56 min
[INFO] Finished at: 2016-03-26T20:05:46+09:00
[INFO] Final Memory: 107M/1622M
[INFO] ------------------------------------------------------------------------

It takes quite a while. Once the build finishes, spark-shell comes up like this:

SQL context available as sqlContext.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/ '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.0.0-SNAPSHOT
      /_/

Using Scala version 2.11.7 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_65)
Type in expressions to have them evaluated.
Type :help for more information.

scala> sc.version
res0: String = 2.0.0-SNAPSHOT

Rumor had it that Spark 1.6 was just a test stage on the way to 2.0, and sure enough the build comes out as 2.0.0-SNAPSHOT, so 2.0 can't be far off. Anyway, it saves a lot of grief to set the paths for both R and Spark in your environment variables. Also, if you got the source from git like I did, library(SparkR) in R may complain that the package doesn't exist; running install-dev in C:\app\spark\R generates the R package (output below).
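As for the environment variables, a minimal Windows sketch (SPARK_HOME matches the path used in this post; setx only takes effect in newly opened shells):

setx SPARK_HOME C:\app\spark
:: then add %SPARK_HOME%\bin and your R bin directory to PATH,
:: e.g. via Control Panel > System > Environment Variables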

C:\app\spark\R>install-dev.bat
* installing *source* package 'SparkR' ...
** R
** inst
** preparing package for lazy loading
Creating a new generic function for 'colnames' in package 'SparkR'
Creating a new generic function for 'colnames<-' in package 'SparkR'
Creating a new generic function for 'cov' in package 'SparkR'
Creating a new generic function for 'drop' in package 'SparkR'
Creating a new generic function for 'na.omit' in package 'SparkR'
Creating a new generic function for 'filter' in package 'SparkR'
Creating a new generic function for 'intersect' in package 'SparkR'
Creating a new generic function for 'sample' in package 'SparkR'
Creating a new generic function for 'transform' in package 'SparkR'
Creating a new generic function for 'subset' in package 'SparkR'
Creating a new generic function for 'summary' in package 'SparkR'
Creating a new generic function for 'lag' in package 'SparkR'
Creating a new generic function for 'rank' in package 'SparkR'
Creating a new generic function for 'sd' in package 'SparkR'
Creating a new generic function for 'var' in package 'SparkR'
Creating a new generic function for 'predict' in package 'SparkR'
Creating a new generic function for 'rbind' in package 'SparkR'
Creating a generic function for 'alias' from package 'stats' in package 'SparkR'
Creating a generic function for 'substr' from package 'base' in package 'SparkR'
Creating a generic function for '%in%' from package 'base' in package 'SparkR'
Creating a generic function for 'mean' from package 'base' in package 'SparkR'
Creating a generic function for 'lapply' from package 'base' in package 'SparkR'
Creating a generic function for 'Filter' from package 'base' in package 'SparkR'
Creating a generic function for 'unique' from package 'base' in package 'SparkR'
Creating a generic function for 'nrow' from package 'base' in package 'SparkR'
Creating a generic function for 'ncol' from package 'base' in package 'SparkR'
Creating a generic function for 'head' from package 'utils' in package 'SparkR'
Creating a generic function for 'factorial' from package 'base' in package 'SparkR'
Creating a generic function for 'atan2' from package 'base' in package 'SparkR'
Creating a generic function for 'ifelse' from package 'base' in package 'SparkR'
** help
No man pages found in package 'SparkR'
*** installing help indices
** building package indices
** testing if installed package can be loaded
*** arch - i386
*** arch - x64
* DONE (SparkR)

C:\app\spark\R>dir lib
 Volume in drive C has no label.
 Volume Serial Number is C68F-889D

 Directory of C:\app\spark\R\lib

2016-03-26  10:18 PM    <DIR>          .
2016-03-26  10:18 PM    <DIR>          ..
2016-03-26  10:18 PM    <DIR>          SparkR
2016-03-26  10:18 PM           590,033 sparkr.zip
               1 File(s)        590,033 bytes
               3 Dir(s)  105,600,221,184 bytes free

C:\app\spark\R>

Now start RStudio and try the following.

> Sys.setenv(SPARK_HOME = "C:/app/spark")
> .libPaths(c(file.path(Sys.getenv("SPARK_HOME"), "R", "lib"), .libPaths()))
> library(SparkR)

Attaching package: 'SparkR'

The following objects are masked from 'package:stats':

    cov, filter, lag, na.omit, predict, sd, var

The following objects are masked from 'package:base':

    colnames, colnames<-, drop, intersect, rank, rbind, sample, subset, summary, transform

 

> sc <- sparkR.init(master = "local")
Launching java with spark-submit command C:/app/spark/bin/spark-submit.cmd   sparkr-shell C:\Users\bhjo0\AppData\Local\Temp\RtmpA7HkfH\backend_port23c82ced221c
> sqlContext <- sparkRSQL.init(sc)
> sc
Java ref type org.apache.spark.api.java.JavaSparkContext id 0
> sqlContext
Java ref type org.apache.spark.sql.SQLContext id 1

 

> DF <- createDataFrame(sqlContext, faithful)
> head(DF)
  eruptions waiting
1     3.600      79
2     1.800      54
3     3.333      74
4     2.283      62
5     4.533      85
6     2.883      55

 

> localDF <- data.frame(name=c("John", "Smith", "Sarah"), age=c(19, 23, 18))
> df <- createDataFrame(sqlContext, localDF)
> printSchema(df)
root
 |-- name: string (nullable = true)
 |-- age: double (nullable = true)

 

> path <- file.path(Sys.getenv("SPARK_HOME"), "examples/src/main/resources/people.json")
> peopleDF <- jsonFile(sqlContext, path)
Warning message:
'jsonFile' is deprecated.
Use 'read.json' instead.
See help("Deprecated")
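Per the warning, the non-deprecated call would presumably be (assuming the read.json of this snapshot still took the sqlContext as its first argument, as in the 1.6-era API):

> peopleDF <- read.json(sqlContext, path)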

 

> printSchema(peopleDF)
root
 |-- age: long (nullable = true)
 |-- name: string (nullable = true)

> registerTempTable(peopleDF, "people")
> teenagers <- sql(sqlContext, "SELECT name FROM people WHERE age >= 13 AND age <= 19")
> teenagersLocalDF <- collect(teenagers)
> print(teenagersLocalDF)
    name
1 Justin

> sparkR.stop()

 

If library(SparkR) throws an error, either the path isn't set or the build didn't finish. Beyond that, if you want to run against a Spark installed in a Hadoop cluster rather than locally, just pass the appropriate settings to sparkR.init.
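For instance, a minimal sketch for attaching to a YARN cluster instead (assumptions: HADOOP_CONF_DIR must point at your cluster configuration, the path below is hypothetical, and "yarn-client" is the master string SparkR accepted in this era):

> Sys.setenv(HADOOP_CONF_DIR = "C:/app/hadoop/conf")  # hypothetical config path
> sc <- sparkR.init(master = "yarn-client")           # instead of "local"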

 

 

If you're going to keep using SparkR, you can also just install the generated package:

> install.packages("C:/app/spark/R/lib/sparkr.zip", repos = NULL, type = "win.binary")
> library(SparkR)
