
Setting Up a Spark Development Environment

ref: https://github.com/lyhue1991/eat_pyspark_in_10_days

1. Install Java 8

Java 1.8 (Java 8) may already be installed. It works for older versions of Scala and Spark, but for modern Spark (3.x) and Scala (2.13 / 3.x) it is recommended to use Java 17 or later, so install OpenJDK 17 and switch the system default:

sudo dnf install java-17-openjdk-devel -y

sudo alternatives --config java

There are 2 programs which provide 'java'.

  Selection    Command
-----------------------------------------------
*+ 1           /usr/java/jdk1.8.0_202-amd64/jre/bin/java
   2           java-17-openjdk.x86_64 (/usr/lib/jvm/java-17-openjdk-17.0.17.0.10-1.el8.x86_64/bin/java)

Enter to keep the current selection[+], or type selection number: 2

sudo alternatives --config javac

There are 2 programs which provide 'javac'.

  Selection    Command
-----------------------------------------------
*+ 1           /usr/java/jdk1.8.0_202-amd64/bin/javac
   2           java-17-openjdk.x86_64 (/usr/lib/jvm/java-17-openjdk-17.0.17.0.10-1.el8.x86_64/bin/javac)

Enter to keep the current selection[+], or type selection number: 2

export JAVA_HOME=/usr/lib/jvm/java-17-openjdk-17.0.17.0.10-1.el8.x86_64

After the installation succeeds, verify it by running java -version on the command line.

2. Install Python
3. Install Spark

Download Spark from the official site: http://spark.apache.org/downloads.html

After downloading, extract the archive into your usual software installation directory, e.g.:

/opt/spark/spark-3.4.4-bin-hadoop3

export PYTHON_TOP_DIR=/usr/local/bin
export SPARK_HOME=/opt/spark/spark-3.4.4-bin-hadoop3
export PYSPARK_PYTHON=/usr/bin/python3.11
export PYSPARK_DRIVER_PYTHON=/usr/bin/python3.11
export PATH=$SPARK_HOME/bin:$PYTHON_TOP_DIR:$PATH
spark-submit --version
spark-shell --version
4. Install findspark
pip install findspark

pip list
Package    Version
---------- -------
findspark  2.0.1
pip        23.2.1
setuptools 68.2.2

Once installed, you can run the following code:

import findspark

# Point spark_home at the extracted Spark directory and python_path at the Python interpreter
spark_home = "/opt/spark/spark-3.4.4-bin-hadoop3"
python_path = "/usr/bin/python3.11"
findspark.init(spark_home, python_path)

import pyspark
from pyspark import SparkContext, SparkConf

conf = SparkConf().setAppName("test").setMaster("local[4]")
sc = SparkContext(conf=conf)

print("spark version:", pyspark.__version__)
rdd = sc.parallelize(["hello", "spark"])
print(rdd.reduce(lambda x, y: x + ' ' + y))


spark version: 3.4.4
hello spark
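
findspark works the same way with the newer SparkSession entry point (the unified entry point since Spark 2.x), which the JDBC examples in the later sections also use. A minimal sketch using the paths configured above:

import findspark
findspark.init("/opt/spark/spark-3.4.4-bin-hadoop3", "/usr/bin/python3.11")

from pyspark.sql import SparkSession

# SparkSession wraps SparkContext and is the entry point for the DataFrame API
spark = SparkSession.builder.appName("test").master("local[4]").getOrCreate()

df = spark.createDataFrame([("Alice", 1), ("Bob", 2)], ["name", "id"])
df.show()

spark.stop()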
5. Install Scala
sudo dnf install scala -y
scala -version
Scala code runner version 2.10.6 -- Copyright 2002-2013, LAMP/EPFL

Note that the distribution-packaged Scala is quite old (2.10.6); Spark 3.4.x ships with its own Scala runtime (2.12 by default), so this standalone install mainly matters if you want to write and compile your own Scala code.
6. Install the JDBC driver for Oracle

Go to the official Oracle download page:
https://www.oracle.com/database/technologies/appdev/jdbc-downloads.html

Choose the ojdbc8.jar (for Java 8–17) or ojdbc11.jar (for Java 11+).

Accept the license agreement and download the .jar file.
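
For Spark to use the driver, the jar must be on the driver and executor classpath, e.g. via the spark.jars config (or the equivalent --jars option of spark-submit). A sketch, assuming the jar was saved to /opt/oracle/ojdbc8.jar (an illustrative path, not from the download page):

from pyspark.sql import SparkSession

# Register the Oracle JDBC driver jar with Spark; adjust the path to wherever ojdbc8.jar was saved
spark = (SparkSession.builder
         .appName("oracle-jdbc")
         .master("local[4]")
         .config("spark.jars", "/opt/oracle/ojdbc8.jar")  # assumed location
         .getOrCreate())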

7. Install sqlplus

Download the Oracle Instant Client RPMs from: https://www.oracle.com/database/technologies/instant-client/linux-x86-64-downloads.html

  • Instant Client (basic) -> oracle-instantclient19.29-basic-19.29.0.0.0-1.x86_64.rpm
  • sqlplus -> oracle-instantclient19.29-sqlplus-19.29.0.0.0-1.x86_64.rpm

Testing connection:

sqlplus spark/spark@localhost:1521/HAMSTER

Create a table

CREATE TABLE EMPLOYEES (
    EMP_ID NUMBER GENERATED BY DEFAULT AS IDENTITY PRIMARY KEY,
    NAME VARCHAR2(50),
    DEPARTMENT VARCHAR2(30),
    SALARY NUMBER(10,2),
    HIRE_DATE DATE
);

INSERT INTO EMPLOYEES (NAME, DEPARTMENT, SALARY, HIRE_DATE) VALUES ('Alice', 'Engineering', 95000, TO_DATE('2022-01-15', 'YYYY-MM-DD'));
INSERT INTO EMPLOYEES (NAME, DEPARTMENT, SALARY, HIRE_DATE) VALUES ('Bob', 'Sales', 70000, TO_DATE('2021-06-20', 'YYYY-MM-DD'));
INSERT INTO EMPLOYEES (NAME, DEPARTMENT, SALARY, HIRE_DATE) VALUES ('Charlie', 'HR', 60000, TO_DATE('2020-03-10', 'YYYY-MM-DD'));
INSERT INTO EMPLOYEES (NAME, DEPARTMENT, SALARY, HIRE_DATE) VALUES ('Diana', 'Engineering', 105000, TO_DATE('2019-11-05', 'YYYY-MM-DD'));
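
Once the table is populated, the same data can be read back from PySpark over JDBC. A sketch assuming the spark/spark credentials and HAMSTER service name used above, plus the assumed ojdbc8.jar location from the step-6 sketch:

from pyspark.sql import SparkSession

# Create (or reuse) a session that has the Oracle JDBC driver jar on its classpath
spark = (SparkSession.builder
         .appName("oracle-read")
         .master("local[4]")
         .config("spark.jars", "/opt/oracle/ojdbc8.jar")  # assumed location, see step 6
         .getOrCreate())

# Read the EMPLOYEES table into a DataFrame via JDBC
df = (spark.read.format("jdbc")
      .option("url", "jdbc:oracle:thin:@//localhost:1521/HAMSTER")
      .option("dbtable", "EMPLOYEES")
      .option("user", "spark")
      .option("password", "spark")
      .option("driver", "oracle.jdbc.OracleDriver")
      .load())

df.show()
df.printSchema()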

8. Various ways to run PySpark
  • Run pyspark to enter the local interactive PySpark shell.
    This mode is generally used for testing code.
    You can also set Jupyter or IPython as the interactive environment.

  • Submit Spark jobs to a cluster with spark-submit (see the sketch after this list).
    This way a Python script or a JAR can be submitted to a cluster and run across hundreds or thousands of machines.
    It is also how Spark is typically used in industrial production.

  • Execute code interactively in a Zeppelin notebook.
    Zeppelin is the Apache counterpart of Jupyter Notebook.

  • Install the findspark and pyspark libraries in Python.
    pyspark can then be imported like an ordinary library from Jupyter or any other Python environment.
    This is also how the referenced book sets up its PySpark practice environment.
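
For the spark-submit route mentioned in the list above, the job is an ordinary Python script. Below is a minimal sketch (the filename wordcount.py is illustrative); it would be launched with spark-submit wordcount.py, adding --master and resource options when targeting a real cluster.

from pyspark.sql import SparkSession

# A minimal script meant to be launched with spark-submit rather than run interactively;
# spark-submit supplies the master, deploy mode and resource configuration.
if __name__ == "__main__":
    spark = SparkSession.builder.appName("wordcount").getOrCreate()
    sc = spark.sparkContext

    counts = (sc.parallelize(["hello spark", "hello world"])
                .flatMap(lambda line: line.split())
                .map(lambda word: (word, 1))
                .reduceByKey(lambda a, b: a + b))

    print(counts.collect())
    spark.stop()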
