ref: https://github.com/lyhue1991/eat_pyspark_in_10_days
1.Install Java
Java 1.8 (Java 8) may already be installed. It works with older versions of Scala and Spark, but for modern Spark (3.x) and Scala (2.13 / 3.x), Java 17 or later is recommended.
sudo dnf install java-17-openjdk-devel -y
sudo alternatives --config java
There are 2 programs which provide 'java'.
Selection Command
-----------------------------------------------
*+ 1 /usr/java/jdk1.8.0_202-amd64/jre/bin/java
2 java-17-openjdk.x86_64 (/usr/lib/jvm/java-17-openjdk-17.0.17.0.10-1.el8.x86_64/bin/java)
Enter to keep the current selection[+], or type selection number: 2
sudo alternatives --config javac
There are 2 programs which provide 'javac'.
Selection Command
-----------------------------------------------
*+ 1 /usr/java/jdk1.8.0_202-amd64/bin/javac
2 java-17-openjdk.x86_64 (/usr/lib/jvm/java-17-openjdk-17.0.17.0.10-1.el8.x86_64/bin/javac)
Enter to keep the current selection[+], or type selection number: 2
export JAVA_HOME=/usr/lib/jvm/java-17-openjdk-17.0.17.0.10-1.el8.x86_64
After installation, verify by running java -version on the command line.
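Once a Python 3 interpreter is available (step 2), you can also confirm from within Python which JVM will be picked up; a minimal check using only the standard library (assumes java is on PATH and JAVA_HOME is exported as above):
import os, subprocess
print("JAVA_HOME =", os.environ.get("JAVA_HOME"))  # should point at the Java 17 install
subprocess.run(["java", "-version"], check=True)   # prints the active JVM version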
2.Install Python
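Later steps assume a system Python 3 at /usr/bin/python3.11. A minimal sanity check, run with the interpreter you plan to use (Python 3.8+ is a safe baseline for Spark 3.4):
import sys
print(sys.executable, sys.version)
assert sys.version_info >= (3, 8), "use a recent Python 3 for PySpark"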
3.Install Spark
Download from the Spark website: http://spark.apache.org/downloads.html
After downloading, extract it into your usual software installation path, e.g.:
/opt/spark/spark-3.4.4-bin-hadoop3
Add the following to your shell profile (e.g. ~/.bashrc):
export PYTHON_TOP_DIR=/usr/local/bin
export SPARK_HOME=/opt/spark/spark-3.4.4-bin-hadoop3
export PYSPARK_PYTHON=/usr/bin/python3.11
export PYSPARK_DRIVER_PYTHON=$PYSPARK_PYTHON
export PATH=$SPARK_HOME/bin:$PYTHON_TOP_DIR:$PATH
spark-submit --version
spark-shell --version
4.Install findspark
pip install findspark
pip list
Package Version
---------- -------
findspark 2.0.1
pip 23.2.1
setuptools 68.2.2
After a successful installation you can run the following code:
import findspark

# Point spark_home at the extraction path from above and python_path at the interpreter
spark_home = "/opt/spark/spark-3.4.4-bin-hadoop3"
python_path = "/usr/bin/python3.11"
findspark.init(spark_home, python_path)

import pyspark
from pyspark import SparkContext, SparkConf

conf = SparkConf().setAppName("test").setMaster("local[4]")
sc = SparkContext(conf=conf)
print("spark version:", pyspark.__version__)
rdd = sc.parallelize(["hello", "spark"])
print(rdd.reduce(lambda x, y: x + ' ' + y))
spark version: 3.4.4
hello spark
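The example above uses the lower-level SparkContext/RDD API. As a rough sketch, the same smoke test can also be written against the SparkSession/DataFrame API that most Spark 3.x code uses (paths as configured above; run it in a fresh session):
import findspark
findspark.init("/opt/spark/spark-3.4.4-bin-hadoop3", "/usr/bin/python3.11")

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("test").master("local[4]").getOrCreate()
print("spark version:", spark.version)
df = spark.createDataFrame([("hello",), ("spark",)], ["word"])
df.show()
spark.stop()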
5.Install Scala
sudo dnf install scala -y
scala -version
Scala code runner version 2.10.6 -- Copyright 2002-2013, LAMP/EPFL
Note: the Scala packaged by dnf here is 2.10, which is much older than the 2.12/2.13 that Spark 3.x is built against; it is only needed if you intend to write Spark code in Scala, in which case a matching 2.12/2.13 release from https://www.scala-lang.org is preferable.
6.Install the Oracle JDBC driver
Go to the official Oracle download page:
https://www.oracle.com/database/technologies/appdev/jdbc-downloads.html
Choose the ojdbc8.jar (for Java 8–17) or ojdbc11.jar (for Java 11+).
Accept the license agreement and download the .jar file.
7.Install sqlplus
Download: https://www.oracle.com/database/technologies/instant-client/linux-x86-64-downloads.html
- Instant Client Basic -> oracle-instantclient19.29-basic-19.29.0.0.0-1.x86_64.rpm
- SQL*Plus -> oracle-instantclient19.29-sqlplus-19.29.0.0.0-1.x86_64.rpm
Test the connection:
sqlplus spark/spark@localhost:1521/HAMSTER
Create a table
CREATE TABLE EMPLOYEES (
EMP_ID NUMBER GENERATED BY DEFAULT AS IDENTITY PRIMARY KEY,
NAME VARCHAR2(50),
DEPARTMENT VARCHAR2(30),
SALARY NUMBER(10,2),
HIRE_DATE DATE
);
INSERT INTO EMPLOYEES (NAME, DEPARTMENT, SALARY, HIRE_DATE) VALUES ('Alice', 'Engineering', 95000, TO_DATE('2022-01-15', 'YYYY-MM-DD'));
INSERT INTO EMPLOYEES (NAME, DEPARTMENT, SALARY, HIRE_DATE) VALUES ('Bob', 'Sales', 70000, TO_DATE('2021-06-20', 'YYYY-MM-DD'));
INSERT INTO EMPLOYEES (NAME, DEPARTMENT, SALARY, HIRE_DATE) VALUES ('Charlie', 'HR', 60000, TO_DATE('2020-03-10', 'YYYY-MM-DD'));
INSERT INTO EMPLOYEES (NAME, DEPARTMENT, SALARY, HIRE_DATE) VALUES ('Diana', 'Engineering', 105000, TO_DATE('2019-11-05', 'YYYY-MM-DD'));
COMMIT;
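With the driver from step 6 and this table in place, a minimal sketch of reading it from PySpark over JDBC looks like the following (the jar location /opt/oracle/ojdbc8.jar is an assumed path; host, port, service name and credentials are the ones used with sqlplus above):
import findspark
findspark.init("/opt/spark/spark-3.4.4-bin-hadoop3", "/usr/bin/python3.11")

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("oracle-jdbc-test")
         .master("local[4]")
         .config("spark.jars", "/opt/oracle/ojdbc8.jar")  # assumed location of the downloaded ojdbc8.jar
         .getOrCreate())

df = (spark.read.format("jdbc")
      .option("url", "jdbc:oracle:thin:@//localhost:1521/HAMSTER")
      .option("dbtable", "EMPLOYEES")
      .option("user", "spark")
      .option("password", "spark")
      .option("driver", "oracle.jdbc.OracleDriver")
      .load())

df.show()
spark.stop()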
8.Ways to run PySpark
- Enter the single-machine interactive PySpark environment with the pyspark command.
  This is generally used to test code.
  You can also use jupyter or ipython as the interactive environment.
- Submit a Spark job to a cluster with spark-submit.
  This lets you submit a Python script or a JAR so that hundreds or thousands of machines run the task, and it is how Spark is normally used in industrial production (a minimal submittable script is sketched after this list).
- Run interactively in a Zeppelin notebook.
  Zeppelin is the Apache counterpart of Jupyter Notebook.
- Install the findspark and pyspark libraries in Python.
  Then pyspark can be called like an ordinary library from Jupyter or any other Python environment.
  This is also how the referenced book sets up its PySpark practice environment.
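As referenced in the spark-submit item above, a minimal sketch of a submittable script (the file name wordcount.py is just an example) that could be run with, e.g., spark-submit --master local[4] wordcount.py:
from pyspark.sql import SparkSession

if __name__ == "__main__":
    # spark-submit sets up the environment, so findspark is not needed here
    spark = SparkSession.builder.appName("wordcount").getOrCreate()
    sc = spark.sparkContext
    lines = sc.parallelize(["hello spark", "hello pyspark"])
    counts = (lines.flatMap(lambda line: line.split())
                   .map(lambda w: (w, 1))
                   .reduceByKey(lambda a, b: a + b))
    print(counts.collect())
    spark.stop()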