Tutorial: optimal binning sketch with binary target using PySpark
In this example, we use PySpark's mapPartitions function to compute the optimal binning of a single variable from a large dataset in a distributed fashion. The dataset is split into 4 partitions.
from pyspark.sql import SparkSession

# Create (or retrieve) the Spark session and enable Apache Arrow
# to speed up conversions between Spark and pandas
spark = SparkSession.builder.getOrCreate()
spark.conf.set("spark.sql.execution.arrow.enabled", "true")

df = spark.read.csv("data/kaggle/HomeCreditDefaultRisk/application_train.csv",
                    sep=",", header=True, inferSchema=True)
# Repartition the dataset to distribute the computation
n_partitions = 4
df = df.repartition(n_partitions)
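To verify the layout, we can check the partition count of the underlying RDD:

# Returns 4: the partition count set above
df.rdd.getNumPartitions()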
We prepare the MapReduce structure: the add function builds an OptimalBinningSketch from the rows of a single partition, and the merge function combines two sketches into one.
import pandas as pd

from optbinning import OptimalBinningSketch

variable = "EXT_SOURCE_3"
target = "TARGET"
columns = [variable, target]

def add(partition):
    # Convert the partition's rows to a pandas DataFrame
    df_pandas = pd.DataFrame.from_records(partition, columns=columns)
    x = df_pandas[variable]
    y = df_pandas[target]

    # Build a sketch from this partition's data
    optbsketch = OptimalBinningSketch(eps=0.001)
    optbsketch.add(x, y)

    return [optbsketch]

def merge(optbsketch, other_optbsketch):
    # Merge the second sketch into the first (in place) and return it
    optbsketch.merge(other_optbsketch)
    return optbsketch
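To see how these two pieces compose, here is a minimal local sketch of the same MapReduce pattern, without Spark; df_pandas_full is a hypothetical pandas DataFrame holding the two columns above:

# Hypothetical local data split into two "partitions"
half = len(df_pandas_full) // 2
records_a = list(df_pandas_full.iloc[:half].itertuples(index=False))
records_b = list(df_pandas_full.iloc[half:].itertuples(index=False))

sketch_a = add(records_a)[0]
sketch_b = add(records_b)[0]

# Equivalent, up to sketch accuracy, to sketching all rows at once
combined = merge(sketch_a, sketch_b)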
Finally, with the required columns selected, we use mapPartitions and the treeReduce method to aggregate the OptimalBinningSketch instances of all partitions.
optbsketch = (df.select(columns).rdd
              .mapPartitions(add)
              .treeReduce(merge))
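The aggregated sketch lives on the driver, so the optimal binning problem can now be solved locally. As a follow-up, assuming OptimalBinningSketch exposes the same solve and binning_table interface as OptimalBinning:

# Solve the optimal binning problem on the aggregated sketch
optbsketch.solve()

# Inspect the resulting binning table
optbsketch.binning_table.build()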