{"id":43,"date":"2025-04-16T05:29:32","date_gmt":"2025-04-16T05:29:32","guid":{"rendered":"https:\/\/blogs.stei.itb.ac.id\/waskita\/?p=43"},"modified":"2025-04-16T05:41:27","modified_gmt":"2025-04-16T05:41:27","slug":"apache-spark-vs-pandas","status":"publish","type":"post","link":"https:\/\/blogs.stei.itb.ac.id\/waskita\/2025\/04\/16\/apache-spark-vs-pandas\/","title":{"rendered":"Apache Spark vs Pandas"},"content":{"rendered":"\n<figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"532\" src=\"https:\/\/blogs.stei.itb.ac.id\/waskita\/wp-content\/uploads\/sites\/2\/2025\/04\/Apache_Spark_logo.svg_.png\" alt=\"Apache Spark logo\" class=\"wp-image-44\"\/><\/figure>\n\n\n\n<p>Apache Spark&#8217;s PySpark API allows you to write Spark applications in Python, enabling parallel data processing across a cluster. Here&#8217;s a simple example of reading data, transforming it, and writing the result, compared to a traditional Python Pandas approach:<\/p>\n\n\n\n<p>PySpark Example:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>from pyspark.sql import SparkSession\nimport pyspark.sql.functions as F\n\n# Create a SparkSession\nspark = SparkSession.builder.appName(\"PySpark Example\").getOrCreate()\n\n# Read data from a CSV file (replace with your data source)\ndf = spark.read.csv(\"path\/to\/your\/data.csv\", header=True, inferSchema=True)\n\n# Transform the data (e.g., add a new column)\ndf = df.withColumn(\"processed_column\", F.col(\"existing_column\") * 2)\n\n# Display the transformed data (optional)\ndf.show()\n\n# Write the transformed data (note: Spark writes a directory of part files, not a single CSV)\ndf.write.csv(\"path\/to\/output\/data.csv\", header=True, mode=\"overwrite\")\n\n# Stop the SparkSession\nspark.stop()<\/code><\/pre>\n\n\n\n<p>Traditional Pandas Example:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>import pandas as pd\n\n# Read data from a CSV file (replace with your data source)\ndf = pd.read_csv(\"path\/to\/your\/data.csv\")\n\n# Transform the data (e.g., 
add a new column)\ndf&#91;\"processed_column\"] = df&#91;\"existing_column\"] * 2\n\n# Display the transformed data (optional)\nprint(df)\n\n# Write the transformed data to a new CSV file\ndf.to_csv(\"path\/to\/output\/data.csv\", index=False)<\/code><\/pre>\n\n\n\n<p>Key Differences:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Distributed Processing:<\/strong> PySpark processes large datasets across multiple nodes in a cluster, while Pandas processes data on a single machine.<\/li>\n\n\n\n<li><strong>DataFrames:<\/strong> Both PySpark and Pandas represent data as DataFrames, a tabular data structure. However, PySpark DataFrames are distributed and optimized for parallel processing, while Pandas DataFrames are in-memory and largely single-threaded.<\/li>\n\n\n\n<li><strong>Libraries:<\/strong> PySpark transformations use functions from <code>pyspark.sql.functions<\/code>, while Pandas relies on vectorized column operations and methods such as <code>df.apply()<\/code>.<\/li>\n\n\n\n<li><strong>Scalability:<\/strong> PySpark is designed for large datasets and complex transformations at scale, while Pandas can become slow or run out of memory on very large datasets.<\/li>\n<\/ul>\n\n\n\n<p>In essence, PySpark extends Python&#8217;s data manipulation capabilities to the distributed environment of Apache Spark, enabling efficient and scalable data processing for big data tasks.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">References<\/h2>\n\n\n\n<ul class=\"wp-block-list\">\n<li><a href=\"https:\/\/medium.com\/data-science\/examples-of-using-apache-spark-with-pyspark-using-python-f36410457012\">Examples of Using Apache Spark with PySpark Using Python<\/a><\/li>\n\n\n\n<li><a href=\"https:\/\/medium.com\/@eshant.sah\/spark-vs-other-big-data-tools-why-spark-reigns-supreme-part-1-165a19096ef1\">Why Spark Reigns Supreme<\/a><\/li>\n\n\n\n<li><a href=\"https:\/\/www.instaclustr.com\/education\/8-amazing-apache-spark-use-cases-with-code-examples\/\">8 Amazing Apache Spark Use Cases with Code Examples<\/a><\/li>\n\n\n\n<li><a href=\"https:\/\/bootcampai.medium.com\/best-practices-for-optimizing-data-processing-at-scale-with-apache-spark-7cb046939ae0\">Best Practices for Optimizing Data Processing at Scale with Apache Spark<\/a><\/li>\n\n\n\n<li><a href=\"https:\/\/www.percona.com\/blog\/apache-spark-makes-slow-mysql-queries-10x-faster\/\">How Apache Spark Makes Slow MySQL Queries 10x Faster<\/a><\/li>\n<\/ul>\n","protected":false},"excerpt":{"rendered":"<p>Apache Spark&#8217;s PySpark API allows you to write Spark applications in Python, enabling parallel data processing across a cluster. Here&#8217;s a simple example of reading data, transforming it, and writing the result, compared to a traditional Python Pandas approach: PySpark Example: Traditional Pandas Example: Key Differences: In essence, PySpark extends Python&#8217;s data manipulation capabilities to 
[&hellip;]<\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[1],"tags":[],"class_list":["post-43","post","type-post","status-publish","format-standard","hentry","category-uncategorized"],"_links":{"self":[{"href":"https:\/\/blogs.stei.itb.ac.id\/waskita\/wp-json\/wp\/v2\/posts\/43","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/blogs.stei.itb.ac.id\/waskita\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/blogs.stei.itb.ac.id\/waskita\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/blogs.stei.itb.ac.id\/waskita\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/blogs.stei.itb.ac.id\/waskita\/wp-json\/wp\/v2\/comments?post=43"}],"version-history":[{"count":4,"href":"https:\/\/blogs.stei.itb.ac.id\/waskita\/wp-json\/wp\/v2\/posts\/43\/revisions"}],"predecessor-version":[{"id":48,"href":"https:\/\/blogs.stei.itb.ac.id\/waskita\/wp-json\/wp\/v2\/posts\/43\/revisions\/48"}],"wp:attachment":[{"href":"https:\/\/blogs.stei.itb.ac.id\/waskita\/wp-json\/wp\/v2\/media?parent=43"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/blogs.stei.itb.ac.id\/waskita\/wp-json\/wp\/v2\/categories?post=43"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/blogs.stei.itb.ac.id\/waskita\/wp-json\/wp\/v2\/tags?post=43"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}