What is Apache Parquet?
Parquet is an open-source, column-oriented data file format designed for efficient data storage and retrieval. It provides efficient data compression and encoding schemes with enhanced performance for handling complex data in bulk. It is designed to be a common interchange format for both batch and interactive workloads, and is similar to other columnar storage file formats in the Hadoop ecosystem, namely RCFile and ORC.
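The column-oriented idea can be illustrated with a minimal sketch in plain Python. This is only a conceptual illustration of the storage layout, not the actual Parquet encoding:

```python
# Illustrative sketch: row-oriented vs column-oriented layout.
# This is NOT Parquet itself, just the layout concept behind it.

rows = [
    ("Jacob", "M", 3000),
    ("Linda", "F", 4000),
    ("Janet", "F", 4500),
]

# Row-oriented storage keeps whole records together.
row_store = list(rows)

# Column-oriented storage keeps each column's values together,
# so a reader can scan one column (e.g. salary) without touching
# the others -- the key idea behind Parquet's efficiency.
column_store = {
    "name":   [r[0] for r in rows],
    "gender": [r[1] for r in rows],
    "salary": [r[2] for r in rows],
}

total_salary = sum(column_store["salary"])  # reads only one column
print(total_salary)
```

Because values of the same type sit next to each other, columnar layouts also compress and encode far better than row layouts, which is why Parquet pairs this layout with its compression schemes.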
In this tutorial, we will show you how to read and write Parquet files.
Install parquet-tools
https://pypi.org/project/parquet-tools/
via pip
pip install parquet-tools
via Homebrew on macOS
brew install parquet-tools
CLI Command Examples
display usage
parquet-tools -h
display row count
parquet-tools rowcount [example_file].snappy.parquet
display the first row
parquet-tools head -n 1 [example_file].snappy.parquet
display with json output
parquet-tools cat --json hdfs://path/to/[example_file].parquet
display with csv output
parquet-tools csv input.gz.parquet | csvq -f json "select [column1], [column2]"
display file meta data
parquet-tools meta [example_file].snappy.parquet
Write Apache Parquet in Python using Spark
import pyspark
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("parquetFile").getOrCreate()
data = [
    ("Jacob", "", "Smith", "36636", "M", 3000),
    ("Alex", "Mozi", "", "40288", "M", 4000),
    ("William", "", "Brown", "42114", "M", 4000),
    ("Linda", "Ann", "Jones", "39192", "F", 4000),
    ("Janet", "Sarah", "Brown", "", "F", -1)
]
columns = ["firstname", "middlename", "lastname", "dob", "gender", "salary"]
df = spark.createDataFrame(data, columns)
Write to Parquet file
df.write.mode("overwrite").parquet("/tmp/output/people.parquet")
Write modes:
append: This mode appends the data from the DataFrame to the existing Parquet files if the destination already exists. If the destination does not exist, it creates a new Parquet file at the specified location.
Example:
df.write.mode("append").parquet("path/to/parquet/file")
overwrite: This mode overwrites the destination Parquet file with the data from the DataFrame. If the file does not exist, it creates a new Parquet file.
Example:
df.write.mode("overwrite").parquet("path/to/parquet/file")
ignore: If the destination Parquet file already exists, this mode leaves it untouched and does not write the DataFrame. If the file does not exist, it creates a new Parquet file.
Example:
df.write.mode("ignore").parquet("path/to/parquet/file")
error or errorIfExists (the default mode): This mode raises an error if the destination Parquet file already exists and does not write the DataFrame. If the file does not exist, it creates a new Parquet file.
Example:
df.write.mode("error").parquet("path/to/parquet/file")
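The semantics of the four modes above can be summarized with a small plain-Python sketch. This is a simulation of the behavior using a dict as the "destination", not Spark's implementation:

```python
# Hypothetical simulation of Spark save-mode semantics.
# `storage` stands in for the filesystem; keys are output paths.

def write_with_mode(storage, path, data, mode):
    exists = path in storage
    if mode == "append":
        # Add the new data to whatever is already there.
        storage.setdefault(path, []).extend(data)
    elif mode == "overwrite":
        # Replace the destination entirely.
        storage[path] = list(data)
    elif mode == "ignore":
        # Write only if nothing exists yet; otherwise do nothing.
        if not exists:
            storage[path] = list(data)
    elif mode in ("error", "errorifexists"):
        # Refuse to clobber an existing destination.
        if exists:
            raise FileExistsError(path)
        storage[path] = list(data)
    else:
        raise ValueError(f"unknown mode: {mode}")

storage = {}
write_with_mode(storage, "/tmp/p", [1, 2], "error")    # creates
write_with_mode(storage, "/tmp/p", [3], "append")      # now [1, 2, 3]
write_with_mode(storage, "/tmp/p", [9], "ignore")      # unchanged
write_with_mode(storage, "/tmp/p", [7], "overwrite")   # now [7]
print(storage["/tmp/p"])
```

Choosing the right mode matters in pipelines: "append" for incremental loads, "overwrite" for full refreshes, and "error"/"ignore" to guard against accidental clobbering.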
Read from Parquet file
parDF1=spark.read.parquet("/tmp/output/people.parquet")
Create table from Parquet file to run queries
parDF1.createOrReplaceTempView("parquetTable")
parDF1.printSchema()
parDF1.show(truncate=False)
Sample query 1:
parkSQL = spark.sql("select * from ParquetTable where salary >= 4000 ")
parkSQL.show(truncate=False)
Sample query 2:
spark.sql("CREATE TEMPORARY VIEW PERSON USING parquet OPTIONS (path \"/tmp/output/people.parquet\")")
spark.sql("SELECT * FROM PERSON").show()
Sample query 3:
df.write.partitionBy("gender","salary").mode("overwrite").parquet("/tmp/output/people2.parquet")
parDF2=spark.read.parquet("/tmp/output/people2.parquet/gender=M")
parDF2.show(truncate=False)
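partitionBy writes one subdirectory per distinct combination of partition values, encoded as key=value path segments, which is why the read above can target ".../gender=M" directly. A plain-Python sketch of how those paths are formed (an illustration of the layout, not Spark internals):

```python
# Illustrative sketch: how partitionBy("gender", "salary") maps rows
# to key=value directory paths under the output location.

rows = [
    {"firstname": "Jacob", "gender": "M", "salary": 3000},
    {"firstname": "Linda", "gender": "F", "salary": 4000},
    {"firstname": "Alex",  "gender": "M", "salary": 4000},
]

def partition_path(base, row, keys):
    # One path segment per partition column, in the order given.
    segments = [f"{k}={row[k]}" for k in keys]
    return "/".join([base] + segments)

paths = {partition_path("/tmp/output/people2.parquet", r, ["gender", "salary"])
         for r in rows}
for p in sorted(paths):
    print(p)
```

Reading a subpath such as ".../gender=M" then only scans the matching subdirectories, which is the basis of partition pruning.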
Sample query 4:
spark.sql("CREATE TEMPORARY VIEW PERSON2 USING parquet OPTIONS (path \"/tmp/output/people2.parquet/gender=F\")")
spark.sql("SELECT * FROM PERSON2" ).show()
Other useful links and tools:
parquet-cli is an alternative to parquet-tools
Install parquet-cli
via pip
pip install parquet-cli
via Homebrew on macOS
brew install parquet-cli
https://github.com/apache/parquet-mr/blob/master/parquet-cli/README.md
Do you like the tutorial "What is Apache Parquet"? If you want the latest updates and more tips and tricks for building your own business platform, please check out more articles on https://www.productdeploy.com and https://blog.productdeploy.com and subscribe to the newsletter.