Posts

Bhakti Aarti - Android App Privacy Policy

Privacy Policy

Effective Date: October 4th, 2025
App Name: Bhakti Aarti
Developer: k3kventures
Contact: sultanmirzadev@gmail.com

1. Introduction

This Privacy Policy explains how we collect, use, and protect your personal information when you use our Android application "Bhakti Aarti" ("App"). By using the App, you agree to the terms of this Privacy Policy.

2. Information We Collect

a. Voice Input

Our App may request access to your device's microphone in order to enable voice-based search functionality. When you use this feature:
- The App listens only when you activate the voice input manually.
- Your voice input is processed to understand your search query.
- We do not store or share your voice recordings or voice data.

Voice data may be processed by Google's speech recognition services (or similar services) if you are using a device with Google Play Services enabled. Please refer to Google's Privacy Policy at: https://policies.go...

Unzip a zip file stored on S3 without downloading it

AWS S3 is an object store, not a file system, so unzipping is not supported directly. You need some compute for the extraction, which you can get by copying the file to an EC2 machine or by handling it in a Lambda function. A minimal sketch of the Lambda-style approach in Python follows.
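The sketch below is one way to do it with boto3: read the zip object into memory, then write each entry back to S3 as its own object. The bucket and key names are hypothetical, and loading the whole archive into memory only suits small-to-medium files; for very large archives, run the same logic on an EC2 machine or stream with ranged reads instead.

import io
import zipfile

import boto3  # assumes AWS credentials are configured in the environment

s3 = boto3.client("s3")

def unzip_s3_object(src_bucket, src_key, dst_bucket, dst_prefix):
    """Read a zip object from S3 and write each entry back as its own object."""
    # Fetch the whole archive into memory (fine for small/medium zips).
    obj = s3.get_object(Bucket=src_bucket, Key=src_key)
    buffer = io.BytesIO(obj["Body"].read())

    with zipfile.ZipFile(buffer) as zf:
        for name in zf.namelist():
            if name.endswith("/"):   # skip directory entries
                continue
            # Stream each entry straight into a new S3 object.
            with zf.open(name) as entry:
                s3.upload_fileobj(entry, dst_bucket, dst_prefix + name)

# Hypothetical names, for illustration only.
unzip_s3_object("my-source-bucket", "archives/data.zip",
                "my-dest-bucket", "extracted/")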

Resolve SSL certificate issues with pip install in Python

SSL certificate issue with pip in Python

If pip fails with an SSL certificate verification error while installing packages, for example:

$ pip install pandas

you can resolve it by pointing pip at the right certificate bundle:

1. Find the cert path.
2. Set the cert path.
3. Run pip install again.

On a MacBook, find the SSL path first. Run this command in the terminal:

$ python -c "import ssl; print(ssl.get_default_verify_paths())"

or

$ python3 -c "import ssl; print(ssl.get_default_verify_paths())"

It will print something like this:

DefaultVerifyPaths(cafile='/opt/homebrew/etc/openssl@3/cert.pem', capath='/opt/homebrew/etc/openssl@3/certs', openssl_cafile_env='SSL_CERT_FILE', openssl_cafile='/opt/homebrew/etc/openssl@3/cert.pem', openssl_capath_env='SSL_CERT_DIR', openssl_capath='/opt/homebrew/etc/openssl@3/certs')

Now set the cafile path before running pip. Run this in your terminal:

$ export REQUESTS_CA_BUNDLE="/opt/homebrew/etc/openssl@3/cert.pem"

You can set this path in your shell profile (for example ~/.zshrc or ~/.bash_profile) so it persists across sessions.
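If the environment variable alone does not help, pip can also take the bundle directly. Assuming the same cafile path from the output above, either of these should work:

$ pip install --cert /opt/homebrew/etc/openssl@3/cert.pem pandas
$ pip config set global.cert /opt/homebrew/etc/openssl@3/cert.pem

The second command writes the path into pip's config file, so every later pip install picks it up automatically.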

Data Engineering Basics to Advanced: Phase IV - Capstone Projects

Capstone Projects

Project ideas:

1. Batch Pipeline
   - Source: CSV on S3
   - Process: PySpark/dbt
   - Sink: Redshift/BigQuery
   - Orchestrate: Airflow (a minimal DAG sketch follows this list)
2. Streaming Pipeline
   - Source: Kafka (clickstream or logs)
   - Process: Spark Streaming
   - Sink: Elasticsearch or S3
3. Data Observability
   - Implement Great Expectations
   - Data profiling and alerting
4. Cloud-native Data Lakehouse
   - Glue + Iceberg + Athena + dbt
   - Use partitioning, schema evolution
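To make the batch pipeline concrete, here is a minimal Airflow DAG sketch. The DAG id, task names, and callables are all hypothetical placeholders, and the schedule parameter assumes Airflow 2.4+ (older versions use schedule_interval).

from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

# Placeholder callables; swap in real S3 extract, PySpark/dbt, and warehouse-load logic.
def extract_csv_from_s3():
    print("download CSV from S3")

def transform_with_pyspark():
    print("run the PySpark job")

def load_to_warehouse():
    print("COPY into Redshift/BigQuery")

with DAG(
    dag_id="batch_csv_pipeline",   # hypothetical name
    start_date=datetime(2025, 1, 1),
    schedule="@daily",
    catchup=False,
) as dag:
    extract = PythonOperator(task_id="extract", python_callable=extract_csv_from_s3)
    transform = PythonOperator(task_id="transform", python_callable=transform_with_pyspark)
    load = PythonOperator(task_id="load", python_callable=load_to_warehouse)

    # Linear dependency: extract, then transform, then load.
    extract >> transform >> load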

Data Engineering Basics to Advanced: Phase III - Advanced

Advanced Topics:

1. Big Data Ecosystem
   - Apache Spark (core, DataFrames, PySpark)
   - Hive/Presto/Athena
   - Hadoop (architecture overview only)
2. Data Lakes & Lakehouses
   - Concepts: data lake, warehouse, lakehouse
   - Table formats: Apache Iceberg, Delta Lake, Hudi
   - Glue, Athena, Iceberg setup
3. Streaming Systems
   - Kafka: pub/sub, brokers, partitions
   - Kafka Connect, schema registry
   - Apache Flink or Spark Structured Streaming (basics; see the sketch after this list)
4. Cloud Data Warehouses
   - BigQuery, Redshift, Snowflake: architecture & querying
   - Partitioning, clustering, optimization
5. Monitoring & Observability
   - Logging (CloudWatch, Stackdriver)
   - Data quality with Great Expectations
   - Lineage tools: OpenMetadata, Amundsen
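Since Spark Structured Streaming appears above only as "basics", here is a minimal PySpark sketch of reading a Kafka topic. The broker address and topic name are placeholders, and the job needs the spark-sql-kafka connector package on its classpath (e.g. via --packages).

from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName("clickstream-demo").getOrCreate()

# Subscribe to a Kafka topic; broker and topic names are illustrative.
events = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "localhost:9092")
    .option("subscribe", "clickstream")
    .load()
)

# Kafka delivers key/value as binary; cast both to strings for inspection.
decoded = events.select(
    col("key").cast("string").alias("key"),
    col("value").cast("string").alias("value"),
)

# Console sink for the demo; a real pipeline would write to Elasticsearch or S3.
query = decoded.writeStream.format("console").outputMode("append").start()
query.awaitTermination()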

Data Engineering Basics to Advanced: Phase II - Intermediate

Intermediate Topics:

1. Databases
   - Relational (PostgreSQL, MySQL)
   - NoSQL (MongoDB, Redis, Cassandra - intro only)
   - Data warehousing concepts (star schema, snowflake schema)
2. ETL/ELT & Data Pipelines
   - Difference between ETL & ELT
   - Hands-on with tools:
     - Airflow (or Prefect): DAGs, operators, scheduling
     - dbt: modeling, tests, macros, incremental models
3. Cloud Basics
   - Intro to AWS/GCP/Azure
   - S3, IAM, Lambda, CloudWatch (focus on AWS if unsure)
   - Basic infra concepts: VPC, subnets, security
4. File Formats & Serialization
   - CSV, JSON, Avro, Parquet
   - Compression: gzip, snappy (a short Parquet example follows this list)
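To make the file-format comparison tangible, here is a small pandas sketch that writes the same frame as CSV and as snappy-compressed Parquet. It assumes pandas with the pyarrow engine installed; the column names are invented for the example.

import pandas as pd  # requires pyarrow (or fastparquet) for Parquet support

# A small illustrative dataset.
df = pd.DataFrame(
    {
        "user_id": [1, 2, 3],
        "event": ["click", "view", "click"],
        "ts": pd.to_datetime(["2025-01-01", "2025-01-02", "2025-01-03"]),
    }
)

# CSV: human-readable, no schema, typically largest on disk.
df.to_csv("events.csv", index=False)

# Parquet: columnar, typed schema, compressed (snappy is the common default).
df.to_parquet("events.parquet", compression="snappy")

# Reading back preserves dtypes, unlike CSV where everything must be re-parsed.
restored = pd.read_parquet("events.parquet")
print(restored.dtypes)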

Data Engineering Basics to Advanced: Phase I - Foundations

Foundations Topics:

1. Python Programming
   - Basics: variables, loops, functions, OOP
   - Data structures: lists, dicts, sets
   - Working with files, error handling
   - Libraries: pandas, requests, json
2. SQL (Structured Query Language)
   - SELECT, WHERE, GROUP BY, JOINs
   - Window functions, CTEs, subqueries (see the sketch after this list)
   - Practice: LeetCode SQL, Mode Analytics SQL Tutorial
3. Data Fundamentals
   - CSV, JSON, Parquet formats
   - Data types, schemas, data quality
   - Basics of data modeling (OLTP vs OLAP)
4. Version Control
   - Git basics, GitHub flow
   - Collaboration, pull requests
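Window functions are often the first wall beginners hit in SQL, so here is a self-contained sketch using Python's built-in sqlite3 module (SQLite has supported window functions since version 3.25). The table and column names are made up for the example.

import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sales (region TEXT, amount INTEGER);
    INSERT INTO sales VALUES
        ('north', 100), ('north', 250), ('south', 80), ('south', 300);
""")

# RANK() over a per-region window: orders rows within each region by amount.
rows = conn.execute("""
    SELECT region,
           amount,
           RANK() OVER (PARTITION BY region ORDER BY amount DESC) AS rnk
    FROM sales
""").fetchall()

for region, amount, rnk in rows:
    print(region, amount, rnk)  # e.g. north 250 1 ranks first in its region

conn.close()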