Posts

Showing posts from July, 2025

Bhakti Aarti - Android App Privacy Policy

Privacy Policy
Effective Date: October 4th, 2025
App Name: Bhakti Aarti
Developer: k3kventures
Contact: sultanmirzadev@gmail.com

1. Introduction
This Privacy Policy explains how we collect, use, and protect your personal information when you use our Android application "Bhakti Aarti" ("App"). By using the App, you agree to the terms of this Privacy Policy.

2. Information We Collect
a. Voice Input
Our App may request access to your device's microphone in order to enable voice-based search functionality. When you use this feature:
- The App listens only when you activate voice input manually.
- Your voice input is processed to understand your search query.
- We do not store or share your voice recordings or voice data.
Voice data may be processed by Google's speech recognition services (or similar services) if you are using a device with Google Play Services enabled. Please refer to Google's Privacy Policy at: https://policies.go...

Order of SQL query execution

In SQL, the order in which you write the clauses in a query is not the same as the order in which the database engine executes them. Here is the logical order of execution of SQL clauses:

1. FROM: identifies source tables and joins data.
2. WHERE: filters rows before grouping.
3. GROUP BY: groups rows based on specified columns.
4. HAVING: filters groups after aggregation.
5. SELECT: selects columns or expressions to return.
6. DISTINCT: removes duplicate rows from the result set.
7. ORDER BY: sorts the result set.
8. LIMIT / OFFSET: returns a subset of the result (pagination).

Example:

    SELECT department, COUNT(*) AS employee_count
    FROM employees
    WHERE status = 'active'
    GROUP BY department
    HAVING COUNT(*) > 10
    ORDER BY employee_count DESC
    LIMIT 5;

Execution order:

    FROM employees
    WHERE status = 'active'
    GROUP BY department
    HAVING COUNT(*) > 10
    SELECT department, COUNT(*) AS employee_count
    ORDER BY employee_count DE...
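The execution order can be observed with a quick sketch against an in-memory SQLite database. The `employees` rows below are hypothetical sample data invented for the demo, not from the post:

```python
import sqlite3

# Build a tiny hypothetical employees table in memory.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (name TEXT, department TEXT, status TEXT)")
rows = (
    [(f"emp{i}", "eng", "active") for i in range(12)]   # 12 active eng rows
    + [(f"emp{i}", "hr", "active") for i in range(3)]   # 3 active hr rows
    + [("emp_x", "eng", "inactive")]                    # filtered out by WHERE
)
conn.executemany("INSERT INTO employees VALUES (?, ?, ?)", rows)

# FROM -> WHERE -> GROUP BY -> HAVING -> SELECT -> ORDER BY -> LIMIT
result = conn.execute("""
    SELECT department, COUNT(*) AS employee_count
    FROM employees
    WHERE status = 'active'        -- runs before grouping: drops the inactive row
    GROUP BY department
    HAVING COUNT(*) > 10           -- runs after aggregation: drops 'hr' (3 rows)
    ORDER BY employee_count DESC
    LIMIT 5
""").fetchall()

print(result)  # [('eng', 12)]
```

Note that writing `WHERE employee_count > 10` would fail: WHERE executes before SELECT assigns the alias, which is exactly why the filter on the aggregate has to live in HAVING.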

AWS Lake Formation (AWS LF) - Governance, Security, Access Control

🧭 Overview: What is AWS Lake Formation?
AWS Lake Formation is a service that simplifies building a secure data lake by:
- Ingesting data from various sources
- Organizing it in Amazon S3
- Setting up data catalogs (via AWS Glue)
- Defining security and access policies
- Querying data with services like Athena, Redshift, and EMR

🛠️ Prerequisites
Before starting, ensure you have:
- An AWS account
- IAM permissions for Lake Formation, Glue, S3, and IAM
- An existing S3 bucket (or create a new one)

🧱 Step 1: Set Up a Data Lake Location
1. Go to the Lake Formation Console.
2. In the left pane, choose "Data lake locations".
3. Click "Register location".
4. Choose your S3 bucket or a folder (e.g., s3://your-bucket/data/).
5. Choose an IAM role that has permission to access this location.

📋 Step 2: Add a Data Catalog Table
1. From the Lake Formation Console, go to "Databases".
2. Click "Create database" (this...

Connect dbt to Athena from a local machine using AWS SSO

Use an AWS SSO login to access Athena from dbt. Example profile configuration:

    outputs:
      dev:
        type: athena
        s3_staging_dir: s3://test-bucket/athena/
        region_name: ap-south-1
        database: awsdatacatalog
        schema: <db_name>
        work_group: <Athena-workgroup>
        aws_profile_name: <profile-name-on-local-machine>

Get all the tables in an AWS Glue database using the AWS CLI

Names of all the tables in a database:

    $ aws glue get-tables --database-name <your-db-name> --output json --query 'TableList[*].Name' --profile <aws-profile-name>

Count of tables in the database:

    $ aws glue get-tables --database-name <your-db-name> --output json --query 'TableList[*].Name' --profile <aws-profile-name> | jq length

Connect to MySQL through a jump-server SSH tunnel

How to connect to MySQL through a jump server: when you don't have direct access to the MySQL server, you use a jump server. From your machine, you SSH to the jump server, and from there you connect to the MySQL server. This extra hop can be avoided with SSH tunneling.

Suppose your:
- jump server is `jump-ip`
- MySQL server is `mysql-ip`
- your machine is `machine-ip`

Open an SSH client (PuTTY on Windows, or a terminal on Linux/macOS) and type:

    ssh -L [local-port]:[mysql-ip]:[mysql-port] [jump-server-user]@[jump-ip]

After this, you can use localhost and the local port to reach the remote MySQL server directly. For example, your JDBC URL to access a MySQL database will then be:

    jdbc:mysql://localhost:[local-port]/[database-name]
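As a small sketch, the tunnel command and the resulting JDBC URL can be assembled programmatically. All host names, ports, and user names below are hypothetical placeholders:

```python
def tunnel_command(local_port, mysql_ip, mysql_port, user, jump_ip):
    """Build the ssh -L command that forwards local_port to mysql_ip:mysql_port via the jump server."""
    return f"ssh -L {local_port}:{mysql_ip}:{mysql_port} {user}@{jump_ip}"

def jdbc_url(local_port, database):
    """Once the tunnel is up, MySQL is reachable on localhost:local_port."""
    return f"jdbc:mysql://localhost:{local_port}/{database}"

cmd = tunnel_command(3307, "10.0.0.25", 3306, "deploy", "jump.example.com")
url = jdbc_url(3307, "sales")
print(cmd)  # ssh -L 3307:10.0.0.25:3306 deploy@jump.example.com
print(url)  # jdbc:mysql://localhost:3307/sales
```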

Benefits of the Apache Parquet format in big data

Benefits of the Parquet format:
- Columnar storage: efficient for analytics and read-heavy workloads; only the required columns are read into memory.
- Highly compressed: supports efficient compression algorithms (Snappy, GZIP, Brotli); smaller file size compared to row-based formats like CSV/JSON.
- Splittable & scalable: files can be split and read in parallel, improving speed in distributed systems like Hadoop/Spark.
- Schema evolution: supports adding new columns without breaking existing data pipelines.
- Efficient for queries: works well with SQL engines like Hive, Presto, Trino, Athena, BigQuery.
- Better IO performance: reduces disk and network IO by avoiding unnecessary data reads.
- Interoperable: supported across multiple languages and platforms (Python, Java, Spark, Hive, AWS, GCP, etc.).
- Self-describing format: stores the schema as metadata within the file itself, so no external schema definitions are needed.
- Great with partitioning: when used wi...

Unzip a zip file stored on S3 without downloading it

AWS S3 is an object store, not a file system, so unzipping is not supported directly. You need some compute to do the unzipping, which you can get by copying the file to an EC2 machine or by using a Lambda function.

Let's connect:
Github: https://github.com/ketankkeshri
LinkedIn: https://in.linkedin.com/in/ketankeshri
YouTube: https://www.youtube.com/@KetanKKeshri
Instagram: https://www.instagram.com/ketankkeshri
Medium: https://medium.com/@ketankkeshri
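Inside that EC2 job or Lambda, the unzip itself can happen entirely in memory with the standard library. A minimal sketch of just the in-memory part; a locally built zip stands in for the bytes you would fetch from S3 (e.g. via a boto3 `get_object` call on a hypothetical bucket/key):

```python
import io
import zipfile

# Stand-in for the zip bytes you would read from the S3 object body.
zip_bytes = io.BytesIO()
with zipfile.ZipFile(zip_bytes, "w") as zf:
    zf.writestr("data/report.csv", "id,value\n1,42\n")

# Unzip without touching the local filesystem.
zip_bytes.seek(0)
with zipfile.ZipFile(zip_bytes) as zf:
    names = zf.namelist()
    content = zf.read("data/report.csv").decode()

print(names)    # ['data/report.csv']
```

Each extracted member can then be written back to S3 as its own object, which keeps the Lambda's disk usage at zero.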

Resolve SSL certificate issues during pip install in Python

SSL certificate issue in Python pip. If you are facing SSL errors while installing pip packages, e.g.:

    $ pip install pandas

resolve it by setting the certificate path:
1. Find the cert path.
2. Set the cert path.
3. Run pip install.

On a MacBook, find the SSL path first by running this command in the terminal:

    $ python -c "import ssl; print(ssl.get_default_verify_paths())"

OR

    $ python3 -c "import ssl; print(ssl.get_default_verify_paths())"

It will give output something like this:

    DefaultVerifyPaths(cafile='/opt/homebrew/etc/openssl@3/cert.pem', capath='/opt/homebrew/etc/openssl@3/certs', openssl_cafile_env='SSL_CERT_FILE', openssl_cafile='/opt/homebrew/etc/openssl@3/cert.pem', openssl_capath_env='SSL_CERT_DIR', openssl_capath='/opt/homebrew/etc/openssl@3/certs')

Now set this path before running the pip command. Run this in your terminal:

    $ export REQUESTS_CA_BUNDLE="/opt/homebrew/etc/openssl@3/cert.pem"

You can set this path in...
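The same lookup can be done from inside Python, and the variable can be set for the current process instead of via `export`. A small sketch; the Homebrew path shown in the post is just one example, and your machine may report a different cafile:

```python
import os
import ssl

paths = ssl.get_default_verify_paths()
ca_bundle = paths.cafile or paths.openssl_cafile  # prefer the resolved cafile if present
print(ca_bundle)  # the CA bundle pip/requests should trust

# Equivalent of `export REQUESTS_CA_BUNDLE=...` for this process and its children.
os.environ["REQUESTS_CA_BUNDLE"] = ca_bundle
```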

Data Engineering Basics to Advanced: Phase IV - Capstone Projects

Capstone Projects

Project ideas:

Batch Pipeline
- Source: CSV on S3
- Process: PySpark/dbt
- Sink: Redshift/BigQuery
- Orchestrate: Airflow

Streaming Pipeline
- Source: Kafka (clickstream or logs)
- Process: Spark Streaming
- Sink: ElasticSearch or S3

Data Observability
- Implement Great Expectations
- Data profiling and alerting

Cloud-native Data Lakehouse
- Glue + Iceberg + Athena + dbt
- Use partitioning, schema evolution

Data Engineering Basics to Advanced: Phase III - Advanced

Advanced Topics:

Big Data Ecosystem
- Apache Spark (core, DataFrames, PySpark)
- Hive/Presto/Athena
- Hadoop (just an architecture overview)

Data Lakes & Lakehouses
- Concepts: data lake, warehouse, lakehouse
- Table formats: Apache Iceberg, Delta Lake, Hudi
- Glue, Athena, Iceberg setup

Streaming Systems
- Kafka: pub/sub, brokers, partitions
- Kafka Connect, schema registry
- Apache Flink or Spark Structured Streaming (basics)

Cloud Data Warehouses
- BigQuery, Redshift, Snowflake: architecture & querying
- Partitioning, clustering, optimization

Monitoring & Observability
- Logging (CloudWatch, Stackdriver)
- Data quality with Great Expectations
- Lineage tools: OpenMetadata, Amundsen

Data Engineering Basics to Advanced: Phase II - Intermediate

Intermediate Topics:

Databases
- Relational (PostgreSQL, MySQL)
- NoSQL (MongoDB, Redis, Cassandra: intro only)
- Data warehousing concepts (star, snowflake schema)

ETL/ELT & Data Pipelines
- Difference between ETL & ELT
- Hands-on with tools:
  - Airflow (or Prefect): DAGs, operators, scheduling
  - dbt: modeling, tests, macros, incremental models

Cloud Basics
- Intro to AWS/GCP/Azure
- S3, IAM, Lambda, CloudWatch (focus on AWS if unsure)
- Basic infra concepts: VPC, subnets, security

File Formats & Serialization
- CSV, JSON, Avro, Parquet
- Compression: gzip, snappy

Data Engineering Basics to Advanced: Phase I - Foundations

Foundations Topics:

Python Programming
- Basics: variables, loops, functions, OOP
- Data structures: lists, dicts, sets
- Working with files, error handling
- Libraries: pandas, requests, json

SQL (Structured Query Language)
- SELECT, WHERE, GROUP BY, JOINs
- Window functions, CTEs, subqueries
- Practice: LeetCode SQL, Mode Analytics SQL Tutorial

Data Fundamentals
- CSV, JSON, Parquet formats
- Data types, schemas, data quality
- Basics of data modeling (OLTP vs OLAP)

Version Control
- Git basics, GitHub flow
- Collaboration, pull requests