About

I am a Data Engineer with 2.5 years of hands-on experience in creating robust and scalable big data systems. My work revolves around automating analytical processes and delivering critical data products that drive business decisions.


In my professional journey, I have excelled in designing, developing, and managing both batch and real-time streaming data pipelines. My expertise ensures the timely availability of analytical solutions, leveraging an extensive array of big data technologies in both on-premises and cloud-native environments. This experience has solidified my skills in problem-solving, distributed processing, data modeling, data analysis and visualization, data governance, and automating ETL flows using data orchestration tools. Additionally, I have a proven track record in deploying machine learning models.


I thrive on tackling challenging projects as they provide ample learning opportunities, contributing to both my technical and professional growth. My guiding philosophy is: “The World requires change. Change requires narrative. Narrative requires data.” This belief fuels my quest for knowledge and drives my passion for data engineering.


Beyond the industry, I spent 1.5 years as a Graduate Researcher, focusing on resource optimization using Reinforcement Learning. Currently, I am exploring online methods for adjusting the weights of large language models (LLMs) during training. My curiosity has also led me to dabble in IoT, home automation, and computer vision projects.


Further, I love sharing my knowledge and contributing to the data engineering community through my technical writing. I have authored several Medium articles about my learning and experiences. Additionally, I serve as an Editor for the Data Engineer Things publication, which is dedicated to curating original learning resources for data engineers on Medium.

Outside my professional life, I enjoy working out at the gym, watching documentaries, dancing, and exploring nature, and I am a big-time foodie.

Key Competencies

  • Accountable: proven experience in delivering high-impact results
  • Team player: great collaboration skills
  • Agile and adaptable: can quickly adapt to changing requirements
  • Positive influence: optimistic driver of new initiatives

Work Experience

Yahoo, Mountain View, California

Data Engineer Intern, May 2023 - August 2023

Yahoo is a global media and tech company providing various services, including a web portal, search engine, news, email, and advertising. During the internship, I was part of the data platforms team tasked with user profiling, where I built core datasets that enable data-driven decisions on Yahoo's product improvements.

  • Architected AWS QuickSight dashboards that improved cloud security, operational efficiency, cost utilization, and data visibility.
  • Developed an AWS SNS-based system for real-time data anomaly alerts in S3 buckets, boosting data integrity and quality checks by 25%.
  • Automated data ingestion and optimized API queries to retrieve user data from Gemini, achieving 96% accuracy.
  • Developed a BERT-based NLP chatbot, reducing internal data query search time by 30%.
  • Skills: AWS QuickSight, Terraform, Airflow, DynamoDB, Python, Streamlit, BERT, SNS


Texas A&M, College Station, Texas

Machine and Cloud Researcher, September 2022 - Present

The Learning and Emerging Networked Systems (LENS) at Texas A&M aims to conduct analytical and experimental research in machine learning and reinforcement learning with networked systems applications. For 1.5 years, I have worked as a researcher in the resource optimization department. We are trying to disrupt orchestration and autoscaling systems that allocate resources to microservice applications, using zeroth-order gradient approaches and reinforcement learning methods.

  • Optimized data processing pipelines for a social media microservice application, achieving a 12% reduction in latency.
  • Utilized Docker for efficient containerization and integrated real-time data monitoring with Jaeger. Used reinforcement learning and policy gradient strategies for effective resource management, enhancing system scalability to exceed industry standards by 14%.
  • Extended microservices with Spring REST APIs, using Docker for containerization, Kubernetes for orchestration, and Nginx/Apache Tomcat for load balancing and secure token-based authentication.
  • Integrated real-time monitoring and ETL pipelines, reducing data processing time by 40%.
  • Skills: Docker, Kubernetes, Spring, Nginx, Apache Tomcat, Spark, Telemetry, Grafana, Jaeger


Texas A&M, College Station, Texas

Machine and Data Engineer, September 2022 - Present

  • Deployed machine learning models for plant stress detection, achieving 95% IoU with YOLOv8.
  • Architected a full-stack web service for heat risk assessment, integrated its databases, and containerized and deployed it.
  • Integrated MLOps pipelines for automation and scalability.
  • Skills: PyTorch, Django, Flask, Docker, AWS SageMaker, NoSQL (MongoDB, DynamoDB), MySQL, Airflow, ECS.


RedBus (Bus & Train Ticketing Application), Bengaluru, India

Data Engineer, July 2020 - July 2022

RedBus is a multinational online bus ticketing platform that facilitates the booking of bus tickets and related services across multiple cities and operators. I worked there for two years as a Data Engineer on the data lake team, where I contributed to building a data lake for a unified view and developed ETL pipelines for inventory management, sales, payments, and marketing (CRM channels). I also developed dashboards and websites to drive the adoption of premium services offered to operators.

  • Developed and scaled 40+ ETL pipelines with AWS and Apache tools (NiFi, Kafka, Spark, Storm, Hadoop), processing 10M events daily. Optimized batch processing with Spark (8M+ records), saving 2 hours of runtime.
  • Migrated legacy systems to AWS, creating a data lake and warehouse with Redshift (15% retrieval time reduction, 25% storage cost savings). Developed dashboards tailored to business needs. Worked with cross-functional teams in India and Singapore.
  • Engineered a Marketing and Sales Data Analysis Platform, integrating hot and cold data storage (PostgreSQL, AWS S3) and dynamic dashboards (Tableau, Redash), leading to a 2.5% increase in customer acquisition and an 8% rise in quarterly transactions. Under this initiative, built a real-time user interaction analytics pipeline (Kinesis, DynamoDB) processing 400K mobile app events hourly, delivering user engagement insights.
  • Built real-time pipelines for Android user interaction data and payment analytics, processing 400K events hourly and reducing failure notification time by 80%, using Kinesis, Lambda, DynamoDB, S3, SNS, Kafka, and Spark Streaming.
  • Architected a data-driven rating system for 3,500 bus operators across India and Singapore, implementing an ETL pipeline with cold/hot storage in Redshift/DynamoDB. Developed 3NF-compliant models, Tableau dashboards, and a Django REST API-powered web app, resulting in a 20% increase in premium service adoption among operators.
  • Deployed surge pricing & anomaly detection models using AWS SageMaker (CI/CD with Kubeflow, Jenkins, GitHub Actions). Achieved 25% faster model updates, leading to $1M revenue increase & 40% fraud detection improvement.

Intern, Jan 2020 - June 2020

During my internship, I worked on a wide array of IoT, NLP, and computer-vision projects as part of the company's sole R&D team.

  • Developed an IoT and computer-vision-based solution for real-time bus footfall tracking, resulting in a 17% reduction in pilferage.
  • Created an onboard diagnostics system to capture and display bus performance, reducing daily fuel consumption by 500 gallons.
  • Created an Alexa application and streamlined ticketing for voice-command bus ticket bookings, serving 20,000 customers.


Electronics Corporation of India Limited, Hyderabad, India

Intern, June 2018 - August 2018

Electronics Corporation of India Limited (ECIL) is a Government of India enterprise that develops and manufactures electronic products for the defense, nuclear, and industrial sectors.

  • Designed a dashboard using Grafana and a JSON API to visualize the load characteristics of a local electric grid.

Education

Texas A&M University, College Station, Texas

Master of Science in Computer Engineering, August 2022 - Dec 2024

Relevant Coursework: Mathematics for Signal Processing, Data Analytics, Machine Learning, Database Systems, Distributed Systems & Cloud Computing, Operating Systems, Parallel Computing

Awards: Graduate Merit Departmental Scholarship, 3rd place in the Texas A&M Institute of Data Science Competition.



JSS Science & Technology University, Mysore, India

Bachelor of Engineering in Electronics and Communication Engineering, August 2016 - September 2020

Relevant Coursework: Software Engineering, Data Structures and Algorithms, Operating Systems, Linear Algebra and Applications, Computer Networks

Organizations: AeroJC (Aviation Club Head), IEEE student organization

Projects

Explore my data engineering and machine learning projects in this section.

G Maps GPT

Tech Stack: Python, Flask, Neon Tech (Serverless PostgreSQL), OpenAI GPT Models, LangChain, LangGraph, Google Maps API, Chart.js, ReportLab

G Maps GPT is an intelligent assistant for querying and analyzing condominium data in Miami, designed for non-technical users like real estate agents and investors. It offers a natural language interface to access condo sales, market trends, and building data, integrates Google Maps for location-based queries, and visualizes insights using charts, graphs, and interactive maps. Features include PDF report generation, spelling correction using vector embeddings, serverless database scaling, and a curated prompt to reduce hallucinations.



Real-Time Traffic Video Stream Analysis

Tech Stack: Python, AWS, Databricks, Streamlit, Flask, OpenCV, Snowflake, Power BI

A full-stack real-time video stream analytics application that uses Streamlit, Flask, Apache Spark on Databricks, and AWS, supported by CI/CD practices, to perform and visualize object detection and counting on vehicular traffic videos.



Deep Learning Image Classifier Model Deployment

Tech Stack: Flask, Keras, MLflow, DVC, AWS (S3, ECR, EC2), Docker

A comprehensive MLOps pipeline for a plant leaf health classification web application, showcasing data versioning, experiment tracking, CI/CD, containerization, and cloud deployment.



AWS ETL Pipeline for YouTube Data Analysis

Tech Stack: AWS, Python, Spark

I designed and implemented a data architecture built on AWS services. Using S3 for storage along with Lambda, Spark, and Athena, I built an end-to-end data pipeline to transform and analyze a YouTube trending videos dataset with over 1 million rows. I orchestrated the ETL workflow in Glue Studio and created QuickSight dashboards to visualize key metrics.



Real-Time Sentiment Analysis

Tech Stack: Flask, Spark, Kafka, Airflow, AWS S3, Glue, QuickSight, Docker

A data pipeline that ingests user input from a web application in both real-time streaming and batch modes, then applies data processing and sentiment analysis. The web application is built with Flask.



Data Pipeline for the Healthcare Industry

Tech Stack: Amazon S3, PostgreSQL, MySQL, RDS, Spark, Power BI

The Healthcare Data Pipeline project creates a unified data storage solution for large healthcare datasets using Spark on EMR and S3, with Power BI dashboards for visualization, enabling efficient data ingestion, cleansing, transformation, and insightful analysis.



Predicting Large-Scale Wildfires

Tech Stack: Deep Neural Networks, AWS, SHAP Analysis, NLP

A PyTorch-based neural network that predicts wildfires with 95% accuracy, using SHAP analysis and NLP techniques for early detection through social media analysis to support data-driven emergency response strategies.



Taxi Data Analytics and ETL Pipeline

Tech Stack: GCP Storage, Python, Compute Instance, Mage Data Pipeline Tool, BigQuery, and Looker Studio.

We extract data from a web API, automate the ETL with Mage, and store the data in BigQuery; a dashboard built in Looker Studio sits on top of it.

Skills

Here is a snapshot of the data engineering skills that I bring to the table.

Programming Languages

  • Python
  • Java
  • Scala
  • SQL
  • C++

Big Data

  • Snowflake
  • Hadoop
  • Spark
  • Kafka
  • Airflow
  • Storm

Databases

  • MySQL
  • PostgreSQL
  • NoSQL (MongoDB, Cassandra)
  • Oracle

Visualization and Other Tools

  • AWS QuickSight
  • Tableau

Cloud and Containers

  • AWS (S3, EC2, Glue, Athena, Lambda, EMR, IAM)
  • Azure (Data Factory, Data Lake, Databricks, Synapse Analytics)
  • Docker

Certifications

DevOps Tools

  • Git
  • Jira
  • CloudFormation
  • Terraform
  • Jenkins

Contact

Feel free to reach out to me using the details below.

Write to me: prathikvijaykumar@tamu.edu
