Sumit Kumar
Passionate about building scalable data pipelines, real-time analytics systems, and exploring machine learning solutions for cybersecurity and financial wellness domains.


About Me
My Journey

I'm Sumit Kumar, a data-driven problem solver with experience in building large-scale ETL systems, developing real-time data pipelines, and working with high-impact data products across FinTech and cybersecurity domains.
I believe in the power of data to transform businesses and lives. My focus is on creating robust, scalable solutions that not only solve immediate problems but also enable future growth.
Experience
Data Engineer
Engineered a custom Java-based Kafka Connect SMT library to process Debezium CDC events from AWS DocumentDB, eliminating...
Data Engineer
Data Migration Framework (Mar 2023 – Nov 2024): Developed a scalable ETL framework for PayPal’s data migration using Pyth...
Data Science Research Intern
Developed a real-time forest fire detection system using Python-based ML algorithms and fuzzy logic, achieving 90% accur...
Technical Skills
Technologies and tools I work with professionally
My Work / Projects
Showcase of my professional work, internships, and side projects.
Featured Work

Smaxiso Writes
A hosted poetry portfolio: a collection of poems exploring themes of love, life, and introspection.

AI Hub
Centralized hub for AI models and tools, streamlining access and management of various artificial intelligence resources.

Local RAG
Local Retrieval-Augmented Generation system for private document chat, enabling secure and offline AI interactions with personal data.

Real-time Transaction Normalization & Scalable Data Lake Design
Built a real-time transaction normalization pipeline using Kafka (AWS MSK), ECS, and Java (Spring Boot) for fraud detection. Optimized the normalization service for vendor compatibility. Designed a scalable data lake using Apache Hudi on S3 and optimized ETL pipelines for Athena querying. Built a Java-based test automation service.

School Chale Ham
An academic blogging platform for K-12 education, with efficient blog creation and management. The backend is powered by Express.js with MongoDB; the frontend is built with Next.js.

CLI Chat Bot
Command-line interface chatbot focused on developer productivity, offering quick access to tools and information via terminal.

VS Code Productivity Extension
Custom Visual Studio Code extension built to enhance developer workflows and automate repetitive coding tasks. Features include enhanced markdown previewing and snippet automation.

Contextual News System
AI-driven news aggregation system using NLP for content personalization and relevance.

Data Migration Framework
Developed a scalable Data Migration Framework for a global payments company using Python, AWS, GCS, and BigQuery. Reduced data migration time by 20% and improved scalability by 30%.

Lynx Framework Optimization
Optimized the Lynx entity linkage framework, improving accuracy and efficiency. Enhanced the Locality-Sensitive Hashing (LSH) algorithm, reducing nearest neighbor search time by 40%. Streamlined feature aggregation pipelines, reducing processing latency by 25%.
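
For readers unfamiliar with Locality-Sensitive Hashing, here is a minimal sketch of one common variant, random-hyperplane hashing for cosine similarity. It is an illustration only and not the Lynx implementation: which LSH family Lynx uses is not stated here, and every function name, the seed, and the number of bits are assumptions.

```python
import random


def make_hyperplanes(dim, num_bits, seed=0):
    """Draw num_bits random hyperplanes (Gaussian normal vectors) in dim dimensions."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(num_bits)]


def signature(vector, hyperplanes):
    """One bit per hyperplane: 1 if the vector lies on its non-negative side."""
    return tuple(
        1 if sum(v * h for v, h in zip(vector, plane)) >= 0 else 0
        for plane in hyperplanes
    )


def build_index(vectors, hyperplanes):
    """Group item keys into buckets keyed by their signature."""
    buckets = {}
    for key, vec in vectors.items():
        buckets.setdefault(signature(vec, hyperplanes), []).append(key)
    return buckets


def candidates(query, buckets, hyperplanes):
    """Return only the items sharing the query's bucket, instead of scanning everything."""
    return buckets.get(signature(query, hyperplanes), [])
```

Vectors pointing in similar directions tend to share a signature, so a nearest-neighbor search only needs to compare the query against one bucket. Using more bits makes buckets smaller and lookups faster, at the cost of missing some true neighbors.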

Google Search Pro
Advanced search scraper and utility tool designed for enhanced information retrieval and data extraction efficiency.

Android Bloatware Removal Guide
A simple blog page hosting an Android bloatware removal guide, with an informative layout designed for easy navigation.

Reporting Framework
Developed on-demand merchant reporting solutions, improving data accuracy by 15% and reducing report generation time by 25%. Contributed to the Argo Framework for report generation at scale.

Fuzzy Control System for Forest Fire Detection
Developed a real-time forest fire detection system using Python-based machine learning algorithms and fuzzy logic. Achieved 90% accuracy in predicting the likelihood and severity of forest fires.
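
As a rough illustration of how a fuzzy rule can turn sensor readings into a risk score, here is a minimal sketch. The input variables (temperature and humidity), the breakpoints, and the single rule are hypothetical and are not taken from the project itself.

```python
def ramp_up(x, lo, hi):
    """Membership that is 0 at or below lo, 1 at or above hi, linear in between."""
    if x <= lo:
        return 0.0
    if x >= hi:
        return 1.0
    return (x - lo) / (hi - lo)


def ramp_down(x, lo, hi):
    """Membership that is 1 at or below lo, 0 at or above hi, linear in between."""
    return 1.0 - ramp_up(x, lo, hi)


def fire_risk(temperature_c, humidity_pct):
    """Single rule: IF temperature is hot AND air is dry THEN risk is high.

    The fuzzy AND is taken as the minimum of the two memberships.
    All breakpoints below are illustrative assumptions.
    """
    hot = ramp_up(temperature_c, 25.0, 45.0)
    dry = ramp_down(humidity_pct, 20.0, 60.0)
    return min(hot, dry)
```

A full fuzzy controller would combine several such rules and defuzzify the result into a crisp output; this sketch shows only the membership and rule-evaluation step.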

Bihar COVID Help
Built a resource-sharing platform for COVID-19 relief, connecting volunteer doctors with patients and aggregating information on critical supplies like oxygen and hospital beds.

Joint Image Compression & Encryption
Created an algorithm for joint image compression and encryption using lossless JPEG 2000 and RC4 encryption. Achieved a compression ratio of 5.2, a Number of Pixels Change Rate (NPCR) of 99.69%, and a Unified Average Changing Intensity (UACI) of 47.63% in processed images.
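
For context, NPCR and UACI are usually computed by comparing two encrypted images pixel by pixel. The sketch below follows the standard textbook definitions for 8-bit grayscale images; the function name and the list-of-rows image representation are assumptions made for illustration, not the project's code.

```python
def npcr_uaci(img_a, img_b, max_value=255):
    """Return (NPCR %, UACI %) for two equal-sized grayscale images.

    Each image is a list of rows of integer pixel values in [0, max_value].
    NPCR: percentage of pixel positions whose values differ.
    UACI: mean absolute difference, normalized by max_value, as a percentage.
    """
    pixels_a = [p for row in img_a for p in row]
    pixels_b = [p for row in img_b for p in row]
    if not pixels_a or len(pixels_a) != len(pixels_b):
        raise ValueError("images must be non-empty and the same size")

    n = len(pixels_a)
    changed = sum(1 for a, b in zip(pixels_a, pixels_b) if a != b)
    total_diff = sum(abs(a - b) for a, b in zip(pixels_a, pixels_b))

    npcr = 100.0 * changed / n
    uaci = 100.0 * total_diff / (max_value * n)
    return npcr, uaci
```

In a typical evaluation, the two inputs are the ciphertexts of a plain image and of a copy with a single pixel changed, so a high NPCR indicates strong sensitivity to small input changes.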

Digital Image Compression
Designed an algorithm for digital image compression using K-means clustering and PCA. Achieved a compression ratio of 2.8 with 55-70% compression and a PSNR of 30 dB or higher. Collaborated in a team of 4 members.
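
The quality figures above can be related with a short sketch of the usual definitions of PSNR, compression ratio, and space saving. It assumes 8-bit grayscale images stored as lists of rows; the helper names are hypothetical, and the project may have defined "compression" differently.

```python
import math


def psnr(original, reconstructed, max_value=255):
    """Peak signal-to-noise ratio in dB between two equal-sized images."""
    flat_o = [p for row in original for p in row]
    flat_r = [p for row in reconstructed for p in row]
    if not flat_o or len(flat_o) != len(flat_r):
        raise ValueError("images must be non-empty and the same size")
    mse = sum((a - b) ** 2 for a, b in zip(flat_o, flat_r)) / len(flat_o)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)


def compression_ratio(original_size, compressed_size):
    """Ratio of uncompressed size to compressed size (same units for both)."""
    return original_size / compressed_size


def space_saving(ratio):
    """Fraction of the original size removed, for a given compression ratio."""
    return 1.0 - 1.0 / ratio
```

Under these definitions, a compression ratio of 2.8 corresponds to a space saving of about 64%, which falls inside the 55-70% range quoted above.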

Image Classification using CNNs
Led a team of 4 in developing a CNN model for image classification on the CIFAR-10 dataset. Achieved 87.44% accuracy on the training set and 82.5% on the test set.

YouTube Spam Comment Filter
Developed a machine learning model for binary classification of YouTube comments as spam or non-spam. Achieved 96.21% accuracy after training on multiple datasets.