• Developed and deployed a project that automatically raises a daily alert for unexpected data on the cluster, reducing the manual effort of detecting and handling non-compliant data by 80%.
Technologies: Python, Hadoop, Hive, Unix, cron, MySQL
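The daily alerting flow above can be sketched as a minimal check in plain Python; the table schema, status values, and threshold below are illustrative assumptions, not details from the actual project, which queried Hive on the cluster and ran under cron:

```python
# Minimal sketch of a daily data-compliance check (hypothetical schema).
# In the real pipeline a Hive query on the cluster, scheduled by cron,
# would produce the records; plain Python stands in for that layer here.

def find_non_compliant(records, allowed_statuses):
    """Return records whose 'status' field has an unexpected value."""
    return [r for r in records if r.get("status") not in allowed_statuses]

def build_alert(bad_records):
    """Format an alert message; in production this would be sent out (e.g. by mail)."""
    if not bad_records:
        return None
    return f"ALERT: {len(bad_records)} non-compliant record(s) found"

rows = [
    {"id": 1, "status": "ok"},
    {"id": 2, "status": "corrupt"},   # unexpected value -> triggers the alert
    {"id": 3, "status": "ok"},
]
alert = build_alert(find_non_compliant(rows, {"ok", "archived"}))
```

Running the check daily from cron and alerting only when the list is non-empty is what removes the manual inspection step.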
• Researched data governance solutions that AT&T can adopt to ensure data meets compliance and privacy requirements
• Researched blockchain technology for creating a data audit trail
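The audit-trail idea is often illustrated as a hash chain in which each entry commits to its predecessor; the sketch below is that generic illustration, not the specific design researched:

```python
import hashlib
import json

def append_entry(chain, event):
    """Append an audit event, chaining it to the previous entry's hash."""
    prev_hash = chain[-1]["hash"] if chain else "0" * 64
    payload = json.dumps({"event": event, "prev": prev_hash}, sort_keys=True)
    chain.append({"event": event, "prev": prev_hash,
                  "hash": hashlib.sha256(payload.encode()).hexdigest()})
    return chain

def verify(chain):
    """Recompute every hash; tampering with any entry breaks the chain."""
    prev = "0" * 64
    for e in chain:
        payload = json.dumps({"event": e["event"], "prev": prev}, sort_keys=True)
        if e["prev"] != prev or e["hash"] != hashlib.sha256(payload.encode()).hexdigest():
            return False
        prev = e["hash"]
    return True

log = []
append_entry(log, "row 42 updated")
append_entry(log, "row 42 deleted")
```

Because each hash covers the previous one, altering any logged event invalidates every later entry, which is the property that makes the trail auditable.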
• Building a solution to analyze and visualize data in order to identify areas of profitability:
1. Data Ingest
• Ingest data from different file formats into HDFS
• Load data into and out of HDFS using Hadoop file system (hdfs dfs) commands
2. Transform, Stage, and Store
• Convert data stored in HDFS from one format or set of values to another and write the results back to HDFS
• Load RDD data from HDFS for use in Spark applications
• Read and write files in a variety of file formats
• Perform standard extract, transform, load (ETL) processes on data
3. Data Analysis using Spark SQL
• Query DataFrames in Spark
• Filter data using Spark
• Write queries that calculate aggregate statistics
• Join disparate datasets using Spark
• Produce sorted data
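The Spark SQL steps in section 3 (filter, aggregate, join, sort) can be sketched with a plain-Python stand-in; in Spark these would be DataFrame operations such as filter, groupBy/agg, and join, and the column names and values below are hypothetical:

```python
# Plain-Python stand-in for the Spark SQL flow above: filter rows,
# compute aggregate statistics, join with a lookup dataset, and sort.
from collections import defaultdict

sales = [
    {"region": "east", "amount": 100},
    {"region": "east", "amount": 50},
    {"region": "west", "amount": 70},
    {"region": "west", "amount": -5},   # invalid row, filtered out below
]
regions = {"east": "Eastern Division", "west": "Western Division"}  # lookup "table"

# Filter: keep only valid amounts (df.filter in Spark).
valid = [r for r in sales if r["amount"] > 0]

# Aggregate: total amount per region (groupBy("region").agg(sum(...)) in Spark).
totals = defaultdict(int)
for r in valid:
    totals[r["region"]] += r["amount"]

# Join with the lookup dataset, then produce sorted output (join + orderBy).
report = sorted(
    ({"division": regions[k], "total": v} for k, v in totals.items()),
    key=lambda row: row["total"],
    reverse=True,
)
```

Each stage maps one-to-one onto a Spark DataFrame transformation, which is why the pipeline translates naturally to Spark SQL at cluster scale.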
• Built a text-analysis solution to identify and categorize costs as spend or repair
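A minimal keyword-matching sketch of the spend-vs-repair categorization follows; the keyword list is an illustrative assumption, and the actual solution used Spark-based text analysis:

```python
# Hypothetical keyword set; the production system classified free-text
# cost descriptions with Spark-based text analysis rather than a fixed list.
REPAIR_KEYWORDS = {"repair", "fix", "replacement", "maintenance"}

def categorize_cost(description):
    """Tag a free-text cost description as 'repair' or 'spend'."""
    words = {w.strip(".,").lower() for w in description.split()}
    return "repair" if words & REPAIR_KEYWORDS else "spend"
```

The same two-way decision, applied per record, is what lets downstream reports split totals into spend and repair buckets.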
• Developed an ingestion pipeline to pull data from an RDBMS, transform it, and automatically generate a matrix every month
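The monthly matrix generation can be sketched as a pivot over rows pulled from the RDBMS; in the actual pipeline Sqoop moved the data into Hadoop, and the field names below are hypothetical:

```python
# Sketch of the "generate a matrix every month" step: pivot flat
# (month, category, value) rows into a month-by-category matrix.
# Field names are illustrative; Sqoop handled the RDBMS extraction.
from collections import defaultdict

def monthly_matrix(rows):
    """Pivot rows into {month: {category: total}}."""
    matrix = defaultdict(lambda: defaultdict(float))
    for r in rows:
        matrix[r["month"]][r["category"]] += r["value"]
    return {m: dict(cats) for m, cats in matrix.items()}

rows = [
    {"month": "2018-01", "category": "spend", "value": 10.0},
    {"month": "2018-01", "category": "repair", "value": 4.0},
    {"month": "2018-01", "category": "spend", "value": 2.5},
]
```

Scheduling this pivot after each monthly Sqoop import is what makes the matrix generation automatic.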
Technologies: Spark, Spark SQL, Scala, Hadoop, Hive, Pig, Solr, Banana, Sqoop