Data Analytics with R – MITU Skillologies – Artificial Intelligence, Data Science Training and Development

Software Requirements:
Operating System: Ubuntu 16.04 LTS.

Hardware Requirements:
Processor: Pentium Dual Core or better
Internet Connection

Prerequisites: Basic knowledge of programming and data analysis.

Contents

Module-1

Getting started: The basics of R
–   Why and what is R?
–   Setting up your machine
–   RStudio and R core
R Programming Constructs
–   Basic Syntax, Data Types
–   Variables, Operators
–   Decision Making
–   Loops, Functions
–   Strings, Vectors
–   Lists, Matrices, Arrays
–   Factors, Data Frames
–   Data shaping and reshaping
–   R packages use and installation

Module-2

R Data Interfaces
–   Loading the CSV Files
–   Excel Files
–   Web data
–   XML Data
–   Database connectivity: MySQL
–   Inter-data communication and conversion
R Charts and Graphs
–   Drawing Pie Charts
–   Bar Charts
–   Box plots
–   Histograms
–   Line Graphs
–   Scatter Plots
–   The ggplot2 and plotrix packages

Module-3

Basic Data Analytics using R (on real-world datasets)
–   Create data subsets
–   Merge Data
–   Sort Data
–   Transposing Data
–   Melting Data to long format
–   Casting data to wide format
Data Preprocessing in R (on real-world datasets)
–   Data cleaning
–   Data integration
–   Data transformation
–   Error correction
Exploratory Data Analysis (on real-world datasets)
– Training and testing data
– Data cleaning
– Label encoding
– One Hot encoding

Module-4

Forecasting Numeric Data – Regression Methods
– Understanding regression
– Simple linear regression
– Multiple linear regression
– Example – predicting medical expenses using linear regression
– Collecting data
– Exploring and preparing the data
– Exploring relationships among features – the correlation matrix
– Visualizing relationships among features – the scatterplot matrix
– Training a model on the data
– Evaluating model performance
– Improving model performance
Lazy Learning – Classification Using Nearest Neighbors
– Understanding nearest neighbor classification
– The k-NN algorithm
– Measuring similarity with distance
– Choosing an appropriate k
– Preparing data for use with k-NN
– Why is the k-NN algorithm lazy?
– Example – diagnosing breast cancer with the k-NN algorithm
– Collecting data
– Exploring and preparing the data
– Transformation – normalizing numeric data
– Data preparation – creating training and test datasets
– Training a model on the data
– Evaluating model performance
– Improving model performance

Module-5

Divide and Conquer – Classification Using Decision Trees & Rules
– Understanding decision trees
– Divide and conquer
– The C5.0 decision tree algorithm
– Choosing the best split
– Pruning the decision tree
– Example – identifying risky bank loans using C5.0 decision trees
– Collecting data
– Exploring and preparing the data
– Data preparation – creating random training and test datasets
– Training a model on the data
– Evaluating model performance
– Improving model performance
Finding Patterns – Market Basket Analysis with Association Rules
– Understanding association rules
– The Apriori algorithm for association rule learning
– Measuring rule interest – support and confidence
– Building a set of rules with the Apriori principle
– Example – identifying frequently purchased groceries with association rules
– Collecting data
– Exploring and preparing the data
– Data preparation – creating a sparse matrix for transaction data
– Visualizing item support – item frequency plots
– Visualizing the transaction data – plotting the sparse matrix
– Training a model on the data
– Evaluating model performance
– Improving model performance
Finding Groups of Data – Clustering with k-means
– Understanding clustering
– Clustering as a machine learning task
– The k-means clustering algorithm
– Using distance to assign and update clusters
– Choosing the appropriate number of clusters
– Example – finding teen market segments using k-means clustering
– Collecting data
– Exploring and preparing the data
– Data preparation – dummy coding missing values
– Data preparation – imputing the missing values
– Training a model on the data
– Evaluating model performance
– Improving model performance