Hands-on data science and Python machine learning : perform data mining and machine learning efficiently using Python and Spark /

This book covers the fundamentals of machine learning with Python in a concise and dynamic manner. It covers data mining and large-scale machine learning using Apache Spark. About This Book Take your first steps in the world of data science by understanding the tools and techniques of data analysis...

Bibliographic details
Main author:	Kane, Frank (Author)
Format:	Licensed eBooks
Language:	English
Published:	Birmingham, UK : Packt Publishing, 2017.
Online access:	https://search.ebscohost.com/login.aspx?direct=true&scope=site&db=nlebk&AN=1566405

Table of contents:

Intro
Copyright
Credits
About the Author
www.PacktPub.com
Customer Feedback
Table of Contents
Preface
Chapter 1: Getting Started
Installing Enthought Canopy
Giving the installation a test run
If you occasionally get problems opening your IPYNB files
Using and understanding IPython (Jupyter) Notebooks
Python basics - Part 1
Understanding Python code
Importing modules
Data structures
Experimenting with lists
Pre colon
Post colon
Negative syntax
Adding list to list
The append function
Complex data structures
Dereferencing a single element
The sort function
Reverse sort
Tuples
Dereferencing an element
List of tuples
Dictionaries
Iterating through entries
Python basics - Part 2
Functions in Python
Lambda functions - functional programming
Understanding boolean expressions
The if statement
The if-else loop
Looping
The while loop
Exploring activity
Running Python scripts
More options than just the IPython/Jupyter Notebook
Running Python scripts in command prompt
Using the Canopy IDE
Summary
Chapter 2: Statistics and Probability Refresher, and Python Practice
Types of data
Numerical data
Discrete data
Continuous data
Categorical data
Ordinal data
Mean, median, and mode
Mean
Median
The factor of outliers
Mode
Using mean, median, and mode in Python
Calculating mean using the NumPy package
Visualizing data using matplotlib
Calculating median using the NumPy package
Analyzing the effect of outliers
Calculating mode using the SciPy package
Some exercises
Standard deviation and variance
Variance
Measuring variance
Standard deviation
Identifying outliers with standard deviation
Population variance versus sample variance
The Mathematical explanation
Analyzing standard deviation and variance on a histogram
Using Python to compute standard deviation and variance
Try it yourself
Probability density function and probability mass function
The probability density function and probability mass functions
Probability density functions
Probability mass functions
Types of data distributions
Uniform distribution
Normal or Gaussian distribution
The exponential probability distribution or Power law
Binomial probability mass function
Poisson probability mass function
Percentiles and moments
Percentiles
Quartiles
Computing percentiles in Python
Moments
Computing moments in Python
Summary
Chapter 3: Matplotlib and Advanced Probability Concepts
A crash course in Matplotlib
Generating multiple plots on one graph
Saving graphs as images
Adjusting the axes
Adding a grid
Changing line types and colors
Labeling axes and adding a legend
A fun example
Generating pie charts
Generating bar charts
Generating scatter plots
Generating histograms
Generating box-and-whisker plots
Try it yourself
Covariance and correlation
Defining the concepts
Measuring covariance
Correlation
Computing covariance and correlation in Python
Computing correlation - The hard way
Computing correlation - The NumPy way
Correlation activity
Conditional probability
Conditional probability exercises in Python
Conditional probability assignment
My assignment solution
Bayes' theorem
Summary
Chapter 4: Predictive Models
Linear regression
The ordinary least squares technique
The gradient descent technique
The coefficient of determination or r-squared
Computing r-squared
Interpreting r-squared
Computing linear regression and r-squared using Python
Activity for linear regression
Polynomial regression
Implementing polynomial regression using NumPy
Computing the r-squared error
Activity for polynomial regression
Multivariate regression and predicting car prices
Multivariate regression using Python
Activity for multivariate regression
Multi-level models
Summary
Chapter 5: Machine Learning with Python
Machine learning and train/test
Unsupervised learning
Supervised learning
Evaluating supervised learning
K-fold cross validation
Using train/test to prevent overfitting of a polynomial regression
Activity
Bayesian methods - Concepts
Implementing a spam classifier with Naïve Bayes
Activity
K-Means clustering
Limitations to k-means clustering
Clustering people based on income and age
Activity
Measuring entropy
Decision trees - Concepts
Decision tree example
Walking through a decision tree
Random forests technique
Decision trees - Predicting hiring decisions using Python
Ensemble learning - Using a random forest
Activity
Ensemble learning
Support vector machine overview
Using SVM to cluster people by using scikit-learn
Activity
Summary
Chapter 6: Recommender Systems
What are recommender systems?
User-based collaborative filtering
Limitations of user-based collaborative filtering
Item-based collaborative filtering
Understanding item-based collaborative filtering
How item-based collaborative filtering works
Collaborative filtering using Python
Finding movie similarities
Understanding the code
The corrwith function
Improving the results of movie similarities
Making movie recommendations to people
Understanding movie recommendations with an example
Using the groupby command to combine rows
Removing entries with the drop command
Improving the recommendation results
Summary
Chapter 7: More Data Mining and Machine Learning Techniques
K-nearest neighbors - concepts
Using KNN to predict a rating for a movie
Activity
Dimensionality reduction and principal component analysis
Dimensionality reduction
Principal component analysis
A PCA example with the Iris dataset
Activity
Data warehousing overview
ETL versus ELT
Reinforcement learning
Q-learning
The exploration problem
The simple approach
The better way
Fancy words
Markov decision process
Dynamic programming
Summary
Chapter 8: Dealing with Real-World Data
Bias/variance trade-off
K-fold cross-validation to avoid overfitting
Example of k-fold cross-validation using scikit-learn
Data cleaning and normalisation
Cleaning web log data
Applying a regular expression on the web log
Modification one - filtering the request field
Modification two - filtering post requests
Modification three - checking the user agents
Filtering the activity of spiders/robots
Modification four - applying website-specific filters
Activity for web log data
Normalizing numerical data
Detecting outliers
Dealing with outliers
Activity for outliers
Summary
Chapter 9: Apache Spark - Machine Learning on Big Data
Installing Spark
Installing Spark on Windows
Installing Spark on other operating systems
Installing the Java Development Kit
Installing Spark
Spark introduction
It's scalable
It's fast
It's young
It's not difficult
Components of Spark
Python versus Scala for Spark
Spark and Resilient Distributed Datasets (RDD)
The SparkContext object
Creating RDDs
Creating an RDD using a Python list
Loading an RDD from a text file
More ways to create RDDs
RDD operations
Transformations
Using map()
Actions
Introducing MLlib
Some MLlib Capabilities
Special MLlib data types
The vector data type
LabeledPoint data type
Rating data type
Decision Trees in Spark with MLlib
Exploring decision trees code
Creating the SparkContext
Importing and cleaning our data
Creating a test candidate and building our decision tree
Running the script
K-Means Clustering in Spark
Within set sum of squared errors (WSSSE)
Running the code
TF-IDF
TF-IDF in practice
Using TF-IDF
Searching Wikipedia with Spark MLlib
Import statements
Creating the initial RDD
Creating and transforming a HashingTF object
Computing the TF-IDF score
Using the Wikipedia search engine algorithm
Running the algorithm
Using the Spark 2.0 DataFrame API for MLlib
How Spark 2.0 MLlib works
Implementing linear regression
Summary
Chapter 10: Testing and Experimental Design
A/B testing concepts
A/B tests
Measuring conversion for A/B testing
How to attribute conversions
Variance is your enemy
T-test and p-value
The t-statistic or t-test
The p-value
Measuring t-statistics and p-values using Python
Running A/B test on some experimental data
When there's no real difference between the two groups
Does the sample size make a difference?
Sample size increased to six-digits
Sample size increased to seven-digits
A/A testing
Determining how long to run an experiment for
A/B test gotchas
Novelty effects
Seasonal effects
Selection bias
Auditing selection bias issues
Data pollution
Attribution errors
Summary
Index
