Hands-on data science and Python machine learning : perform data mining and machine learning efficiently using Python and Spark
This book covers the fundamentals of machine learning with Python in a concise and dynamic manner. It covers data mining and large-scale machine learning using Apache Spark. About this book: Take your first steps in the world of data science by understanding the tools and techniques of data analysis...
Main author: | |
---|---|
Format: | Licensed eBooks |
Language: | English |
Published: | Birmingham, UK : Packt Publishing, 2017. |
Online access: | https://search.ebscohost.com/login.aspx?direct=true&scope=site&db=nlebk&AN=1566405 |
Table of contents:
- Intro
- Copyright
- Credits
- About the Author
- www.PacktPub.com
- Customer Feedback
- Table of Contents
- Preface
- Chapter 1: Getting Started
- Installing Enthought Canopy
- Giving the installation a test run
- If you occasionally get problems opening your IPYNB files
- Using and understanding IPython (Jupyter) Notebooks
- Python basics - Part 1
- Understanding Python code
- Importing modules
- Data structures
- Experimenting with lists
- Pre colon
- Post colon
- Negative syntax
- Adding list to list
- The append function
- Complex data structures
- Dereferencing a single element
- The sort function
- Reverse sort
- Tuples
- Dereferencing an element
- List of tuples
- Dictionaries
- Iterating through entries
- Python basics - Part 2
- Functions in Python
- Lambda functions - functional programming
- Understanding boolean expressions
- The if statement
- The if-else loop
- Looping
- The while loop
- Exploring activity
- Running Python scripts
- More options than just the IPython/Jupyter Notebook
- Running Python scripts in command prompt
- Using the Canopy IDE
- Summary
- Chapter 2: Statistics and Probability Refresher, and Python Practice
- Types of data
- Numerical data
- Discrete data
- Continuous data
- Categorical data
- Ordinal data
- Mean, median, and mode
- Mean
- Median
- The factor of outliers
- Mode
- Using mean, median, and mode in Python
- Calculating mean using the NumPy package
- Visualizing data using matplotlib
- Calculating median using the NumPy package
- Analyzing the effect of outliers
- Calculating mode using the SciPy package
- Some exercises
- Standard deviation and variance
- Variance
- Measuring variance
- Standard deviation
- Identifying outliers with standard deviation
- Population variance versus sample variance
- The mathematical explanation
- Analyzing standard deviation and variance on a histogram
- Using Python to compute standard deviation and variance
- Try it yourself
- Probability density function and probability mass function
- The probability density function and probability mass functions
- Probability density functions
- Probability mass functions
- Types of data distributions
- Uniform distribution
- Normal or Gaussian distribution
- The exponential probability distribution or Power law
- Binomial probability mass function
- Poisson probability mass function
- Percentiles and moments
- Percentiles
- Quartiles
- Computing percentiles in Python
- Moments
- Computing moments in Python
- Summary
- Chapter 3: Matplotlib and Advanced Probability Concepts
- A crash course in Matplotlib
- Generating multiple plots on one graph
- Saving graphs as images
- Adjusting the axes
- Adding a grid
- Changing line types and colors
- Labeling axes and adding a legend
- A fun example
- Generating pie charts
- Generating bar charts
- Generating scatter plots
- Generating histograms
- Generating box-and-whisker plots
- Try it yourself
- Covariance and correlation
- Defining the concepts
- Measuring covariance
- Correlation
- Computing covariance and correlation in Python
- Computing correlation - The hard way
- Computing correlation - The NumPy way
- Correlation activity
- Conditional probability
- Conditional probability exercises in Python
- Conditional probability assignment
- My assignment solution
- Bayes' theorem
- Summary
- Chapter 4: Predictive Models
- Linear regression
- The ordinary least squares technique
- The gradient descent technique
- The coefficient of determination or r-squared
- Computing r-squared
- Interpreting r-squared
- Computing linear regression and r-squared using Python
- Activity for linear regression
- Polynomial regression
- Implementing polynomial regression using NumPy
- Computing the r-squared error
- Activity for polynomial regression
- Multivariate regression and predicting car prices
- Multivariate regression using Python
- Activity for multivariate regression
- Multi-level models
- Summary
- Chapter 5: Machine Learning with Python
- Machine learning and train/test
- Unsupervised learning
- Supervised learning
- Evaluating supervised learning
- K-fold cross validation
- Using train/test to prevent overfitting of a polynomial regression
- Activity
- Bayesian methods
- Concepts
- Implementing a spam classifier with Naïve Bayes
- Activity
- K-Means clustering
- Limitations to k-means clustering
- Clustering people based on income and age
- Activity
- Measuring entropy
- Decision trees
- Concepts
- Decision tree example
- Walking through a decision tree
- Random forests technique
- Decision trees - Predicting hiring decisions using Python
- Ensemble learning - Using a random forest
- Activity
- Ensemble learning
- Support vector machine overview
- Using SVM to cluster people by using scikit-learn
- Activity
- Summary
- Chapter 6: Recommender Systems
- What are recommender systems?
- User-based collaborative filtering
- Limitations of user-based collaborative filtering
- Item-based collaborative filtering
- Understanding item-based collaborative filtering
- How item-based collaborative filtering works
- Collaborative filtering using Python
- Finding movie similarities
- Understanding the code
- The corrwith function
- Improving the results of movie similarities
- Making movie recommendations to people
- Understanding movie recommendations with an example
- Using the groupby command to combine rows
- Removing entries with the drop command
- Improving the recommendation results
- Summary
- Chapter 7: More Data Mining and Machine Learning Techniques
- K-nearest neighbors - concepts
- Using KNN to predict a rating for a movie
- Activity
- Dimensionality reduction and principal component analysis
- Dimensionality reduction
- Principal component analysis
- A PCA example with the Iris dataset
- Activity
- Data warehousing overview
- ETL versus ELT
- Reinforcement learning
- Q-learning
- The exploration problem
- The simple approach
- The better way
- Fancy words
- Markov decision process
- Dynamic programming
- Summary
- Chapter 8: Dealing with Real-World Data
- Bias/variance trade-off
- K-fold cross-validation to avoid overfitting
- Example of k-fold cross-validation using scikit-learn
- Data cleaning and normalisation
- Cleaning web log data
- Applying a regular expression on the web log
- Modification one - filtering the request field
- Modification two - filtering post requests
- Modification three - checking the user agents
- Filtering the activity of spiders/robots
- Modification four - applying website-specific filters
- Activity for web log data
- Normalizing numerical data
- Detecting outliers
- Dealing with outliers
- Activity for outliers
- Summary
- Chapter 9: Apache Spark - Machine Learning on Big Data
- Installing Spark
- Installing Spark on Windows
- Installing Spark on other operating systems
- Installing the Java Development Kit
- Installing Spark
- Spark introduction
- It's scalable
- It's fast
- It's young
- It's not difficult
- Components of Spark
- Python versus Scala for Spark
- Spark and Resilient Distributed Datasets (RDD)
- The SparkContext object
- Creating RDDs
- Creating an RDD using a Python list
- Loading an RDD from a text file
- More ways to create RDDs
- RDD operations
- Transformations
- Using map()
- Actions
- Introducing MLlib
- Some MLlib Capabilities
- Special MLlib data types
- The vector data type
- LabeledPoint data type
- Rating data type
- Decision Trees in Spark with MLlib
- Exploring decision trees code
- Creating the SparkContext
- Importing and cleaning our data
- Creating a test candidate and building our decision tree
- Running the script
- K-Means Clustering in Spark
- Within set sum of squared errors (WSSSE)
- Running the code
- TF-IDF
- TF-IDF in practice
- Using TF-IDF
- Searching Wikipedia with Spark MLlib
- Import statements
- Creating the initial RDD
- Creating and transforming a HashingTF object
- Computing the TF-IDF score
- Using the Wikipedia search engine algorithm
- Running the algorithm
- Using the Spark 2.0 DataFrame API for MLlib
- How Spark 2.0 MLlib works
- Implementing linear regression
- Summary
- Chapter 10: Testing and Experimental Design
- A/B testing concepts
- A/B tests
- Measuring conversion for A/B testing
- How to attribute conversions
- Variance is your enemy
- T-test and p-value
- The t-statistic or t-test
- The p-value
- Measuring t-statistics and p-values using Python
- Running A/B test on some experimental data
- When there's no real difference between the two groups
- Does the sample size make a difference?
- Sample size increased to six digits
- Sample size increased to seven digits
- A/A testing
- Determining how long to run an experiment for
- A/B test gotchas
- Novelty effects
- Seasonal effects
- Selection bias
- Auditing selection bias issues
- Data pollution
- Attribution errors
- Summary
- Index