Data Science Foundations: Data Assessment for Predictive Modeling

CRISP-DM, the Cross-Industry Standard Process for Data Mining, is composed of six phases. Most new data scientists rush to modeling because it's the phase in which they have the most training, but whether a project succeeds or fails is actually determined far earlier. This course introduces a systematic approach to the data understanding phase for predictive modeling. Instructor Keith McCormick teaches principles, guidelines, and tools, such as KNIME and R, for properly assessing a data set's suitability for machine learning. Discover how to collect data, describe data, explore data with bivariate visualizations, and verify data quality, as well as how to make the transition to the data preparation phase. The course includes case studies and best practices, along with challenge and solution exercises for enhanced knowledge retention. By the end, you should have the skills you need to give proper attention to this vital phase of every successful data science project.
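
To give a flavor of the bivariate exploration described above, here is a minimal ggplot2 sketch in R. It is not taken from the course; it uses R's built-in mtcars data as a stand-in for the course exercise files.

  library(ggplot2)

  # Bivariate view: weight vs. fuel economy, split by cylinder count,
  # with a linear trend line per group.
  ggplot(mtcars, aes(x = wt, y = mpg, colour = factor(cyl))) +
    geom_point() +
    geom_smooth(method = "lm", se = FALSE) +
    labs(x = "Weight (1,000 lbs)", y = "Miles per gallon", colour = "Cylinders")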

Topics include:

  • Distinguishing data assessment from data viz
  • Mastering the four data understanding tasks
  • Collecting initial data
  • Identifying the level of measurement
  • Loading data
  • Describing data
  • Visualizing data
  • Working with top predictors
  • Using ggplot2 for data viz
  • Verifying data quality
  • Transitioning to data preparation

Course Information

  • Title: Data Science Foundations: Data Assessment for Predictive Modeling
  • Duration: 4 hours 3 minutes
  • Subtitles: English

Course Contents

  1. Why data assessment is critical
  2. A note about the exercise files
  3. Clarifying how data understanding differs from data visualization
  4. Introducing the critical data understanding phase of CRISP-DM
  5. Data assessment in CRISP-DM alternatives: The IBM ASUM-DM and Microsoft TDSP
  6. Navigating the transition from business understanding to data understanding
  7. How to organize your work with the four data understanding tasks
  8. Considerations in gathering the relevant data
  9. A strategy for processing data sources
  10. Getting creative about data sources
  11. How to envision a proper flat file
  12. Anticipating data integration
  13. Reviewing basic concepts in the level of measurement
  14. What is dummy coding?
  15. Expanding our definition of level of measurement
  16. Taking an initial look at possible key variables
  17. Dealing with duplicate IDs and transactional data
  18. How many potential variables (columns) will I have?
  19. How to deal with high-order multiple nominals
  20. Challenge: Identifying the level of measurement
  21. Solution: Identifying the level of measurement
  22. Introducing the KNIME Analytics Platform
  23. Tips and tricks to consider during data loading
  24. Unit analysis decisions
  25. Challenge: What should the row be?
  26. Solution: What should the row be?
  27. How to uncover the gross properties of the data
  28. Researching the dataset
  29. Tips and tricks using simple aggregation commands
  30. A simple strategy for organizing your work
  31. Describe data demo using the UCI heart dataset
  32. Challenge: Practice describe data with the UCI heart dataset
  33. Solution: Practice describe data with the UCI heart dataset
  34. The explore data task
  35. How to be effective doing univariate analysis and data visualization
  36. Anscombe's quartet
  37. The Data Explorer node feature in KNIME
  38. How to navigate borderline cases of variable type
  39. How to be effective in doing bivariate data visualization
  40. Challenge: Producing bivariate visualizations for case study 1
  41. Solution: Producing bivariate visualizations for case study 1
  42. How to utilize an SME's time effectively
  43. Techniques for working with the top predictors
  44. Advice for weak predictors
  45. Tips and tricks when searching for quirks in your data
  46. Learning when to discard rows
  47. Introducing ggplot2
  48. Orientating to R's ggplot2 for powerful multivariate data visualizations
  49. Challenge: Producing multivariate visualizations for case study 1
  50. Solution: Producing multivariate visualizations for case study 1
  51. Exploring your missing data options
  52. Why you lose rows to listwise deletion
  53. Investigating the provenance of the missing data
  54. Introducing the KDD Cup 1998 data
  55. What is the pattern of missing data in your data?
  56. Is the missing data worth saving?
  57. Assessing imputation as a potential solution
  58. Exploring and verifying data quality with the UCI heart dataset
  59. Challenge: Quantifying missing data with the UCI heart dataset
  60. Solution: Quantifying missing data with the UCI heart dataset
  61. Why formal reports are important
  62. Creating a data prep to-do list
  63. How to prepare for eventual deployment
  64. Next steps
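
Items 51 through 60 above concern assessing missing data. As a rough illustration (not taken from the course), the base R sketch below quantifies missingness and the cost of listwise deletion, using R's built-in airquality data as a stand-in for the course datasets.

  # Quantify missing data and the cost of listwise deletion.
  # airquality is a built-in R dataset with NAs in Ozone and Solar.R.
  df <- airquality

  colSums(is.na(df))                    # NA count per variable
  round(colMeans(is.na(df)), 3)         # proportion missing per variable
  sum(complete.cases(df))               # rows that survive listwise deletion
  nrow(df) - sum(complete.cases(df))    # rows lost if any NA drops its row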
