Skip to main content

Machine Learning Basics

data collection

  • primary objective is to identify and gather data we intend to use for machine learning
    • check data accuracy

data exploration

  • the process of describing, visualizing, and analyzing data in order to better understand it allows us to answer questions such as
    • how many rows and columns are in the data
    • what type of data do we have
    • are there missing, inconsistent or duplicate values in the data

data preparation

  • the process of making sure that our data (by modifying it) is suitable for the machine learning approach that we choose to use
    • check for missing data (changes in data, human error, bias, lack of reliable input)
    • normalizing data: ensures that values share a common property
      • involves scaling data to fall within a small or specified range
      • often required, reduces complexity, improves interpretability
    • sampling data:
      • the process of selecting a subset of the instances in a dataset as a proxy for the whole
      • original dataset is referred to as population, subset is sample
    • dimensionality reduction:
      • the process of reducing the number of features in a dataset prior to modeling
        • helps to reduce time and storage required to process data
        • improves data visualization and model interpretability
      • reduces complexity and helps avoid the curse of dimensionality
      • feature selection:
        • the process of identifying the minimal set of features needed to build a good model
        • also known as variable subset selection
      • feature extraction:
        • use of mathematical functions to transform high-dimensional data into lower dimension
        • also known as feature projection

modeling

  • involves choosing and applying the appropriate machine learning approach that works well with the data we have and solves the problem that we intend to solve
    • in supervised ML: objective is to build a model that maps a given input (which we call the independent variables) to the given output (which we call the dependent variable)
      • depending on the nature of the dependent variable, problem can be either be called Classification or Regression
        • Classification: if dependent variable is a categorical value (e.g.: color, yes or no, the weather)
        • Regression: if we intend to predict a continuous value (e.g.: age, income, temperature)
          • ml algo that solves only regression
            • logistic regression, simple linear regression, multiple linear regression, poisson regression, polynomial regression
    • ML algo that can solve both Classification and regression problems
      • descision tree, Naive Bayes, neural networks, k-nearest neighbors, support vector machines

evaluation

  • after training a ML model, important to see how well suited it is to the problem at hand
  • in order to get an unbiased evaluation of the performance of our model, we must train the model with a different dataset (training data, and test data) from the one we use to evaluate it

actionable insight