![Python:Advanced Predictive Analytics](https://wfqqreader-1252317822.image.myqcloud.com/cover/356/36700356/b_36700356.jpg)
Creating dummy variables
Creating dummy variables is a method of creating a separate variable for each category of a categorical variable. Although a categorical variable contains plenty of information and might show a causal relationship with the output variable, it can't be used in predictive models such as linear and logistic regression without some processing.
In our dataset, sex is a categorical variable with two categories: male and female. We can create two dummy variables from it as follows:
dummy_sex = pd.get_dummies(data['sex'], prefix='sex')
The result of this statement is as follows:
![](https://epubservercos.yuewen.com/ACC31E/19470399408915306/epubprivate/OEBPS/Images/B01782_02_17.jpg?sign=1739301590-g1ADxUMCScXd3DCdc07txdGfxldaLmFL-0-90f2eb8929b360a8b8a67e0db7773219)
Fig. 2.17: Dummy variable for the sex variable in the Titanic dataset
This process is called dummifying the variable. It creates two new variables that take a value of either 1 or 0, depending on the sex of the passenger. If the sex was female, sex_female would be 1 and sex_male would be 0. If the sex was male, sex_male would be 1 and sex_female would be 0. In general, all but one dummy variable in a row will have a value of 0. The variable derived from the value (for that row) in the original column will have a value of 1.
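The behavior described above can be checked on a small example. The following is a minimal sketch, assuming a hypothetical four-row data frame in place of the Titanic dataset. It also assumes that dtype=int is wanted: pandas 2.0 and later return boolean dummy columns by default, so the argument is passed explicitly to get the 1 and 0 values discussed here.

```python
import pandas as pd

# Hypothetical stand-in for the Titanic data; only the sex column matters here.
data = pd.DataFrame({"sex": ["male", "female", "female", "male"]})

# dtype=int is passed explicitly because pandas 2.0 and later
# return boolean dummy columns unless told otherwise.
dummy_sex = pd.get_dummies(data["sex"], prefix="sex", dtype=int)
print(dummy_sex)

# Sum the dummy columns across each row; exactly one dummy
# should be set to 1 per row, so every sum should equal 1.
row_sums = dummy_sex.sum(axis=1)
print((row_sums == 1).all())
```

The column names are built from the prefix and the category label, which is why the two new columns are called sex_female and sex_male.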
These two new variables can be joined to the source data frame so that they can be used in the models. The method for doing that is as follows:
column_name = data.columns.values.tolist()
column_name.remove('sex')
data[column_name].join(dummy_sex)
The column names are converted to a list, and sex is removed from that list before the two dummy variables are joined to the dataset, because keeping the original sex variable alongside its two dummy variables would be redundant.
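The join step can be sketched end to end as follows. The three-row data frame and its age and fare columns are hypothetical placeholders, not the actual Titanic data, and the names data_new and data_alt are introduced only for illustration. Because join returns a new data frame rather than modifying data in place, the result is assigned to a variable here. The second approach, using drop, is an alternative that is not part of the text and should give the same result.

```python
import pandas as pd

# Hypothetical miniature dataset; age and fare are illustrative columns.
data = pd.DataFrame({
    "age": [22, 38, 26],
    "sex": ["male", "female", "female"],
    "fare": [7.25, 71.28, 7.92],
})

# dtype=int keeps the dummies as 1/0 on pandas 2.0 and later.
dummy_sex = pd.get_dummies(data["sex"], prefix="sex", dtype=int)

# Approach from the text: list the columns, remove sex, then join.
column_name = data.columns.values.tolist()
column_name.remove("sex")
data_new = data[column_name].join(dummy_sex)

# Alternative (an assumption, not from the text): drop the column directly.
data_alt = data.drop(columns="sex").join(dummy_sex)

print(data_new)
```

Both versions join on the row index, so they rely on dummy_sex having the same index as data, which is the case when the dummies are created from a column of the same data frame.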