
Random Forest - a tool to achieve Top 2% in Kaggle


Preface: This article walks through the Kaggle Titanic survival-prediction exercise in three parts. First, important variables are selected by experience for an initial prediction; the accuracy is 0.77990 and the result ranks around the 50th percentile. Second, borrowing the ideas of a Kaggle expert, a random forest is used for prediction; the accuracy rises to 0.81340 and the ranking moves into the Top 7%. Third, with further variable tuning on top of that code, the ranking climbs to 127, within the Top 2%. Comments and criticism are welcome!

Background:

Anyone who has seen the movie knows that the Titanic struck an iceberg and sank on her maiden voyage from England to New York. In this disaster, people scrambled to escape the sinking ship, and "women and children first" was the guiding rule. Because there were not enough lifeboats, only a minority of passengers survived. So who survived? That is our task: use R to predict who lived.

In Kaggle, we can get two data sets:

A training set (train) that includes the outcome for each passenger (i.e., the target variable) along with other attributes, such as gender, age, and so on. We will train our model on this dataset.

A test set (test) that provides the same non-target variables but withholds the target variable. We must predict whether each passenger survived based on the non-target variables in the test set.

1. Data import

You can click on “Import Dataset” to import the dataset directly; you can also use the read function to enter the code:

train <- read.csv("C:/Users/Administrator/Desktop/Titanic/train.csv", stringsAsFactors = F)

test <- read.csv("C:/Users/Administrator/Desktop/Titanic/test.csv", stringsAsFactors = F)



From the structure of the two datasets we can see that the training set has 891 observations (rows) with 12 variables each. The test set is smaller: the fate of only 418 passengers needs to be predicted, and it has only 11 variables, the missing one being the "Survived" column. We can also read off the type of each variable (int, num, chr, etc.).

Below, we will explain the variables:

PassengerId: the passenger's number, used only to tell passengers apart;

Survived: whether the passenger survived; this is the target variable we predict;

Pclass: cabin class (1, 2, or 3);

Name: name;

Sex: gender;

Age: age;

SibSp: number of siblings or spouses on board;

Parch: number of parents or children on board;

Ticket: ticket number;

Fare: ticket price;

Cabin: cabin number;

Embarked: boarding port.

2. Data processing and preliminary prediction

First, let's look at the passenger survival rate in the training set:

From the training set we can use the table(train$Survived) command to tabulate passenger outcomes. To make the numbers more readable, prop.table(table(train$Survived)) gives the survival rate directly:

As can be seen from the output, the survival rate in the training set is about 38% (342/891), which means that about 62% (549/891) of the passengers died.
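The two commands above can be sketched on a stand-in vector (the counts below are copied from the text; the real data come from train.csv, which is not loaded here):

```r
# Stand-in for train$Survived: 549 deaths (0) and 342 survivors (1),
# matching the counts quoted above -- not the real data file
survived <- c(rep(0, 549), rep(1, 342))
counts <- table(survived)    # absolute counts per outcome
rates <- prop.table(counts)  # convert counts to proportions
round(rates, 2)              # 0 -> 0.62, 1 -> 0.38
```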

So what about the 418 passengers in the test set? Will they survive? All of them? None of them? Some mixture? Let's make a preliminary prediction:

2.1 Gender impact

We know that the rule in this disaster was "women and children first", so let's first analyze gender in the training set. As above, use the table(train$Sex) command to tabulate passenger gender, and prop.table(table(train$Sex)) to get the gender ratio:

From the output we can see that female passengers make up about 35%, so most passengers are male. Extending the table command to two dimensions, table(train$Sex, train$Survived) cross-tabulates gender against survival, and prop.table(table(train$Sex, train$Survived), 1) gives the survival ratio within each gender.

From this cross-tab we can see that, in the training set, the vast majority of female passengers (233/314) survived, while only a small fraction of male passengers (109/577) did.
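The row-wise proportion trick can be checked on a toy cross-tab (the counts are copied from the text rather than recomputed from the data):

```r
# Toy reconstruction of the Sex-vs-Survived table: 233 of 314 women and
# 109 of 577 men survived, as quoted above
sex      <- c(rep('female', 314), rep('male', 577))
survived <- c(rep(1, 233), rep(0, 81), rep(1, 109), rep(0, 468))
tab <- table(sex, survived)
prop.table(tab, 1)  # margin = 1: proportions within each row (each sex)
```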

2.2 Age effects

Considering that age should also affect survival rates ("children first"), let's analyze it with the summary(train$Age) command:

The age summary shows 177 missing values. Here we can simply assume these 177 passengers are adults over 18. Following the Convention on the Rights of the Child, a child is anyone under the age of 18. With that, we can analyze the survival of children:

First, add a variable named Child, initialize it to 0 for everyone, and then set it to 1 for passengers under 18:

train$Child <- 0
train$Child[train$Age < 18] <- 1

Here, we use the aggregate command to analyze the child's survival:
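The aggregate call referred to here can be sketched as follows; the data frame below is a made-up miniature (hypothetical passengers), not the real training set:

```r
# Hypothetical miniature of the training set (8 made-up passengers)
toy <- data.frame(
  Survived = c(1, 0, 1, 1, 0, 0, 1, 0),
  Sex      = c('female','male','female','female','male','male','female','male'),
  Age      = c(4, 30, 25, 17, 8, 40, 52, 16)
)
toy$Child <- 0
toy$Child[toy$Age < 18] <- 1
# Survival rate per Child x Sex group: survivors divided by group size
aggregate(Survived ~ Child + Sex, data = toy, FUN = function(x) {sum(x)/length(x)})
```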

As can be seen from the result, male passengers have a low survival rate (< 40%) whether they are children or not, while female passengers, children and adults alike, have high and fairly similar survival rates. So the initial judgment is that age has little effect on a passenger's survival rate.

2.3 The impact of fares

Similarly, we add a variable Fare2 that buckets fares into four intervals: 30+, 20-30, 10-20, and <10 USD. Enter the code:

train$Fare2 <- '30+'
train$Fare2[train$Fare < 30 & train$Fare >= 20] <- '20-30'
train$Fare2[train$Fare < 20 & train$Fare >= 10] <- '10-20'
train$Fare2[train$Fare < 10] <- '<10'

Then use the aggregate command to analyze survival under the joint influence of cabin class (economic status), ticket price, and gender:
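A sketch of the bucketing plus this aggregate call, on a hypothetical four-row frame (the real call runs against the full training set):

```r
# Hypothetical miniature (not the real data); mirrors the Fare2 code above
toy <- data.frame(
  Survived = c(1, 0, 0, 1),
  Sex      = c('female', 'female', 'male', 'male'),
  Pclass   = c(3, 3, 1, 1),
  Fare     = c(8, 25, 80, 15)
)
toy$Fare2 <- '30+'
toy$Fare2[toy$Fare < 30 & toy$Fare >= 20] <- '20-30'
toy$Fare2[toy$Fare < 20 & toy$Fare >= 10] <- '10-20'
toy$Fare2[toy$Fare < 10] <- '<10'
# Survival rate per Fare2 x Pclass x Sex combination
aggregate(Survived ~ Fare2 + Pclass + Sex, data = toy, FUN = function(x) {sum(x)/length(x)})
```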

From the result we can conclude that female passengers had a much higher survival rate than men in every cabin class, with women in 1st and 2nd class surviving at the highest rates. The one exception is third class: female passengers there with fares of 20-30 or 30+ had, for some reason, low survival rates (only 33.3% and 12.5%). So we can make a preliminary rule: in this disaster, the survivors were basically the women in 1st and 2nd class, plus the third-class women who paid lower fares:

# Add the Survived variable to the test set, initialized to 0
test$Survived <- 0

# Set Survived to 1 for all women in the test set
test$Survived[test$Sex == 'female'] <- 1

# Set Survived back to 0 for third-class women in the test set with fares of 20-30 or 30+
test$Survived[test$Sex == 'female' & test$Pclass == 3 & test$Fare >= 20] <- 0

# Generate a new data frame in the format Kaggle expects
submit <- data.frame(PassengerId = test$PassengerId, Survived = test$Survived)

# Output the data frame in csv format
write.csv(submit, file = "first submission.csv", row.names = FALSE)

Kaggle upload results:

Rank 3554/7146. Sigh, it's not that simple. Let me go see what models the experts use; I'll be back!

—————————————— The dividing line of the first counterattack —————————————————

Back after a few days of work: I found a tutorial post written by an expert that completely won me over. After running the code, I uploaded the result to Kaggle:

Rank 417! The top 7%! What is this shiny thing, you ask? Random forest! That's it, that's it!

First, let's take a look at random forests (adapted from the "random forest" post on emanlee's cnblogs blog):

Random Forest is a versatile machine learning algorithm that can perform both regression and classification tasks. It also helps with important steps of data exploration, such as handling missing values and outliers, and does so with good results. In addition, as an important ensemble learning method, it combines several weak models into one strong, efficient model.

In a random forest we grow many decision trees, rather than the single tree of the CART model. When classifying a new object by its attributes, each tree in the forest gives its own class vote, and the forest's overall output is the class with the most votes. In regression problems, the forest's output is the average of all the trees' outputs.
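The voting rule can be seen in one line; the five "votes" below are hypothetical tree outputs, not a real forest:

```r
# Hypothetical class votes from five trees for one new object
votes <- c('1', '0', '1', '1', '0')
# The forest's classification output is the majority class
names(which.max(table(votes)))  # "1"
```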

1. How does the random forest algorithm work?

In a random forest, the rules for “planting” and “growing” each decision tree are as follows:

1. Let N be the number of samples in the training set; draw N samples with replacement (bootstrap sampling). This sample serves as the training set for one decision tree;

2. If there are M input variables, each node will randomly select m (m < M) specific variables, and then use these m variables to determine the best split point. During the generation of the decision tree, the value of m remains unchanged;

3. Each decision tree grows as much as possible without pruning;

4. Predict new data by aggregating all the decision trees (a majority vote for classification, an average for regression).
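Rules 1 and 2 above can be sketched directly (N, M and m below are illustrative values, not tuned parameters):

```r
set.seed(42)
# Rule 1: a bootstrap sample -- N draws with replacement from N training rows
N <- 10
boot_idx <- sample(1:N, N, replace = TRUE)
# Rule 2: at each node, pick m of the M input variables at random (m stays fixed)
M <- 7
m <- 3
node_vars <- sample(paste0("x", 1:M), m)
```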

2. Advantages and disadvantages of random forests


Advantages:

1. As mentioned above, the random forest algorithm can solve both classification and regression problems, with quite good performance in both;

2. Random Forest's ability to process high-dimensional datasets is exciting. It can handle thousands of input variables and identify the most important ones, so it is considered a good dimensionality-reduction method. The model also outputs the importance of each variable, which is a very convenient feature. The following figure shows the form of a random forest's variable-importance output:

3. Random forests are a very effective method for estimating missing data. Even if there is a large amount of data missing, random forests can better maintain accuracy;

4. When there is a situation of unbalanced classification, the random forest can provide an effective method to balance the error of the data set;

5. The capabilities above extend to unlabeled datasets, where the model can guide unsupervised clustering, data views and anomaly detection;

6. The random forest algorithm involves repeated resampling of the input data, the so-called bootstrap sampling. As a result, roughly one-third of the dataset is not used to train each tree and can be used for testing instead. These data are called out-of-bag (OOB) samples, and the error estimated on them is called the out-of-bag error. Studies show this OOB estimate is about as accurate as using a held-out test set the same size as the training set, so no separate test set needs to be set aside for a random forest.
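The "about one-third" figure follows from bootstrap sampling: the chance a given row is never drawn in N draws with replacement is (1 - 1/N)^N ≈ e^(-1) ≈ 0.368. A quick simulation (illustrative only, not part of the forest itself):

```r
set.seed(1)
N <- 100000
# One bootstrap sample; rows never drawn form the out-of-bag (OOB) sample
in_bag <- unique(sample(1:N, N, replace = TRUE))
oob_fraction <- 1 - length(in_bag) / N
round(oob_fraction, 2)  # close to exp(-1), i.e. about 0.37
```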


Disadvantages:

1. Random forests do not perform as well on regression as on classification, because they cannot give a truly continuous output. When performing regression, a random forest cannot predict beyond the range of the training data, which may lead to overfitting when modeling data with specific noise.

2. For many statistical modelers, random forests feel like a black box—you can hardly control the internals of the model, and you can only try between different parameters and random seeds.

In short, a random forest is a classifier that trains multiple decision trees and combines their predictions. A popular-science introduction to random forests is also worth a quick read.

Ok, now take the code to practice it:

# Load the data
train <- read.csv("train.csv")
test <- read.csv("test.csv")

# Load the required decision-tree and random forest packages
library(rpart)
library(party)

# Combine the training and test sets into one data frame "combi" so that new
# variables are constructed consistently across both (Survived is unknown for
# the test rows)
test$Survived <- NA
combi <- rbind(train, test)

# Form a new variable "Title", extracted from the Name variable
combi$Name <- as.character(combi$Name)
combi$Title <- sapply(combi$Name, FUN=function(x) {strsplit(x, split='[,.]')[[1]][2]})
combi$Title <- sub(' ', '', combi$Title)
combi$Title <- factor(combi$Title)

# Form a new variable "FamilySize" (family size), calculated from SibSp and Parch
combi$FamilySize <- combi$SibSp + combi$Parch + 1

# Form new variables "Surname" and "FamilyID"
combi$Surname <- sapply(combi$Name, FUN=function(x) {strsplit(x, split='[,.]')[[1]][1]})

# Form "FamilyID" by pasting family size and surname; families with
# FamilySize <= 2 are labelled 'Small'
combi$FamilyID <- paste(as.character(combi$FamilySize), combi$Surname, sep="")
combi$FamilyID[combi$FamilySize <= 2] <- 'Small'

# Clean up FamilyIDs that still have too few members
famIDs <- data.frame(table(combi$FamilyID))
famIDs <- famIDs[famIDs$Freq <= 2,]
combi$FamilyID[combi$FamilyID %in% famIDs$Var1] <- 'Small'

# Convert the variable to factor format
combi$FamilyID <- factor(combi$FamilyID)

# Fill in missing values of the Age variable by predicting age with a decision tree
summary(combi$Age)
Agefit <- rpart(Age ~ Pclass + Sex + SibSp + Parch + Fare + Embarked + Title + FamilySize,
                data=combi[!$Age),], method="anova")
combi$Age[$Age)] <- predict(Agefit, combi[$Age),])

# Check for other missing values in the dataset
summary(combi)


# Fill in missing values of the Embarked variable
summary(combi$Embarked)
which(combi$Embarked == '')
combi$Embarked[c(62,830)] <- "S"
combi$Embarked <- factor(combi$Embarked)

# Fill in the missing value of the Fare variable
summary(combi$Fare)
which($Fare))
combi$Fare[1044] <- median(combi$Fare, na.rm=TRUE)

# Split the combi dataset back into the train and test sets
train <- combi[1:891,]
test <- combi[892:1309,]

# Build a random forest model (a conditional inference forest via cforest)
fit <- cforest(as.factor(Survived) ~ Pclass + Sex + Age + SibSp + Parch + Fare + Embarked + Title + FamilySize + FamilyID,
               data = train, controls = cforest_unbiased(ntree = 2000, mtry = 3))

# View the importance of the variables

# Predict and output the data table
Prediction <- predict(fit, test, OOB=TRUE, type = "response")
submit <- data.frame(PassengerId = test$PassengerId, Survived = Prediction)
write.csv(submit, file = "forest-5.csv", row.names = FALSE)

Variable importance output results:

MeanDecreaseAccuracy measures how much the forest's prediction accuracy drops when a variable's values are randomly permuted; the larger the value, the more important the variable. MeanDecreaseGini uses the Gini index to measure how much each variable reduces node impurity across the classification trees; again, larger values indicate more important variables.
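For reference, the Gini index behind MeanDecreaseGini measures a node's impurity as 1 - Σp², where p are the class proportions in the node (a generic formula, not output from this particular model):

```r
# Gini impurity of a node, given its class proportions
gini <- function(p) 1 - sum(p^2)
gini(c(0.5, 0.5))  # 0.5: a maximally mixed two-class node
gini(c(1, 0))      # 0:   a pure node
```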

As can be seen from the importance output, fare (Fare), cabin class (Pclass) and gender (Sex) do play key roles in predicting survival in this catastrophe, and the newly generated variable Title is the most important of all.

This random forest model uses the original variables in the dataset: cabin class (Pclass), gender (Sex), age (Age), siblings/spouses on board (SibSp), parents/children on board (Parch), fare (Fare) and embarkation port (Embarked), plus the newly generated variables title (Title), family size (FamilySize) and family ID (FamilyID).

———————————— Separation line of the second counterattack ————————————————————

Back from work with a new discovery: this time the Ticket variable is analyzed. Since individual ticket numbers repeat only a few times and cannot be used directly as a factor, we first count the number of passengers holding each ticket:

ticket.count <- aggregate(combi$Ticket, by = list(Ticket = combi$Ticket), function(x) sum(!

Imagine that passengers sharing a ticket number are a family, likely to survive or die together. We now divide all passengers into two groups by Ticket: one group holds a unique ticket number, the other shares a ticket number with someone else, and we count the survivors and victims in each group.

combi$TicketCount <- apply(combi, 1, function(x) ticket.count[which(ticket.count[, 1] == x['Ticket']), 2])

combi$TicketCount <- factor(sapply(combi$TicketCount, function(x) ifelse(x > 1, 'Share', 'Unique')))
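The two lines above can be traced on a hypothetical four-ticket frame (made-up ticket numbers, not the real combi dataset):

```r
# Hypothetical miniature of combi: two passengers share ticket 'A1';
# the other two tickets are unique
toy <- data.frame(Ticket = c('A1', 'A1', 'B2', 'C3'), stringsAsFactors = FALSE)
# Passengers per ticket (length(x) stands in for sum(! here,
# since the toy frame has no NAs)
tc <- aggregate(toy$Ticket, by = list(Ticket = toy$Ticket), function(x) length(x))
# Look up each row's ticket count, then collapse to Share vs Unique
toy$TicketCount <- apply(toy, 1, function(x) tc[which(tc[, 1] == x['Ticket']), 2])
toy$TicketCount <- factor(sapply(toy$TicketCount, function(x) ifelse(x > 1, 'Share', 'Unique')))
toy$TicketCount  # Share Share Unique Unique
```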



fit <- cforest(as.factor(Survived) ~ Pclass + Sex + Age + Fare + Embarked + Title + FamilySize + FamilyID + TicketCount,
               data = train, controls = cforest_unbiased(ntree = 2000, mtry = 3))

Prediction <- predict(fit, test, OOB = TRUE, type = "response")

submit <- data.frame(PassengerId = test$PassengerId, Survived = Prediction)

write.csv(submit, file = "forest-11.csv", row.names = FALSE)

Upload to Kaggle:

The ranking rose to 127, entering the Top 2%.

Summary: this round with R took quite a while. To be honest, the random forest classifier still feels a bit hazy to me; many of the statements here were borrowed from various sources, and their meaning and usage still need further digestion and consolidation. Reading the English posts written by the experts online is quite interesting; their answers are enthusiastic and sincere.