The dataset has five attributes per donor: 1.) R for Recency: months since the last donation, 2.) F for Frequency: total number of donations, 3.) M for Monetary: total amount of blood donated in c.c., 4.) T for Time: months since the first donation, and 5.) Binary variable: 1 = donated blood, 0 = did not donate blood.
The main idea behind this dataset is the concept of customer relationship management (CRM). Based on three metrics, Recency, Frequency and Monetary (RFM), which are 3 of the 5 attributes of the dataset, we can predict whether a customer is likely to donate blood again in response to a marketing campaign. For example, customers who have donated or visited more recently (Recency), more frequently (Frequency) or contributed higher monetary value (Monetary) are more likely to respond to a marketing effort, while customers with lower RFM scores are less likely to react. It is also known from customer behavior that the time of the first positive interaction (donation, purchase) is not significant; the Recency of the last donation, however, is very important.
In the traditional RFM implementation, each customer is ranked on his or her R, F and M values against all the other customers, and these ranks yield a score for every customer. Customers with higher scores are more likely to react positively, for example to visit or donate again. The model constructs the formula that predicts this behavior.
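As a rough illustration of this rank-based scoring (written in Python rather than the report's R; the function and variable names here are my own, not from the report), each attribute can be split into tertiles by rank, with 3 for the best third of customers and 1 for the worst:

```python
def tertile_score(values, higher_is_better=True):
    """Rank-based 1-3 score: the best third of customers gets 3, the worst 1."""
    order = sorted(range(len(values)), key=lambda i: values[i],
                   reverse=higher_is_better)
    n = len(values)
    scores = [0] * n
    for rank, i in enumerate(order):
        scores[i] = 3 - (3 * rank) // n   # ranks 0..n-1 -> scores 3, 2, 1
    return scores

# Toy customers: low Recency is good, high Frequency is good.
recency = [2, 16, 30, 4, 23, 11]     # months since last donation
frequency = [50, 3, 1, 20, 5, 12]    # total number of donations
r = tertile_score(recency, higher_is_better=False)
f = tertile_score(frequency)
rf_total = [a + b for a, b in zip(r, f)]   # combined RF score, 2..6
```

The sum of the two per-attribute scores is the combined RF score used later for thresholding.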
Firstly, I created a .csv file and generated 748 unique random numbers in the range [1, 748] in the first column in Excel; these correspond to the customer (user) IDs. Then I transferred all the data from the .txt file (transfusion.data) into the .csv file in Excel using the comma-delimited option. I then randomly split it into a train file and a test file: the train file contains 530 instances and the test file 218 instances. Afterwards, I read in both the training dataset and the test dataset.
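The preparation steps just described (ID generation, import, 530/218 split) can be sketched as follows. This is a Python stand-in operating on synthetic rows, not the report's actual Excel/R workflow; only the column names and the counts are taken from the text:

```python
import csv
import io
import random

random.seed(0)

# Build a stand-in for transfusion.data: 748 rows with the dataset's columns,
# keeping Monetary = 250 * Frequency as described later in the report.
lines = ["Recency,Frequency,Monetary,Time,Donated"]
for _ in range(748):
    freq = random.randint(1, 50)
    lines.append(f"{random.randint(0, 40)},{freq},{freq * 250},"
                 f"{random.randint(2, 98)},{random.randint(0, 1)}")
rows = list(csv.DictReader(io.StringIO("\n".join(lines))))

# Attach 748 unique random customer IDs drawn from [1, 748].
for cid, row in zip(random.sample(range(1, 749), 748), rows):
    row["ID"] = cid

# Random split into a 530-instance train file and a 218-instance test file.
random.shuffle(rows)
train, test = rows[:530], rows[530:]
```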
From the previous results, we can see that we have no missing or invalid values. Data ranges and units seem reasonable.
Figure 1 above depicts boxplots of all the attributes for both the train and test datasets. Examining the figure, we notice that both datasets have similar distributions and that some outliers (Monetary > 2,500) are visible. The volume-of-blood variable is highly correlated with Frequency: because the volume of blood donated each time is fixed, the Monetary value is proportional to the Frequency (number of donations) of each person. For example, if the amount of blood drawn from each person was 250 ml per bag (Taiwan Blood Services Foundation, March 2007), then Monetary = 250 * Frequency. This is also why the predictive model will not use the Monetary attribute in the implementation. It is therefore reasonable to expect that customers with higher Frequency will have a much higher Monetary value. This can also be verified visually by examining the Monetary outliers in the train set, which returns 83 instances.
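A quick way to confirm the proportionality claim is to check Monetary against 250 * Frequency row by row (a Python sketch with made-up numbers; the fixed 250 ml/bag figure is the one cited above):

```python
# Made-up donor rows; with a fixed 250 ml bag, Monetary is 250 * Frequency.
frequency = [50, 13, 16, 20, 24]
monetary = [12500, 3250, 4000, 5000, 6000]

proportional = all(m == 250 * f for f, m in zip(frequency, monetary))
```

Perfect linear dependence like this is exactly why Monetary adds no information beyond Frequency and can be dropped.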
To better understand the statistical dispersion of the whole dataset (748 instances), we will look at the standard deviation (SD) of Recency and of Frequency within each class of the variable 'whether the customer has donated blood' (the binary variable). The distribution of scores around the mean is small, which means the data is concentrated. This can also be noticed from the plots.
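The per-class dispersion check can be sketched like this (Python for illustration; the values are invented and `sd_by_class` is a hypothetical helper, not part of the report's code):

```python
from statistics import stdev

# Invented Recency values split by the binary 'donated' label.
recency = [2, 4, 2, 11, 16, 23, 21, 14]
donated = [1, 1, 1, 1, 0, 0, 0, 0]

def sd_by_class(values, labels, cls):
    """Sample standard deviation of `values` restricted to one label class."""
    return stdev(v for v, l in zip(values, labels) if l == cls)

sd_donors = sd_by_class(recency, donated, 1)
sd_non_donors = sd_by_class(recency, donated, 0)
```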

Another observation is that the various Recency values are not multiples of 3. This contradicts the description's claim that the data was collected every 3 months. Additionally, there is always a maximum number of times one can donate blood in a given period (e.g. once per month), but the data shows otherwise: 36 customers donated blood more than once and 6 customers donated 3 or more times in the same month.
Two features will be used to predict whether a customer is likely to donate again: Recency and Frequency (RF). The Monetary feature will be dropped. The R and F attributes will each have 3 categories. The highest RF score will be 33 (equivalent to 6 when the two digits are added together) and the lowest 11 (equivalent to 2). The threshold on the added score for deciding whether a customer is more likely to donate blood again will be set to 4, which is the median value. Users will be assigned to categories by sorting on the RF attributes and their scores. The file with the donors will be sorted on Recency first (in ascending order, because we want to see which customers have donated blood most recently), and then on Frequency within each Recency category (in descending order this time, because we want to see which customers have donated the most times). Apart from sorting, we will need to apply some business rules that emerged after multiple tests:
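The threshold rule just described can be expressed compactly (an illustrative Python sketch; the actual implementation, including the Recency and Frequency cut-offs, is in the R appendix):

```python
THRESHOLD = 4  # median of the possible RF totals (2..6), as set in the text

def likely_to_donate(r_score, f_score):
    """Keep a customer as a marketing target when R + F meets the threshold."""
    return r_score + f_score >= THRESHOLD
```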
RESULTS
The output of the program is two smaller files, one derived from the train file and one from the test file, which exclude the customers who should not be considered future targets and keep those who are likely to respond. Statistics on the precision, recall and balanced F-score of the train and test files are calculated and printed. Furthermore, we compute the absolute difference between the results from the train and test files to get the offset error between these statistics. By verifying that these error numbers are negligible, we validate the consistency of the implemented model. Moreover, we depict two confusion matrices, one for the test set and one for the training set, by calculating the true positives, false negatives, false positives and true negatives. In our case, true positives are customers who donated in March 2007 and were classified as possible future donors. False negatives are customers who donated in March 2007 but were not classified as future targets for marketing campaigns. False positives are customers who did not donate in March 2007 but were incorrectly classified as possible future targets. Lastly, true negatives are customers who did not donate in March 2007 and were correctly classified as implausible future donors, and were therefore removed from the data file. By classification we mean applying the threshold (4) to separate customers who are more and less likely to donate again in a certain future period.
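The confusion-matrix counts and the precision, recall and balanced F-score described above can be sketched as follows (Python for illustration; the prediction and label vectors are invented, not the report's results):

```python
def confusion(pred, actual):
    """Confusion-matrix counts for binary predictions vs. true labels."""
    tp = sum(1 for p, a in zip(pred, actual) if p == 1 and a == 1)
    fp = sum(1 for p, a in zip(pred, actual) if p == 1 and a == 0)
    fn = sum(1 for p, a in zip(pred, actual) if p == 0 and a == 1)
    tn = sum(1 for p, a in zip(pred, actual) if p == 0 and a == 0)
    return tp, fp, fn, tn

pred = [1, 1, 0, 0, 1, 0, 1, 0]      # 1 = classified as future target
actual = [1, 0, 0, 1, 1, 0, 1, 1]    # 1 = donated in March 2007
tp, fp, fn, tn = confusion(pred, actual)

precision = tp / (tp + fp)                     # relevant retrieved / retrieved
recall = tp / (tp + fn)                        # relevant retrieved / relevant
f_balanced = 2 * precision * recall / (precision + recall)
```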
Lastly, we calculate two more single-value metrics for both the train and test files: the Kappa statistic (a general statistic for classification systems) and the Matthews Correlation Coefficient (MCC), a cost/reward measure. Both are normalized statistics for classification systems whose values never exceed 1, so the same statistic can be used even as the number of observations grows. The errors between train and test are 0.002577 for MCC and 0.002808 for Kappa, which is very small (negligible), in line with all the previous measures.
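Both statistics follow directly from the confusion-matrix counts; here is a sketch using the same formulas given in the appendix comments (Python for illustration, with invented counts):

```python
import math

def mcc(tp, fp, fn, tn):
    """Matthews Correlation Coefficient; never exceeds 1."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom

def kappa(tp, fp, fn, tn):
    """Cohen's Kappa: total accuracy corrected for chance (random) accuracy."""
    total = tp + fp + fn + tn
    total_acc = (tp + tn) / total
    random_acc = ((tn + fp) * (tn + fn) + (fn + tp) * (fp + tp)) / total ** 2
    return (total_acc - random_acc) / (1 - random_acc)

m = mcc(3, 1, 2, 2)     # invented counts, not the report's data
k = kappa(3, 1, 2, 2)
```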
REFERENCES
The Appendix with the code starts below. The whole code has also been uploaded to my GitHub profile and can be accessed at the link below.
https://github.com/it21208/RassignmentDataAnalysis/blob/master/RassignmentDataAnalysis.R
# read training and testing datasets
# assign the datasets to dataframes
# give better names to columns
#------------------------------------
# drop Time column from both files
# sort (train) dataframe on Recency in ascending order
# add column in (train) dataframe to hold the score (rank) of Recency for each customer
# convert train dataframe to matrix
# sort (test) dataframe on Recency in ascending order
# add column in (test) dataframe to hold the score (rank) of Recency for each customer
# convert test dataframe to matrix
# categorize matrix_train and add scores for Recency - apply business rule
for (i in 1:nrow(matrix_train)) {
  if (matrix_train[i,2] < 15) {
    matrix_train[i,6] <- 3
  } else if ((matrix_train[i,2] < 26) & (matrix_train[i,2] >= 15)) {
    matrix_train[i,6] <- 2
  } else {
    matrix_train[i,6] <- 1
  }
}
# categorize matrix_test and add scores for Recency - apply business rule
for (i in 1:nrow(matrix_test)) {
  if (matrix_test[i,2] < 15) {
    matrix_test[i,6] <- 3
  } else if ((matrix_test[i,2] < 26) & (matrix_test[i,2] >= 15)) {
    matrix_test[i,6] <- 2
  } else {
    matrix_test[i,6] <- 1
  }
}
# convert matrix_train back to dataframe
# sort dataframe 1st by Recency rank (desc.), then by Frequency (desc.)
# add column in train dataframe to hold the Frequency score (rank) for each customer
# convert dataframe to matrix
# convert matrix_test back to dataframe
# sort dataframe 1st by Recency rank (desc.), then by Frequency (desc.)
# add column in test dataframe to hold the Frequency score (rank) for each customer
# convert dataframe to matrix
# categorize matrix_train and add scores for Frequency
for (i in 1:nrow(matrix_train)) {
  if (matrix_train[i,3] >= 25) {
    matrix_train[i,7] <- 3
  } else if ((matrix_train[i,3] > 15) & (matrix_train[i,3] < 25)) {
    matrix_train[i,7] <- 2
  } else {
    matrix_train[i,7] <- 1
  }
}
# categorize matrix_test and add scores for Frequency
for (i in 1:nrow(matrix_test)) {
  if (matrix_test[i,3] >= 25) {
    matrix_test[i,7] <- 3
  } else if ((matrix_test[i,3] > 15) & (matrix_test[i,3] < 25)) {
    matrix_test[i,7] <- 2
  } else {
    matrix_test[i,7] <- 1
  }
}
# convert matrix_train back to dataframe
# sort (train) dataframe 1st on Recency rank (desc.) and 2nd on Frequency rank (desc.)
# add another column for the sum of Recency rank and Frequency rank
# convert dataframe to matrix
# convert matrix_test back to dataframe
# sort (test) dataframe 1st on Recency rank (desc.) and 2nd on Frequency rank (desc.)
# add another column for the sum of Recency rank and Frequency rank
# convert dataframe to matrix
# sum Recency rank and Frequency rank for train file
for (i in 1:nrow(matrix_train)) {
  matrix_train[i,8] <- matrix_train[i,6] + matrix_train[i,7]
}
# sum Recency rank and Frequency rank for test file
for (i in 1:nrow(matrix_test)) {
  matrix_test[i,8] <- matrix_test[i,6] + matrix_test[i,7]
}
# convert matrix_train back to dataframe
# sort train dataframe according to total rank in descending order
# convert sorted train dataframe to matrix
# convert matrix_test back to dataframe
# sort test dataframe according to total rank in descending order
# convert sorted test dataframe to matrix
# apply business rule: check & count customers whose score >= 4 and who have donated, train file
# also check & count all customers who have donated in the train dataset
for (i in 1:nrow(matrix_train)) {
  if ((matrix_train[i,8] >= 4) & (matrix_train[i,5] == 1)) {
    count_train_predicted_donations <- count_train_predicted_donations + 1
  }
  if ((matrix_train[i,8] >= 4) & (matrix_train[i,5] == 0)) {
    false_positives_train_counter <- false_positives_train_counter + 1
  }
  if (matrix_train[i,8] >= 4) {
    counter_train <- counter_train + 1
  }
  if (matrix_train[i,5] == 1) {
    number_donation_instances_whole_train <- number_donation_instances_whole_train + 1
  }
}
# apply business rule: check & count customers whose score >= 4 and who have donated, test file
# also check & count all customers who have donated in the test dataset
for (i in 1:nrow(matrix_test)) {
  if ((matrix_test[i,8] >= 4) & (matrix_test[i,5] == 1)) {
    count_test_predicted_donations <- count_test_predicted_donations + 1
  }
  if ((matrix_test[i,8] >= 4) & (matrix_test[i,5] == 0)) {
    false_positives_test_counter <- false_positives_test_counter + 1
  }
  if (matrix_test[i,8] >= 4) {
    counter_test <- counter_test + 1
  }
  if (matrix_test[i,5] == 1) {
    number_donation_instances_whole_test <- number_donation_instances_whole_test + 1
  }
}
# convert matrix_train to dataframe
# remove the group of customers who are less likely to donate again in the future from train file
# convert matrix_train to dataframe
# remove the group of customers who are less likely to donate again in the future from test file
# save final train dataframe as a CSV in the specified directory – reduced target future customers
# save final test dataframe as a CSV in the specified directory – reduced target future customers
# train precision = number of relevant instances retrieved / number of retrieved instances in the 530-instance collection
precision_train <- count_train_predicted_donations / counter_train
# train recall = number of relevant instances retrieved / number of relevant instances in the 530-instance collection
recall_train <- count_train_predicted_donations / number_donation_instances_whole_train
# the balanced F-score combines precision and recall as their harmonic mean, train file
f_balanced_score_train <- 2 * (precision_train * recall_train) / (precision_train + recall_train)
# test precision
precision_test <- count_test_predicted_donations / counter_test
# test recall
recall_test <- count_test_predicted_donations / number_donation_instances_whole_test
# the balanced F-score for the test file
f_balanced_score_test <- 2 * (precision_test * recall_test) / (precision_test + recall_test)
# error in precision
error_precision <- abs(precision_train - precision_test)
# error in recall
error_recall <- abs(recall_train - recall_test)
# error in balanced F-scores
error_f_balanced_scores <- abs(f_balanced_score_train - f_balanced_score_test)
# Print Statistics for verification and validation
# confusion matrix (true positives, false positives, false negatives, true negatives)
# true positives for train: the variable 'count_train_predicted_donations'
# false positives for train: the variable 'false_positives_train_counter'
# calculate false negatives for train
# calculate true negatives for train
# true positives for test: the variable 'count_test_predicted_donations'
# false positives for test: the variable 'false_positives_test_counter'
# calculate false negatives for test
# calculate true negatives for test
# print confusion matrix for train
  geom_tile(aes(fill = collect_train), colour = "white") +
  geom_text(aes(label = sprintf("%1.0f", collect_train)), vjust = 1) +
  scale_fill_gradient(low = "blue", high = "red") +
  theme_bw() + theme(legend.position = "none")
# print confusion matrix for test
  geom_tile(aes(fill = collect_test), colour = "white") +
  geom_text(aes(label = sprintf("%1.0f", collect_test)), vjust = 1) +
  scale_fill_gradient(low = "blue", high = "red") +
  theme_bw() + theme(legend.position = "none")
# MCC = (TP * TN - FP * FN) / sqrt((TP+FP) * (TP+FN) * (TN+FP) * (TN+FN)) for train values
# print MCC for train
# MCC = (TP * TN - FP * FN) / sqrt((TP+FP) * (TP+FN) * (TN+FP) * (TN+FN)) for test values
# print MCC for test
# print MCC error between train and test
# Total = TP + TN + FP + FN for train
# Total = TP + TN + FP + FN for test
# totalAccuracy = (TP + TN) / Total for train values
# totalAccuracy = (TP + TN) / Total for test values
# randomAccuracy = ((TN+FP)*(TN+FN) + (FN+TP)*(FP+TP)) / (Total*Total) for train values
# randomAccuracy = ((TN+FP)*(TN+FN) + (FN+TP)*(FP+TP)) / (Total*Total) for test values
# kappa = (totalAccuracy - randomAccuracy) / (1 - randomAccuracy) for train
# kappa = (totalAccuracy - randomAccuracy) / (1 - randomAccuracy) for test
# print kappa error