30 Breast Cancer Data Analysis with Naive Bayes

First, install the following packages:

install.packages("psych")
install.packages("corrplot")
install.packages("ggplot2")
install.packages("caret")
install.packages("ggpubr")
install.packages("klaR")

Next, load these packages:

library(psych)
library(corrplot)

corrplot 0.95 loaded

library(ggplot2)


Attaching package: 'ggplot2'

The following objects are masked from 'package:psych':

    %+%, alpha

library(caret)

Loading required package: lattice

Registered S3 method overwritten by 'future':
  method               from      
  all.equal.connection parallelly

library(ggpubr)
library(klaR)

Loading required package: MASS

This chapter follows the tutorial below:

https://www.kaggle.com/code/lbronchal/breast-cancer-dataset-analysis

Get the data set (if you work from a local copy, replace the path in the code below with your own file path). Data set link: https://www.kaggle.com/code/lbronchal/breast-cancer-dataset-analysis/input

First, load the data:

bc_data <- read.csv("https://raw.githubusercontent.com/emrahkirdok/ybva/main/05-istatistik/data.csv")

Convert the diagnosis to a factor (M: malignant tumor, B: benign tumor), drop column 33, and inspect the data set:

bc_data$diagnosis <- as.factor(bc_data$diagnosis)
bc_data[,33] <- NULL
summary(bc_data)

       id            diagnosis  radius_mean      texture_mean  
 Min.   :     8670   B:357     Min.   : 6.981   Min.   : 9.71  
 1st Qu.:   869218   M:212     1st Qu.:11.700   1st Qu.:16.17  
 Median :   906024             Median :13.370   Median :18.84  
 Mean   : 30371831             Mean   :14.127   Mean   :19.29  
 3rd Qu.:  8813129             3rd Qu.:15.780   3rd Qu.:21.80  
 Max.   :911320502             Max.   :28.110   Max.   :39.28  
 perimeter_mean     area_mean      smoothness_mean   compactness_mean 
 Min.   : 43.79   Min.   : 143.5   Min.   :0.05263   Min.   :0.01938  
 1st Qu.: 75.17   1st Qu.: 420.3   1st Qu.:0.08637   1st Qu.:0.06492  
 Median : 86.24   Median : 551.1   Median :0.09587   Median :0.09263  
 Mean   : 91.97   Mean   : 654.9   Mean   :0.09636   Mean   :0.10434  
 3rd Qu.:104.10   3rd Qu.: 782.7   3rd Qu.:0.10530   3rd Qu.:0.13040  
 Max.   :188.50   Max.   :2501.0   Max.   :0.16340   Max.   :0.34540  
 concavity_mean    concave.points_mean symmetry_mean    fractal_dimension_mean
 Min.   :0.00000   Min.   :0.00000     Min.   :0.1060   Min.   :0.04996       
 1st Qu.:0.02956   1st Qu.:0.02031     1st Qu.:0.1619   1st Qu.:0.05770       
 Median :0.06154   Median :0.03350     Median :0.1792   Median :0.06154       
 Mean   :0.08880   Mean   :0.04892     Mean   :0.1812   Mean   :0.06280       
 3rd Qu.:0.13070   3rd Qu.:0.07400     3rd Qu.:0.1957   3rd Qu.:0.06612       
 Max.   :0.42680   Max.   :0.20120     Max.   :0.3040   Max.   :0.09744       
   radius_se        texture_se      perimeter_se       area_se       
 Min.   :0.1115   Min.   :0.3602   Min.   : 0.757   Min.   :  6.802  
 1st Qu.:0.2324   1st Qu.:0.8339   1st Qu.: 1.606   1st Qu.: 17.850  
 Median :0.3242   Median :1.1080   Median : 2.287   Median : 24.530  
 Mean   :0.4052   Mean   :1.2169   Mean   : 2.866   Mean   : 40.337  
 3rd Qu.:0.4789   3rd Qu.:1.4740   3rd Qu.: 3.357   3rd Qu.: 45.190  
 Max.   :2.8730   Max.   :4.8850   Max.   :21.980   Max.   :542.200  
 smoothness_se      compactness_se      concavity_se     concave.points_se 
 Min.   :0.001713   Min.   :0.002252   Min.   :0.00000   Min.   :0.000000  
 1st Qu.:0.005169   1st Qu.:0.013080   1st Qu.:0.01509   1st Qu.:0.007638  
 Median :0.006380   Median :0.020450   Median :0.02589   Median :0.010930  
 Mean   :0.007041   Mean   :0.025478   Mean   :0.03189   Mean   :0.011796  
 3rd Qu.:0.008146   3rd Qu.:0.032450   3rd Qu.:0.04205   3rd Qu.:0.014710  
 Max.   :0.031130   Max.   :0.135400   Max.   :0.39600   Max.   :0.052790  
  symmetry_se       fractal_dimension_se  radius_worst   texture_worst  
 Min.   :0.007882   Min.   :0.0008948    Min.   : 7.93   Min.   :12.02  
 1st Qu.:0.015160   1st Qu.:0.0022480    1st Qu.:13.01   1st Qu.:21.08  
 Median :0.018730   Median :0.0031870    Median :14.97   Median :25.41  
 Mean   :0.020542   Mean   :0.0037949    Mean   :16.27   Mean   :25.68  
 3rd Qu.:0.023480   3rd Qu.:0.0045580    3rd Qu.:18.79   3rd Qu.:29.72  
 Max.   :0.078950   Max.   :0.0298400    Max.   :36.04   Max.   :49.54  
 perimeter_worst    area_worst     smoothness_worst  compactness_worst
 Min.   : 50.41   Min.   : 185.2   Min.   :0.07117   Min.   :0.02729  
 1st Qu.: 84.11   1st Qu.: 515.3   1st Qu.:0.11660   1st Qu.:0.14720  
 Median : 97.66   Median : 686.5   Median :0.13130   Median :0.21190  
 Mean   :107.26   Mean   : 880.6   Mean   :0.13237   Mean   :0.25427  
 3rd Qu.:125.40   3rd Qu.:1084.0   3rd Qu.:0.14600   3rd Qu.:0.33910  
 Max.   :251.20   Max.   :4254.0   Max.   :0.22260   Max.   :1.05800  
 concavity_worst  concave.points_worst symmetry_worst   fractal_dimension_worst
 Min.   :0.0000   Min.   :0.00000      Min.   :0.1565   Min.   :0.05504        
 1st Qu.:0.1145   1st Qu.:0.06493      1st Qu.:0.2504   1st Qu.:0.07146        
 Median :0.2267   Median :0.09993      Median :0.2822   Median :0.08004        
 Mean   :0.2722   Mean   :0.11461      Mean   :0.2901   Mean   :0.08395        
 3rd Qu.:0.3829   3rd Qu.:0.16140      3rd Qu.:0.3179   3rd Qu.:0.09208        
 Max.   :1.2520   Max.   :0.29100      Max.   :0.6638   Max.   :0.20750

Look at the share of each diagnosis class:

prop.table(table(bc_data$diagnosis))


        B         M 
0.6274165 0.3725835

Examine descriptive statistics for every variable:

describe(bc_data)

                        vars   n        mean           sd    median    trimmed
id                         1 569 30371831.43 125020585.61 906024.00 7344332.77
diagnosis*                 2 569        1.37         0.48      1.00       1.34
radius_mean                3 569       14.13         3.52     13.37      13.82
texture_mean               4 569       19.29         4.30     18.84      19.04
perimeter_mean             5 569       91.97        24.30     86.24      89.74
area_mean                  6 569      654.89       351.91    551.10     606.13
smoothness_mean            7 569        0.10         0.01      0.10       0.10
compactness_mean           8 569        0.10         0.05      0.09       0.10
concavity_mean             9 569        0.09         0.08      0.06       0.08
concave.points_mean       10 569        0.05         0.04      0.03       0.04
symmetry_mean             11 569        0.18         0.03      0.18       0.18
fractal_dimension_mean    12 569        0.06         0.01      0.06       0.06
radius_se                 13 569        0.41         0.28      0.32       0.36
texture_se                14 569        1.22         0.55      1.11       1.16
perimeter_se              15 569        2.87         2.02      2.29       2.51
area_se                   16 569       40.34        45.49     24.53      31.69
smoothness_se             17 569        0.01         0.00      0.01       0.01
compactness_se            18 569        0.03         0.02      0.02       0.02
concavity_se              19 569        0.03         0.03      0.03       0.03
concave.points_se         20 569        0.01         0.01      0.01       0.01
symmetry_se               21 569        0.02         0.01      0.02       0.02
fractal_dimension_se      22 569        0.00         0.00      0.00       0.00
radius_worst              23 569       16.27         4.83     14.97      15.73
texture_worst             24 569       25.68         6.15     25.41      25.39
perimeter_worst           25 569      107.26        33.60     97.66     103.42
area_worst                26 569      880.58       569.36    686.50     788.02
smoothness_worst          27 569        0.13         0.02      0.13       0.13
compactness_worst         28 569        0.25         0.16      0.21       0.23
concavity_worst           29 569        0.27         0.21      0.23       0.25
concave.points_worst      30 569        0.11         0.07      0.10       0.11
symmetry_worst            31 569        0.29         0.06      0.28       0.28
fractal_dimension_worst   32 569        0.08         0.02      0.08       0.08
                             mad     min          max        range skew
id                      65567.98 8670.00 911320502.00 911311832.00 6.44
diagnosis*                  0.00    1.00         2.00         1.00 0.53
radius_mean                 2.82    6.98        28.11        21.13 0.94
texture_mean                4.17    9.71        39.28        29.57 0.65
perimeter_mean             18.84   43.79       188.50       144.71 0.99
area_mean                 227.28  143.50      2501.00      2357.50 1.64
smoothness_mean             0.01    0.05         0.16         0.11 0.45
compactness_mean            0.05    0.02         0.35         0.33 1.18
concavity_mean              0.06    0.00         0.43         0.43 1.39
concave.points_mean         0.03    0.00         0.20         0.20 1.17
symmetry_mean               0.03    0.11         0.30         0.20 0.72
fractal_dimension_mean      0.01    0.05         0.10         0.05 1.30
radius_se                   0.16    0.11         2.87         2.76 3.07
texture_se                  0.47    0.36         4.88         4.52 1.64
perimeter_se                1.14    0.76        21.98        21.22 3.43
area_se                    13.63    6.80       542.20       535.40 5.42
smoothness_se               0.00    0.00         0.03         0.03 2.30
compactness_se              0.01    0.00         0.14         0.13 1.89
concavity_se                0.02    0.00         0.40         0.40 5.08
concave.points_se           0.01    0.00         0.05         0.05 1.44
symmetry_se                 0.01    0.01         0.08         0.07 2.18
fractal_dimension_se        0.00    0.00         0.03         0.03 3.90
radius_worst                3.65    7.93        36.04        28.11 1.10
texture_worst               6.42   12.02        49.54        37.52 0.50
perimeter_worst            25.01   50.41       251.20       200.79 1.12
area_worst                319.65  185.20      4254.00      4068.80 1.85
smoothness_worst            0.02    0.07         0.22         0.15 0.41
compactness_worst           0.13    0.03         1.06         1.03 1.47
concavity_worst             0.20    0.00         1.25         1.25 1.14
concave.points_worst        0.07    0.00         0.29         0.29 0.49
symmetry_worst              0.05    0.16         0.66         0.51 1.43
fractal_dimension_worst     0.01    0.06         0.21         0.15 1.65
                        kurtosis         se
id                         41.66 5241135.60
diagnosis*                 -1.73       0.02
radius_mean                 0.81       0.15
texture_mean                0.73       0.18
perimeter_mean              0.94       1.02
area_mean                   3.59      14.75
smoothness_mean             0.82       0.00
compactness_mean            1.61       0.00
concavity_mean              1.95       0.00
concave.points_mean         1.03       0.00
symmetry_mean               1.25       0.00
fractal_dimension_mean      2.95       0.00
radius_se                  17.45       0.01
texture_se                  5.26       0.02
perimeter_se               21.12       0.08
area_se                    48.59       1.91
smoothness_se              10.32       0.00
compactness_se              5.02       0.00
concavity_se               48.24       0.00
concave.points_se           5.04       0.00
symmetry_se                 7.78       0.00
fractal_dimension_se       25.94       0.00
radius_worst                0.91       0.20
texture_worst               0.20       0.26
perimeter_worst             1.04       1.41
area_worst                  4.32      23.87
smoothness_worst            0.49       0.00
compactness_worst           2.98       0.01
concavity_worst             1.57       0.01
concave.points_worst       -0.55       0.00
symmetry_worst              4.37       0.00
fractal_dimension_worst     5.16       0.00

Use a correlation matrix to see which variables are correlated with each other:

corr_mat <- cor(bc_data[,3:ncol(bc_data)])
corrplot(corr_mat, order = "hclust", tl.cex = 1, addrect = 8)
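
The clustered plot shows blocks of strongly correlated features. As a complement, the minimal sketch below lists the variable pairs whose absolute correlation exceeds a cutoff. It uses the built-in `mtcars` data so that it runs on its own; the helper name `high_corr_pairs` and the 0.9 cutoff are illustrative assumptions and are not part of the original tutorial. With the data above you would pass `corr_mat` instead of `toy_corr`.

```r
# List variable pairs whose absolute correlation exceeds a cutoff.
# `high_corr_pairs` is a hypothetical helper written for this illustration.
high_corr_pairs <- function(cm, cutoff = 0.9) {
  # Keep only the upper triangle so that each pair is reported once
  idx <- which(abs(cm) > cutoff & upper.tri(cm), arr.ind = TRUE)
  data.frame(
    var1 = rownames(cm)[idx[, "row"]],
    var2 = colnames(cm)[idx[, "col"]],
    r    = cm[idx],
    stringsAsFactors = FALSE
  )
}

toy_corr    <- cor(mtcars)   # built-in data, stands in for corr_mat
pairs_found <- high_corr_pairs(toy_corr, cutoff = 0.9)
print(pairs_found)
```

Strongly correlated predictors matter here because naive Bayes treats the features as independent given the class, so redundant features are effectively counted more than once.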

Pick one of the features and plot its distribution by diagnosis:

ggplot(bc_data, aes(x=texture_mean)) + geom_density(alpha=0.5, aes(fill=diagnosis)) + labs(title="Texture distribution by outcome")

Now we move on to the machine learning part. Here the data set is split at random into two parts: a training set that "teaches" the model and a testing set that evaluates it.

set.seed(1234)
data_index <- createDataPartition(bc_data$diagnosis, p=0.7, list = FALSE)
train_data <- bc_data[data_index, -1]
test_data <- bc_data[-data_index, -1]

Now check the class proportions in each part:

Full data set:

prop.table(table(bc_data$diagnosis)) * 100


       B        M 
62.74165 37.25835

Training set:

prop.table(table(train_data$diagnosis)) * 100


       B        M 
62.65664 37.34336

Testing set:

prop.table(table(test_data$diagnosis)) * 100


       B        M 
62.94118 37.05882
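
The proportions stay almost identical because `createDataPartition` samples within each class rather than from the data set as a whole. Here is a minimal base-R sketch of that idea, run on made-up labels with the same class counts as above. The helper `stratified_index` is hypothetical and is not caret's implementation, so the selected rows will differ from those in `data_index`.

```r
# Stratified 70/30 split by hand: sample 70% of the row indices within each class.
# `stratified_index` is a hypothetical helper, not caret's implementation.
stratified_index <- function(y, p = 0.7) {
  rows_by_class <- split(seq_along(y), y)
  picked <- lapply(rows_by_class, function(rows) {
    # Index into `rows` so that a class with a single row is handled correctly
    rows[sample.int(length(rows), size = ceiling(p * length(rows)))]
  })
  sort(unlist(picked, use.names = FALSE))
}

set.seed(1234)
labels    <- factor(rep(c("B", "M"), times = c(357, 212)))  # same counts as bc_data
train_idx <- stratified_index(labels, p = 0.7)

train_share <- prop.table(table(labels[train_idx])) * 100
test_share  <- prop.table(table(labels[-train_idx])) * 100
print(round(rbind(train = train_share, test = test_share), 2))
```

With these class counts the sketch puts 250 benign and 149 malignant cases into the training part, which gives the same shares as reported above.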

Visualize the data with a principal component analysis (PCA):

pca_res <- prcomp(bc_data[,3:ncol(bc_data)], center = TRUE, scale. = TRUE)
plot(pca_res, type="l")

summary(pca_res)

Importance of components:
                          PC1    PC2     PC3     PC4     PC5     PC6     PC7
Standard deviation     3.6444 2.3857 1.67867 1.40735 1.28403 1.09880 0.82172
Proportion of Variance 0.4427 0.1897 0.09393 0.06602 0.05496 0.04025 0.02251
Cumulative Proportion  0.4427 0.6324 0.72636 0.79239 0.84734 0.88759 0.91010
                           PC8    PC9    PC10   PC11    PC12    PC13    PC14
Standard deviation     0.69037 0.6457 0.59219 0.5421 0.51104 0.49128 0.39624
Proportion of Variance 0.01589 0.0139 0.01169 0.0098 0.00871 0.00805 0.00523
Cumulative Proportion  0.92598 0.9399 0.95157 0.9614 0.97007 0.97812 0.98335
                          PC15    PC16    PC17    PC18    PC19    PC20   PC21
Standard deviation     0.30681 0.28260 0.24372 0.22939 0.22244 0.17652 0.1731
Proportion of Variance 0.00314 0.00266 0.00198 0.00175 0.00165 0.00104 0.0010
Cumulative Proportion  0.98649 0.98915 0.99113 0.99288 0.99453 0.99557 0.9966
                          PC22    PC23   PC24    PC25    PC26    PC27    PC28
Standard deviation     0.16565 0.15602 0.1344 0.12442 0.09043 0.08307 0.03987
Proportion of Variance 0.00091 0.00081 0.0006 0.00052 0.00027 0.00023 0.00005
Cumulative Proportion  0.99749 0.99830 0.9989 0.99942 0.99969 0.99992 0.99997
                          PC29    PC30
Standard deviation     0.02736 0.01153
Proportion of Variance 0.00002 0.00000
Cumulative Proportion  1.00000 1.00000
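
The "Proportion of Variance" row follows directly from the standard deviations: it is each component's variance (`sdev^2`) divided by the total variance. Because the 30 features were scaled to unit variance, the total is 30, so PC1 explains 3.6444^2 / 30, which is about 0.4427. The sketch below checks the same relation on the built-in `USArrests` data, used here only so that the example runs without the download; the object names are illustrative.

```r
# Recompute "Proportion of Variance" and "Cumulative Proportion" from sdev.
toy_pca <- prcomp(USArrests, center = TRUE, scale. = TRUE)

prop_var <- toy_pca$sdev^2 / sum(toy_pca$sdev^2)
cum_var  <- cumsum(prop_var)

# These match the corresponding rows of summary(toy_pca)
print(round(rbind(prop_var, cum_var), 4))

# Number of components needed to reach 95% of the variance
n_comp_95 <- which(cum_var >= 0.95)[1]
print(n_comp_95)

# Check of the PC1 value reported above (30 scaled features)
pc1_share <- 3.6444^2 / 30
print(pc1_share)
```

In the breast cancer output above, the cumulative proportion first passes 0.95 at PC10.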

Plot the first two principal components, colored by diagnosis:

pca_df <- as.data.frame(pca_res$x)
ggplot(pca_df, aes(x=PC1, y=PC2, col=bc_data$diagnosis)) + geom_point(alpha=0.5)

Examine each of the two components separately by diagnosis:

g_pc1 <- ggplot(pca_df, aes(x=PC1, fill=bc_data$diagnosis)) + geom_density(alpha=0.25)  
g_pc2 <- ggplot(pca_df, aes(x=PC2, fill=bc_data$diagnosis)) + geom_density(alpha=0.25)  
ggarrange(g_pc1, g_pc2, ncol=2, common.legend = TRUE)

Set up 5-fold cross-validation and train the naive Bayes model:

fitControl <- trainControl(method="cv", number = 5, preProcOptions = list(thresh = 0.99), classProbs = TRUE, summaryFunction = twoClassSummary)

model_nb <- train(diagnosis~., train_data, method="nb", metric="ROC", preProcess=c('center', 'scale'), trace=FALSE, trControl=fitControl)
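
To make clear what kind of computation the classifier performs, here is a minimal from-scratch sketch of Gaussian naive Bayes on the built-in `iris` data. Each class gets a prior and, for every feature, a mean and a standard deviation; a new observation is assigned to the class with the highest sum of log prior and log densities. The function names `gnb_fit` and `gnb_predict` are hypothetical. This illustrates the technique and is not what `klaR::NaiveBayes` runs internally; in particular, caret's `nb` method can also choose kernel density estimates instead of normal densities during tuning.

```r
# Gaussian naive Bayes from scratch (illustrative sketch, assumes at least
# two numeric features and no feature with zero variance within a class).
gnb_fit <- function(x, y) {
  x <- as.matrix(x)
  y <- factor(y)
  list(
    levels = levels(y),
    prior  = as.numeric(table(y)) / length(y),
    # Per-class mean and sd of every feature: matrices of size classes x features
    mu     = t(sapply(levels(y), function(k) colMeans(x[y == k, , drop = FALSE]))),
    sigma  = t(sapply(levels(y), function(k) apply(x[y == k, , drop = FALSE], 2, sd)))
  )
}

gnb_predict <- function(model, newx) {
  newx <- as.matrix(newx)
  # One column of log posterior scores (up to a constant) per class
  scores <- do.call(cbind, lapply(seq_along(model$levels), function(k) {
    # t(newx) is features x observations, so mean and sd recycle per feature
    log_dens <- dnorm(t(newx), mean = model$mu[k, ], sd = model$sigma[k, ], log = TRUE)
    log(model$prior[k]) + colSums(log_dens)
  }))
  factor(model$levels[max.col(scores, ties.method = "first")], levels = model$levels)
}

set.seed(1)
train_rows <- sample(nrow(iris), 100)
fit  <- gnb_fit(iris[train_rows, 1:4], iris$Species[train_rows])
pred <- gnb_predict(fit, iris[-train_rows, 1:4])
acc  <- mean(pred == iris$Species[-train_rows])  # share of correct predictions
print(acc)
```

Working with log densities avoids numerical underflow when many small probabilities are multiplied, which becomes relevant with 30 features as in the breast cancer data.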

Now evaluate the model on the test set and inspect the results (the 5-fold cross-validation above already gave an internal performance estimate during training):

Predict <- predict(model_nb,newdata = test_data)
confusionMatrix(Predict, test_data$diagnosis)

Confusion Matrix and Statistics

          Reference
Prediction   B   M
         B 102  10
         M   5  53
                                          
               Accuracy : 0.9118          
                 95% CI : (0.8586, 0.9498)
    No Information Rate : 0.6294          
    P-Value [Acc > NIR] : <2e-16          
                                          
                  Kappa : 0.8077          
                                          
 Mcnemar's Test P-Value : 0.3017          
                                          
            Sensitivity : 0.9533          
            Specificity : 0.8413          
         Pos Pred Value : 0.9107          
         Neg Pred Value : 0.9138          
             Prevalence : 0.6294          
         Detection Rate : 0.6000          
   Detection Prevalence : 0.6588          
      Balanced Accuracy : 0.8973          
                                          
       'Positive' Class : B
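
Each statistic in this output can be recomputed from the four cell counts, which is a useful check on how to read them. The sketch below does so in base R; the variable names are illustrative. Note that the positive class is B, so "Sensitivity" refers to benign cases and "Specificity" to malignant ones.

```r
# Recompute the reported statistics from the four cell counts.
# Rows are predictions, columns are the reference; "B" is the positive class.
cm <- matrix(c(102, 5, 10, 53), nrow = 2,
             dimnames = list(Prediction = c("B", "M"), Reference = c("B", "M")))

tp <- cm["B", "B"]; fn <- cm["M", "B"]
fp <- cm["B", "M"]; tn <- cm["M", "M"]
n  <- sum(cm)

accuracy    <- (tp + tn) / n
sensitivity <- tp / (tp + fn)   # benign cases correctly predicted as benign
specificity <- tn / (tn + fp)   # malignant cases correctly predicted as malignant
ppv         <- tp / (tp + fp)
npv         <- tn / (tn + fn)
balanced    <- (sensitivity + specificity) / 2

# Cohen's kappa: observed agreement corrected for chance agreement
p_chance <- sum(rowSums(cm) * colSums(cm)) / n^2
kappa    <- (accuracy - p_chance) / (1 - p_chance)

stats <- c(accuracy = accuracy, sensitivity = sensitivity, specificity = specificity,
           ppv = ppv, npv = npv, balanced = balanced, kappa = kappa)
print(round(stats, 4))
```

Read this way, the specificity of 0.8413 means that 10 of the 63 malignant tumors in the test set were predicted as benign. To report the statistics with M as the positive class, `confusionMatrix` accepts a `positive` argument, for example `positive = "M"`.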

Compare the importance of each predictor variable for the outcome:

X <- varImp(model_nb)
plot(X)
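
For models without a built-in importance measure, caret's documentation describes a model-independent "filter" approach in which each predictor is scored on its own by ROC curve analysis. Assuming that this is what `varImp` falls back to for the `nb` model here, the sketch below shows the underlying idea: the area under the ROC curve (AUC) of a single feature, computed with the rank-sum formula. It uses the built-in `iris` data reduced to two classes. The helper `feature_auc` is hypothetical, and its output is not on the same scale as the values in caret's plot.

```r
# Area under the ROC curve for a single numeric feature, via the rank-sum formula.
# `feature_auc` is a hypothetical helper for illustration.
feature_auc <- function(x, y, positive) {
  is_pos <- y == positive
  n_pos  <- sum(is_pos)
  n_neg  <- sum(!is_pos)
  r      <- rank(x)                      # average ranks handle ties
  auc    <- (sum(r[is_pos]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
  max(auc, 1 - auc)                      # the direction of the effect does not matter
}

two_class <- droplevels(iris[iris$Species != "setosa", ])
auc_by_feature <- sapply(two_class[, 1:4], feature_auc,
                         y = two_class$Species, positive = "virginica")
print(sort(auc_by_feature, decreasing = TRUE))
```

An AUC near 1 means the feature alone separates the two classes almost perfectly, while a value near 0.5 means it carries little information by itself. Because each feature is scored in isolation, strongly correlated features tend to receive similar scores.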

The interpretation is now up to you: based on the plot above, which features appear to be the most important for diagnosing breast cancer?