International Journal of Engineering Business and Social Science

Vol. 2 No. 04, March-April 2024, pages: 1194-1204

e-ISSN: 2980-4108, p-ISSN: 2980-4272

https://ijebss.ph/index.php/ijebss

Sentiment Analysis on ChatGPT App Reviews on Google Play Store Using Random Forest Algorithm, Support Vector Machine and Naïve Bayes

Gilbert Jeffson Sagala, Yusran Timur Samuel

Universitas Advent Indonesia Bandung, Indonesia

Email: gilbertjeffsonsagala@gmail.com, y.tarihoran@unai.edu

Keywords

ChatGPT; Random Forest; Support Vector Machine; Naïve Bayes; Google Play Store.

Abstract

This study conducts a sentiment analysis of ChatGPT app reviews on the Google Play Store using three classification methods: Random Forest, Support Vector Machine (SVM), and Naïve Bayes. Its main purpose is to describe and understand user sentiment towards the application. Of 2652 reviews of ChatGPT collected from July 28, 2023, to January 28, 2024, 2326 (87.71%) were positive and 326 (12.29%) were negative, which means that users responded predominantly positively to ChatGPT based on Google Play Store ratings. Because the data are imbalanced, the researchers used the f1-score to compare the methods, as it reflects how well a model handles the minority class. The three algorithms were evaluated on testing data of 796 reviews (30% of the 2652 reviews). Random Forest obtained an f1-score of 90%, with 87.43% of the test data correctly classified as positive and 3.89% correctly classified as negative; Support Vector Machine obtained an f1-score of 90%, with 86.80% correctly classified as positive and 4.27% correctly classified as negative; and Naïve Bayes obtained an f1-score of 87%, with 88.06% correctly classified as positive and 2.01% correctly classified as negative. It can therefore be concluded that users of the ChatGPT application experienced a predominantly positive impact, and that Support Vector Machine and Random Forest were the most effective methods in this study, as shown by their higher f1-scores.

for possible open-access publication

under the terms and conditions of the Creative Commons Attribution (CC BY SA)

license (https://creativecommons.org/licenses/by-sa/4.0/).

1. Introduction

The rapid development of computer-based information technology significantly impacts changes in various

aspects of human life. Artificial Intelligence is the latest technology product resulting from rapid technological

advances. Artificial Intelligence allows computers to carry out many tasks that humans do, making it a widely utilised

technology product in application development today. This makes it easier for humans to meet their various needs

(Rifaldi, Ramadhan, & Jaelani, 2023).

In November 2022, an AI research lab called OpenAI launched a chatbot application called ChatGPT. This

chatbot is a natural language processing technology that can respond to human questions through text (prompts) typed

in the application. What attracts a lot of attention is that the answers given by ChatGPT look well structured, the relationships between words or sentences are coherent, the accuracy is quite good, and it can remember previous conversations. As of November 2023, ChatGPT had 100 million weekly active users (Setiawan & Luthfiyani, 2023).

Reviews express a person's assessment of a product or service. Sentiment analysis helps us understand what

customers think through their reviews about the product or service. These reviews can be a valuable source of

information for consumers. For example, before buying a product, most people look for reviews about the product to

help them make decisions (Hasibuan & Heriyanto, 2022). As a digital platform, Google Play Store allows users to

share their experiences through app reviews.

In a study on PSBB sentiment analysis by comparing random forest classification methods and support vector

machines conducted by Adrian et al. (Adrian, Putra, Rafialdy, & Rakhmawati, 2021), the results showed that the

Random Forest algorithm had an accuracy rate of 58%, with precision, recall, and f1-score values of 35%, 58%, and

44% respectively. Meanwhile, the Support Vector Machine algorithm achieved an accuracy rate of 56%, with

precision, recall, and f1-score values of 38%, 56%, and 44%, respectively. The performance of these two algorithms

is considered low because the dataset used is very limited, consisting of only 466 tweet data (Ratnawati &

Sulistyaningrum, 2020).

Then, research on sentiment analysis about the Ruangguru application using naïve bayes, random forest and

support vector machine classification methods conducted by Evita Fitri et al. (Fitri, 2020) found that the Random

Forest model had the highest accuracy of 97.16%, with AUC reaching 0.996. Meanwhile, the Support Vector Machine

algorithm showed an accuracy of 96.01%, with an AUC of 0.543. On the other hand, the Naïve Bayes algorithm has

the lowest accuracy, with a value of 94.16% and an AUC of 0.999 (Muslimin & Lusiana, 2023). Thus, based on the

test results, it can be concluded that Random Forest performs better than the other two algorithms (Fernández-

Gavilanes, Álvarez-López, Juncal-Martínez, Costa-Montenegro, & González-Castaño, 2016).

Based on the background description described in this study, the author chose the title "Sentiment Analysis on

ChatGPT Application Reviews on the Google Play Store Using the Random Forest Algorithm Method, Support Vector

Machine and Naïve Bayes". This study aims to see the accuracy of each classification method of the three methods

and compare the three (Prayoginingsih & Kusumawardani, 2018).

2. Materials and Methods

This study is an experiment in sentiment analysis of ChatGPT reviews by applying Random Forest, Support

Vector Machine, and Naïve Bayes classification models. The stages start from dataset retrieval, data labelling, text

preprocessing, term weighting, algorithm implementation, classification results, and evaluation (Fitri, 2020).

Figure 1

Stages of Sentiment Analysis of ChatGPT reviews

Sentiment Analysis

Sentiment analysis is one of the techniques used to recognise an opinion or feeling conveyed through a text or

document, as well as how that opinion is classified as positive or negative. Sentiment analysis seeks to evaluate various

aspects in standard language to help an institution or company understand positive and negative opinions regarding

the products they provide (Tuhuteru & Iriani, 2018).

The sentiment itself can be interpreted as an emerging concept in which everyone's different emotions are

determined by the content of the text so that it can be processed to extract the opinions and sentiments of many people.

In sentiment analysis, three views can guide agencies or companies to obtain information about the products' quality:

positive, negative, and neutral (Klyueva, 2019).

Sentiment analysis is a new section of research in Natural Language Processing (NLP) that aims to find

subjectivity in texts or documents to classify opinions or sentiments. Three techniques are generally applied in the

sentiment classification method: Machine Learning, lexicon-based, and Hybrid Approach. Today, sentiment analysis

often uses Machine Learning techniques because of the method's ability to predict sentiment polarity based on prepared

data.

Dataset Collection

For the sentiment analysis, data were collected from reviews of the ChatGPT app on the Google Play Store. The data were retrieved by scraping with the Python library Google Play Scraper. The dataset consists of 2652 text reviews, sorted by most recent and covering the last six months, from July 28, 2023, to January 28, 2024.

Term Weighting

In this method, each word in the review will be given a weight or rating based on its significance in context. In

other words, this method converts text into numbers that represent values. The technique used is TF-IDF (Term

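Frequency-Inverse Document Frequency), which combines the frequency of a term within a review (TF) with the inverse document frequency (IDF), which lowers the weight of terms that appear in many reviews and are therefore less informative about the topic [1]. The following formulas give the TF and IDF values of each word.

$$\mathrm{TF}(t, d) = \frac{\text{number of times word } t \text{ appears in document } d}{\text{total word count in document } d} \quad (1)$$

$$\mathrm{IDF}(t, D) = \log\left(\frac{\text{number of documents in the corpus } D}{\text{number of documents in corpus } D \text{ containing the word } t}\right) \quad (2)$$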

After the TF and IDF values are obtained with the previous formula, TF-IDF can be obtained with the

formula below.

TF-IDF(t, d, D) = TF(t, d) × IDF(t, D) (3)
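As an illustration of equations (1) to (3), the following minimal sketch computes TF-IDF by hand on three made-up, already tokenised reviews. The toy corpus, the function names, and the choice of the natural logarithm are assumptions for this example; the paper does not state the logarithm base or show its implementation.

```python
import math
from collections import Counter


def tf(term, doc_tokens):
    """Equation (1): occurrences of `term` in the document / total words in it."""
    return Counter(doc_tokens)[term] / len(doc_tokens)


def idf(term, corpus):
    """Equation (2): log(documents in corpus / documents containing `term`)."""
    containing = sum(1 for doc in corpus if term in doc)
    if containing == 0:
        return 0.0  # term never occurs; avoid division by zero
    return math.log(len(corpus) / containing)  # natural log is an assumption


def tf_idf(term, doc_tokens, corpus):
    """Equation (3): TF-IDF(t, d, D) = TF(t, d) * IDF(t, D)."""
    return tf(term, doc_tokens) * idf(term, corpus)


# Hypothetical, already tokenised reviews (not taken from the dataset).
corpus = [
    ["app", "very", "helpful"],
    ["app", "login", "error"],
    ["helpful", "accurate", "answer", "helpful"],
]

for term in ("app", "error", "helpful"):
    weights = [round(tf_idf(term, doc, corpus), 3) for doc in corpus]
    print(term, weights)
```

A word such as "error", which occurs in only one review, receives a higher weight there than "app", which occurs in two. Library implementations may differ from these plain formulas; scikit-learn's `TfidfVectorizer`, for instance, applies smoothing and normalisation by default, so its numbers would not match this sketch exactly.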

3. Results and Discussions

In the initial phase, the study began with a dataset of 2652 review data collected from July 28, 2023, to

January 28, 2024.

Figure 2 Dataset Collection

After successfully collecting the dataset, the next step is to clean the data, such as removing emojis,

numbers, and punctuation marks and changing uppercase letters to lowercase.

Figure 3 Data Cleansing and Case Folding

Next, the text is tokenised, that is, split into individual words.

Figure 4 Tokenizing a Dataset

The final preprocessing stage removes stopwords (words that have no effect on sentiment) and strips affixes from the remaining words (stemming).

Figure 5 Stopword Removal and Stemming
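The preprocessing stages described above (cleaning, case folding, tokenising, stopword removal, and stemming) can be sketched as a small pipeline. The stopword list, the suffix rules, and the sample review below are hypothetical and for illustration only; the paper does not say which stopword list or stemmer was used, and a real pipeline would use ones suited to the language of the reviews.

```python
import re

# Hypothetical stopword list and suffix rules, for illustration only.
STOPWORDS = {"the", "is", "a", "and", "to", "it", "this"}
SUFFIXES = ("ing", "ed", "s")


def clean_and_fold(text):
    """Replace emojis, digits and punctuation with spaces, then lowercase.

    Note: this keeps ASCII letters only, so accented letters are dropped too.
    """
    return re.sub(r"[^A-Za-z\s]", " ", text).lower()


def tokenize(text):
    """Split the cleaned text into words on whitespace."""
    return text.split()


def remove_stopwords(tokens):
    return [t for t in tokens if t not in STOPWORDS]


def naive_stem(token):
    """Strip one known suffix; a toy stand-in for a real stemmer."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def preprocess(text):
    tokens = remove_stopwords(tokenize(clean_and_fold(text)))
    return [naive_stem(t) for t in tokens]


print(preprocess("This app is AMAZING!!! 😍 It answered 10 questions."))
# ['app', 'amaz', 'answer', 'question']
```

The toy stemmer produces stems such as "amaz" that are not dictionary words, which is typical of rule-based stemming and harmless as long as the same rules are applied to every review.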

Researchers conducted sentiment analysis using Google Colab and Python programming language. The

study was conducted on 2652 data, with 2326 data labelled as positive (87.71%) and 326 as negative (12.29%).

Researchers divided the data into 70% training data (1,856 reviews) and 30% testing data (796 reviews).
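The split sizes follow directly from the dataset size: 30% of 2652 is 795.6, which gives 796 test reviews when rounded up, leaving 1856 for training. A minimal sketch of such a shuffled split is shown below; the function name, the seed, and the rounding rule are assumptions, since the paper does not describe how the split was implemented.

```python
import math
import random


def split_70_30(items, test_fraction=0.30, seed=42):
    """Shuffle a copy of `items` and return (train, test)."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    # Rounding the test size up is an assumption that reproduces 796.
    n_test = math.ceil(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


reviews = list(range(2652))  # stand-ins for the 2652 reviews
train, test = split_70_30(reviews)
print(len(train), len(test))  # 1856 796
```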

Figure 6 Sentiment chart before SMOTE

The large difference between the number of positive and negative reviews makes the dataset imbalanced, which can cause a classifier to assign minority-class reviews to the majority class. Therefore, researchers use oversampling to balance the data by adding data to the minority class. The oversampling method used is the Synthetic Minority Oversampling Technique (SMOTE), which addresses class imbalance by generating synthetic minority-class samples and is intended to reduce the overfitting that simple duplication of minority samples can cause (Utami, 2022).
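The core idea of SMOTE can be sketched in a few lines: pick a minority sample, pick one of its nearest minority-class neighbours, and create a new point somewhere on the line segment between them. The code below is a simplified illustration on made-up two-dimensional vectors, not the implementation used in the study; in practice a library such as imbalanced-learn would typically be used, and the function name and parameters here are hypothetical.

```python
import random


def smote(minority, n_synthetic, k=5, seed=0):
    """Create `n_synthetic` points by interpolating between a minority sample
    and one of its k nearest minority-class neighbours."""
    rng = random.Random(seed)

    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    synthetic = []
    for _ in range(n_synthetic):
        base = rng.choice(minority)
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: sq_dist(base, p),
        )[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic


# Hypothetical 2-D feature vectors standing in for TF-IDF rows.
positive = [[i / 10, 1.0] for i in range(12)]
negative = [[0.1, 0.0], [0.3, 0.1], [0.5, 0.0], [0.7, 0.2]]

negative_balanced = negative + smote(negative, len(positive) - len(negative), k=3)
print(len(positive), len(negative_balanced))  # 12 12
```

Because every synthetic point is a blend of two real minority samples, the new points stay inside the region already occupied by the minority class instead of being exact copies.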

Figure 7 Sentiment chart after SMOTE

Figure 7 shows the dataset after SMOTE was applied, which yields a balanced dataset of 2326 positive and 2326 negative reviews. The researchers then fed the data into each classification algorithm and obtained the following results.

Random Forest

Table 1
Random Forest through the Confusion Matrix

                    Predicted Negative    Predicted Positive
Actual Negative     31                    63
Actual Positive     6                     696

Researchers used the Scikit-Learn library to apply Random Forest classification to the data. Of the 796 test reviews, 696 (87.43%) were positive and predicted as positive (True Positive), showing that the model identifies positive cases reliably. A further 31 reviews (3.89%) were negative and predicted as negative (True Negative). On the other hand, 63 reviews (7.91%) were predicted as positive when they were in fact negative (False Positive), while only 6 reviews (0.75%) were predicted as negative when they were in fact positive (False Negative). The errors are therefore concentrated in the negative class: the model tends to label negative reviews as positive (Oktavia, Ramadahan, & Minarto, 2023).

Table 2
Results with Random Forest

            Precision    Recall
Negative    0.84         0.33
Positive    0.92         0.99

Accuracy: 0.91
F1-Score: 0.90

Table 2 shows the results of classification using the Random Forest algorithm. The accuracy of 91% is the share of all test reviews that the model classified correctly. The F1-score of 90% combines precision and recall. Precision, the share of predictions for a class that are correct, is 84% for the negative class and 92% for the positive class. Recall, the share of a class's actual reviews that the model finds, is 33% for the negative class and 99% for the positive class. Recall for the positive class is thus much higher than for the negative class, which shows that the model is considerably better at recognising positive sentiment than negative sentiment (Dey, Chakraborty, Biswas, Bose, & Tiwari, 2016).
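The figures in Table 2 can be recomputed from the four Random Forest counts reported in the text (696 True Positive, 31 True Negative, 63 False Positive, 6 False Negative). The sketch below does this by hand. The support-weighted averaging of the two per-class F1 values is an assumption: the paper does not say how the single F1-score was averaged, but this scheme reproduces the reported 0.90.

```python
def class_metrics(tp, fp, fn):
    """Precision, recall and F1 for one class, from its own counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Random Forest counts from the text, with "positive" as the positive class.
TP, TN, FP, FN = 696, 31, 63, 6
total = TP + TN + FP + FN  # 796 test reviews

accuracy = (TP + TN) / total
pos_p, pos_r, pos_f1 = class_metrics(tp=TP, fp=FP, fn=FN)
# For the negative class the roles swap: TN are its correct predictions,
# FN are reviews wrongly predicted negative, FP are negatives it missed.
neg_p, neg_r, neg_f1 = class_metrics(tp=TN, fp=FN, fn=FP)

# Support-weighted average F1 (assumed averaging scheme).
pos_support, neg_support = TP + FN, TN + FP
weighted_f1 = (pos_f1 * pos_support + neg_f1 * neg_support) / total

print(f"accuracy    = {accuracy:.2f}")                        # 0.91
print(f"positive    P = {pos_p:.2f}, R = {pos_r:.2f}")        # 0.92, 0.99
print(f"negative    P = {neg_p:.2f}, R = {neg_r:.2f}")        # 0.84, 0.33
print(f"weighted F1 = {weighted_f1:.2f}")                     # 0.90
```

The same function applied to the counts of the other two classifiers gives their table values in the same way.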

Support Vector Machine

Table 3
Support Vector Machine through the Confusion Matrix

                    Predicted Negative    Predicted Positive
Actual Negative     34                    60
Actual Positive     11                    691

Researchers used the Scikit-Learn library to apply Support Vector Machine classification to the data. Of the 796 test reviews, 691 (86.80%) were positive and predicted as positive (True Positive), and 34 (4.27%) were negative and predicted as negative (True Negative). On the other hand, 60 reviews (7.53%) were predicted as positive when they were in fact negative (False Positive), and 11 reviews (1.38%) were predicted as negative when they were in fact positive (False Negative). As with Random Forest, most errors are negative reviews labelled as positive.

Table 4
Results with Support Vector Machine

            Precision    Recall
Negative    0.76         0.36
Positive    0.92         0.98

Accuracy: 0.91
F1-Score: 0.90

Table 4 shows the classification results using the Support Vector Machine algorithm. The accuracy of 91% is the share of all test reviews that the model classified correctly, and the F1-score of 90% combines precision and recall. Precision is 76% for the negative class and 92% for the positive class, while recall is 36% for the negative class and 98% for the positive class. The much higher recall for the positive class indicates that the model recognises positive sentiment far better than negative sentiment.

Naïve Bayes Classifier

Table 5
Naïve Bayes through the Confusion Matrix

                    Predicted Negative    Predicted Positive
Actual Negative     16                    78
Actual Positive     1                     701

Researchers used the Scikit-Learn library to implement Naïve Bayes classification with the Multinomial Naïve Bayes variant, which is designed for multinomially distributed features such as word counts and is applied here to the TF-IDF representation of the text. Of the 796 test reviews, 701 (88.06%) were positive and predicted as positive (True Positive), and only 16 (2.01%) were negative and predicted as negative (True Negative). However, 78 reviews (9.79%) were predicted as positive when they were in fact negative (False Positive), and only 1 review (0.12%) was predicted as negative when it was in fact positive (False Negative). Naïve Bayes therefore misses more negative reviews than the other two classifiers.
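To make the Multinomial Naïve Bayes idea concrete, the following sketch trains a tiny classifier from scratch on made-up, already preprocessed reviews. It works on raw word counts with Laplace smoothing rather than on TF-IDF weights, and it is not the Scikit-Learn implementation used in the study; the toy data, function names, and smoothing value are assumptions for illustration.

```python
import math
from collections import Counter, defaultdict


def train_multinomial_nb(docs, labels, alpha=1.0):
    """Fit class priors and Laplace-smoothed word likelihoods from token counts."""
    vocab = {tok for doc in docs for tok in doc}
    class_docs = Counter(labels)
    word_counts = defaultdict(Counter)
    for doc, label in zip(docs, labels):
        word_counts[label].update(doc)

    model = {"vocab": vocab, "log_prior": {}, "log_likelihood": {}}
    for label, n_docs in class_docs.items():
        model["log_prior"][label] = math.log(n_docs / len(docs))
        total = sum(word_counts[label].values()) + alpha * len(vocab)
        model["log_likelihood"][label] = {
            tok: math.log((word_counts[label][tok] + alpha) / total)
            for tok in vocab
        }
    return model


def predict(model, doc):
    """Return the label with the highest log posterior; unseen words are skipped."""
    scores = {}
    for label, prior in model["log_prior"].items():
        likelihood = model["log_likelihood"][label]
        scores[label] = prior + sum(
            likelihood[t] for t in doc if t in model["vocab"]
        )
    return max(scores, key=scores.get)


# Hypothetical toy reviews, already preprocessed.
docs = [
    ["helpful", "accurate"],
    ["good", "helpful"],
    ["login", "error"],
    ["wrong", "answer", "error"],
]
labels = ["positive", "positive", "negative", "negative"]

model = train_multinomial_nb(docs, labels)
print(predict(model, ["helpful", "good"]))  # positive
print(predict(model, ["error", "login"]))   # negative
```

Working in log space avoids multiplying many small probabilities, and the smoothing term `alpha` keeps a word that never appeared in one class from forcing that class's probability to zero.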

Table 6
Results with Naïve Bayes

            Precision    Recall
Negative    0.94         0.17
Positive    0.90         1.00

Accuracy: 0.90
F1-Score: 0.87

Table 6 shows the results of classification with the Naïve Bayes algorithm. The accuracy of 90% is the share of all test reviews that the model classified correctly, and the F1-score of 87% combines precision and recall. Precision is 94% for the negative class and 90% for the positive class, while recall is only 17% for the negative class and reaches 100% (after rounding) for the positive class. The near-perfect positive recall shows that the model recognises positive sentiment very well, but the low negative recall shows that it finds few of the negative reviews.

The following is a combination of the results of each algorithm classification regarding the sentiment

data analysis method that has been carried out.

Table 7
Overall algorithm classification results

Algorithm                 Accuracy   Positive              Negative              Confusion Matrix         F1-Score
                                     Precision   Recall    Precision   Recall    (TP / TN / FP / FN)
Random Forest             91%        92%         99%       84%         33%       696 / 31 / 63 / 6        90%
Support Vector Machine    91%        92%         98%       76%         36%       691 / 34 / 60 / 11       90%
Naïve Bayes               90%        90%         100%      94%         17%       701 / 16 / 78 / 1        87%

Word Cloud

A word cloud is a visual representation of text, where the font size signifies how often the word appears. Here

is a word cloud that visualises data with their respective sentiment labels. Figure 10 shows a word cloud with a positive

sentiment, while Figure 11 shows a negative sentiment.

Figure 10 Positive Word Cloud

Figure 11 Negative Word Cloud

Figure 10 shows words associated with positive sentiment, such as "help", "good", "cool", "thank you", "steady", and "accurate", which suggests that most positive reviewers were enthusiastic about ChatGPT as a new product. Figure 11 shows words associated with negative sentiment, such as "please", "wrong", "error", "login", "accurate", "different", and "answer", which indicates that a number of reviews express dissatisfaction with ChatGPT (Farid, Enri, & Umaidah, 2021).
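Since the font size in a word cloud reflects how often each word appears, the underlying data are simply word counts per sentiment label. The sketch below computes those counts with the standard library; the sample reviews are made up, and the paper does not say which tool was used to draw the word clouds.

```python
from collections import Counter


def word_frequencies(token_lists):
    """Count how often each token appears across all reviews of one sentiment."""
    counts = Counter()
    for tokens in token_lists:
        counts.update(tokens)
    return counts


# Hypothetical preprocessed reviews grouped by sentiment label.
reviews_by_label = {
    "positive": [["help", "good"], ["good", "accurate"], ["help", "good", "cool"]],
    "negative": [["login", "error"], ["wrong", "answer", "error"]],
}

for label, token_lists in reviews_by_label.items():
    top = word_frequencies(token_lists).most_common(2)
    print(label, top)
```

The most frequent words per label are the ones that would be drawn largest in the corresponding word cloud.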

4. Conclusion

Based on this study, of a total of 2652 reviews of ChatGPT collected from July 28, 2023, to January 28, 2024, 2326 (87.71%) were positive and 326 (12.29%) were negative. This shows that users tend to respond positively to ChatGPT based on ratings on the Google Play Store. Because the data were imbalanced, the researchers used the f1-score as the main evaluation measure, as it reflects how well a model handles the minority class. The three algorithms were evaluated on testing data of 796 reviews (30% of the 2652 reviews). Random Forest obtained an f1-score of 90%, with 87.43% of the test data correctly classified as positive and 3.89% correctly classified as negative. Support Vector Machine obtained an f1-score of 90%, with 86.80% correctly classified as positive and 4.27% correctly classified as negative. Naïve Bayes obtained an f1-score of 87%, with 88.06% correctly classified as positive and 2.01% correctly classified as negative. The results show that Naïve Bayes performed below Support Vector Machine and Random Forest, which handled the data more accurately and obtained equally high f1-scores in this sentiment analysis. Overall, the public response to the ChatGPT application tends to be positive, as reflected in the many positive reviews given to ChatGPT on the Google Play Store.

5. References

Adrian, Muhammad Rivza, Putra, Muhammad Papuandivitama, Rafialdy, Muhammad Hilman, & Rakhmawati,

Nur Aini. (2021). Perbandingan Metode Klasifikasi Random Forest dan SVM Pada Analisis Sentimen

PSBB. Jurnal Informatika Upgris, 7(1). https://doi.org/10.26877/jiu.v7i1.7099

Dey, Lopamudra, Chakraborty, Sanjay, Biswas, Anuraag, Bose, Beepa, & Tiwari, Sweta. (2016). Sentiment

analysis of review datasets using naive Bayes and k-nn classifier. ArXiv Preprint ArXiv:1610.09982.

Farid, Farid, Enri, Ultach, & Umaidah, Yuyun. (2021). Sistem Pendukung Keputusan Rekomendasi Topik Skripsi

Menggunakan Naïve Bayes Classifier. JOINTECS (Journal of Information Technology and Computer

Science), 6(1), 35–42.

Fernández-Gavilanes, Milagros, Álvarez-López, Tamara, Juncal-Martínez, Jonathan, Costa-Montenegro, Enrique,

& González-Castaño, Francisco Javier. (2016). Unsupervised method for sentiment analysis in online

texts. Expert Systems with Applications, 58, 57–75.

Fitri, Evita. (2020). Analisis Sentimen Terhadap Aplikasi Ruangguru Menggunakan Algoritma Naive Bayes,

Random Forest Dan Support Vector Machine. Jurnal Transformatika, 18(1), 71–80.

https://doi.org/10.26623/transformatika.v18i1.2317

Hasibuan, Ernianti, & Heriyanto, Elmo Allistair. (2022). Analisis Sentimen Pada Ulasan Aplikasi Amazon

Shopping Di Google Play Store Menggunakan Naive Bayes Classifier. Jurnal Teknik Dan Science, 1(3), 13–

24. https://doi.org/10.56127/jts.v1i3.434

Klyueva, Irina. (2019). Improving the quality of the multiclass SVM classification based on feature engineering.

2019 1st International Conference on Control Systems, Mathematical Modelling, Automation and Energy

Efficiency (SUMMA), 491–494. IEEE.

Muslimin, Muhammad, & Lusiana, Veronica. (2023). Analisis Sentimen Terhadap Kenaikan Harga Bahan Pokok

Menggunakan Metode Naive Bayes Classifier. JURNAL MEDIA INFORMATIKA BUDIDARMA, 7(3), 1200–

1209.

Oktavia, Dea, Ramadahan, Yudhi Raymond, & Minarto, Minarto. (2023). Analisis Sentimen Terhadap Penerapan

Sistem E-Tilang Pada Media Sosial Twitter Menggunakan Algoritma Support Vector Machine (SVM).

KLIK: Kajian Ilmiah Informatika Dan Komputer, 4(1), 407–417.

Prayoginingsih, Sila, & Kusumawardani, Renny Pradina. (2018). Klasifikasi Data Twitter Pelanggan

Berdasarkan Kategori myTelkomsel Menggunakan Metode Support Vector Machine (SVM). Jurnal Sisfo,

7(02), 83–98.

Ratnawati, Luthfiana, & Sulistyaningrum, Dwi Ratna. (2020). Penerapan random forest untuk mengukur tingkat

keparahan penyakit pada daun apel. Jurnal Sains Dan Seni ITS, 8(2), A71–A77.

Rifaldi, Muhamad Ilmar, Ramadhan, Yudhi Raymond, & Jaelani, Irsan. (2023). Analisis Sentimen Terhadap

Aplikasi Chatgpt Pada Twitter Menggunakan Algoritma Naïve Bayes. J-SAKTI (Jurnal Sains Komputer Dan

Informatika), 7(2), 802–814. https://doi.org/10.30645/j-sakti.v7i2.687

Setiawan, Adi, & Luthfiyani, Ulfah Khairiyah. (2023). Penggunaan ChatGPT untuk pendidikan di era education

4.0: Usulan inovasi meningkatkan keterampilan menulis. JURNAL PETISI (Pendidikan Teknologi

Informasi), 4(1), 49–58.

Tuhuteru, Hennie, & Iriani, Ade. (2018). Analisis Sentimen Perusahaan Listrik Negara Cabang Ambon

Menggunakan Metode Support Vector Machine dan Naive Bayes Classifier. Jurnal Informatika: Jurnal

Pengembangan IT, 3(3), 394–401.

Utami, Herni. (2022). Analisis Sentimen dari Aplikasi Shopee Indonesia Menggunakan Metode Recurrent Neural

Network. Indonesian Journal of Applied Statistics, 5(1), 31–38.