International Journal of Engineering Business
and Social Science
Vol. 2 No. 04, March-April 2024, pages: 1194-1204
e-ISSN: 2980-4108, p-ISSN: 2980-4272
https://ijebss.ph/index.php/ijebss
Sentiment Analysis on ChatGPT App Reviews on Google Play
Store Using Random Forest Algorithm, Support Vector Machine
and Naïve Bayes
Gilbert Jeffson Sagala¹*, Yusran Timur Samuel²
Universitas Advent Indonesia Bandung, Indonesia
Email: gilbertjeffsonsagala@gmail.com¹*, y.tarihoran@unai.edu²
Keywords: ChatGPT; Random Forest; Support Vector Machine; Naïve Bayes; Google Play Store.
Abstract
This study conducts a sentiment analysis of ChatGPT app reviews on the Google Play Store using three classification methods: Random Forest, Support Vector Machine (SVM), and Naïve Bayes. Its main purpose is to describe and understand user sentiment towards the application. Of 2652 reviews of ChatGPT collected from July 28, 2023, to January 28, 2024, 2326 (87.71%) were positive and 326 (12.29%) were negative, which means that users responded predominantly positively to ChatGPT based on Google Play Store ratings. Because the data are imbalanced, the researchers used the f1-score to compare the methods, as it reflects how well a model handles the minority class. The three algorithms were evaluated on testing data of 796 reviews (30% of the 2652 reviews). Random Forest obtained an f1-score of 90%, with 87.43% of the test data correctly classified as positive and 3.89% correctly classified as negative; Support Vector Machine obtained an f1-score of 90%, with 86.80% correctly classified as positive and 4.27% correctly classified as negative; and Naïve Bayes obtained an f1-score of 87%, with 88.06% correctly classified as positive and 2.01% correctly classified as negative. It can therefore be concluded that users of the ChatGPT application reported a predominantly positive experience, and that Support Vector Machine and Random Forest were the most effective methods in this study, as shown by their highest f1-score.
© 2024 by the authors. Submitted
for possible open-access publication
under the terms and conditions of the Creative Commons Attribution (CC BY SA)
license (https://creativecommons.org/licenses/by-sa/4.0/).
1. Introduction
The rapid development of computer-based information technology significantly impacts changes in various
aspects of human life. Artificial Intelligence is the latest technology product resulting from rapid technological
advances. Artificial Intelligence allows computers to carry out many tasks that humans do, making it a widely utilised
technology product in application development today. This makes it easier for humans to meet their various needs
(Rifaldi, Ramadhan, & Jaelani, 2023).
In November 2022, the AI research lab OpenAI launched a chatbot application called ChatGPT. This chatbot uses natural language processing to respond to questions that users type as text (prompts) in the application. ChatGPT attracted wide attention because its answers are well structured, the relationships between words and sentences are coherent, the accuracy is quite good, and it can remember earlier turns of a conversation. As of November 2023, ChatGPT had 100 million weekly active users (Setiawan & Luthfiyani, 2023).
Reviews express a person's assessment of a product or service. Sentiment analysis helps us understand what
customers think through their reviews about the product or service. These reviews can be a valuable source of
information for consumers. For example, before buying a product, most people look for reviews about the product to
help them make decisions (Hasibuan & Heriyanto, 2022). As a digital platform, Google Play Store allows users to
share their experiences through app reviews.
A study by Adrian et al. (Adrian, Putra, Rafialdy, & Rakhmawati, 2021) analysed sentiment towards PSBB (Indonesia's large-scale social restrictions) by comparing the Random Forest and Support Vector Machine classification methods. The Random Forest algorithm reached an accuracy of 58%, with precision, recall, and f1-score values of 35%, 58%, and 44%, respectively, while the Support Vector Machine algorithm reached an accuracy of 56%, with precision, recall, and f1-score values of 38%, 56%, and 44%, respectively. The performance of both algorithms was considered low because the dataset was very limited, consisting of only 466 tweets (Ratnawati & Sulistyaningrum, 2020).
Research by Fitri et al. (Fitri, 2020) on sentiment towards the Ruangguru application, using the Naïve Bayes, Random Forest, and Support Vector Machine classification methods, found that the Random Forest model had the highest accuracy, 97.16%, with an AUC of 0.996. The Support Vector Machine algorithm showed an accuracy of 96.01% with an AUC of 0.543, while the Naïve Bayes algorithm had the lowest accuracy, 94.16%, with an AUC of 0.999 (Muslimin & Lusiana, 2023). Based on these test results, Random Forest achieved higher accuracy than the other two algorithms (Fernández-Gavilanes, Álvarez-López, Juncal-Martínez, Costa-Montenegro, & González-Castaño, 2016).
Based on this background, the authors chose the title "Sentiment Analysis on ChatGPT Application Reviews on the Google Play Store Using the Random Forest Algorithm Method, Support Vector Machine and Naïve Bayes". This study aims to measure the accuracy of each of the three classification methods and to compare them (Prayoginingsih & Kusumawardani, 2018).
2. Materials and Methods
This study is an experiment in sentiment analysis of ChatGPT reviews that applies the Random Forest, Support Vector Machine, and Naïve Bayes classification models. The stages are dataset retrieval, data labelling, text preprocessing, term weighting, algorithm implementation, classification results, and evaluation (Fitri, 2020).
Figure 1 Stages of Sentiment Analysis of ChatGPT Reviews
Sentiment Analysis
Sentiment analysis is a technique for recognising the opinion or feeling conveyed in a text or document and classifying that opinion as positive or negative. It evaluates various aspects of language to help an institution or company understand positive and negative opinions about the products it provides (Tuhuteru & Iriani, 2018).
Sentiment itself can be understood as the emotion expressed in the content of a text, which can be processed to extract the opinions and sentiments of many people. In sentiment analysis, three polarities can guide agencies or companies seeking information about product quality: positive, negative, and neutral (Klyueva, 2019).
Sentiment analysis is a relatively new research area in Natural Language Processing (NLP) that aims to detect subjectivity in texts or documents in order to classify opinions or sentiments. Three approaches are generally applied to sentiment classification: machine learning, lexicon-based, and hybrid. Today, sentiment analysis often uses machine learning because of its ability to predict sentiment polarity from prepared data.
Dataset Collection
For the sentiment analysis, data were collected from reviews of the ChatGPT app on the Google Play Store. The data were retrieved by scraping with the Python library Google Play Scraper. The dataset consists of 2652 text reviews, sorted by most recent, covering the six months from July 28, 2023, to January 28, 2024.
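The collection step could be sketched as follows. This is a minimal illustration, assuming the `google-play-scraper` package and its `reviews` function with `Sort.NEWEST`. The app ID `com.openai.chatgpt`, the `lang` and `country` values, the `count`, and the helper names `fetch_newest_reviews` and `within_study_window` are assumptions made for illustration; the study does not state them.

```python
from datetime import datetime

# Study window taken from the text: July 28, 2023 to January 28, 2024.
START = datetime(2023, 7, 28)
END = datetime(2024, 1, 28, 23, 59, 59)


def fetch_newest_reviews(app_id="com.openai.chatgpt", count=3000):
    """Fetch the newest reviews (needs network access and google-play-scraper)."""
    # Imported here so the date filter below also works without the package.
    from google_play_scraper import Sort, reviews

    result, _continuation_token = reviews(
        app_id,            # assumption: package name of the ChatGPT app
        lang="id",         # assumption: review language
        country="id",      # assumption: store country
        sort=Sort.NEWEST,  # most recent reviews first
        count=count,
    )
    return result


def within_study_window(review_rows, start=START, end=END):
    """Keep rows whose 'at' timestamp falls inside the study window."""
    return [row for row in review_rows if start <= row["at"] <= end]


# Offline demonstration with hand-made rows shaped like the scraper output.
sample = [
    {"content": "very helpful", "score": 5, "at": datetime(2023, 8, 1, 10, 0)},
    {"content": "login error", "score": 1, "at": datetime(2024, 2, 3, 9, 30)},
    {"content": "good app", "score": 4, "at": datetime(2023, 7, 27, 23, 0)},
]
kept = within_study_window(sample)
print([row["content"] for row in kept])
```

Only the first sample row falls inside the window, so it is the only one kept. Calling `fetch_newest_reviews()` and passing its result to `within_study_window` would apply the same filter to live data.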
Term Weighting
In this step, each word in a review is given a weight based on its significance in context. In other words, the method converts text into numerical values. The technique used is TF-IDF (Term Frequency-Inverse Document Frequency), which combines the frequency of the term (TF) and the presence of the term in reviews that are irrelevant to the topic (IDF) [1]. The following formula gives the TF value of each word.

$$\mathrm{TF}(t, d) = \frac{\text{number of times word } t \text{ appears in document } d}{\text{total word count in document } d} \quad (1)$$

The IDF value is obtained with the following formula.

$$\mathrm{IDF}(t, D) = \log\left(\frac{\text{number of documents in corpus } D}{\text{number of documents in corpus } D \text{ containing the word } t}\right) \quad (2)$$

After the TF and IDF values are obtained with the previous formulas, TF-IDF can be obtained with the formula below.

$$\text{TF-IDF}(t, d, D) = \mathrm{TF}(t, d) \times \mathrm{IDF}(t, D) \quad (3)$$
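The three formulas can be checked on a toy corpus. The sketch below is a direct, unsmoothed implementation of equations (1) to (3); the natural logarithm and the three-document corpus are assumptions made for illustration, since the text does not state the logarithm base.

```python
import math


def tf(term, doc_tokens):
    """Equation (1): occurrences of the term divided by the document length."""
    return doc_tokens.count(term) / len(doc_tokens)


def idf(term, corpus):
    """Equation (2): log of corpus size over documents containing the term.

    The term must occur in at least one document, otherwise this divides by zero.
    """
    containing = sum(1 for doc in corpus if term in doc)
    return math.log(len(corpus) / containing)


def tf_idf(term, doc_tokens, corpus):
    """Equation (3): product of TF and IDF."""
    return tf(term, doc_tokens) * idf(term, corpus)


# Toy corpus of three tokenised reviews (illustrative, not the study's data).
corpus = [
    ["good", "app", "good", "answer"],
    ["login", "error"],
    ["good", "login"],
]

print(round(tf("good", corpus[0]), 3))               # 2 of 4 tokens
print(round(idf("good", corpus), 3))                 # appears in 2 of 3 documents
print(round(tf_idf("error", corpus[1], corpus), 3))  # rare word, higher weight
```

Library implementations may differ from these plain formulas. For example, scikit-learn's `TfidfVectorizer` applies a smoothed IDF and normalises each document vector by default, so its weights will not match this sketch exactly.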
3. Results and Discussions
In the initial phase, the study began with a dataset of 2652 review data collected from July 28, 2023, to
January 28, 2024.
Figure 2 Dataset Collection
After successfully collecting the dataset, the next step is to clean the data, such as removing emojis,
numbers, and punctuation marks and changing uppercase letters to lowercase.
Figure 3 Data Cleansing and Case Folding
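The cleaning and case-folding step described above could look like the following minimal sketch. The function name `clean_and_fold` and the exact regular expressions are assumptions for illustration, not the authors' code.

```python
import re


def clean_and_fold(text):
    """Lowercase the text and keep only the letters a-z and single spaces."""
    text = text.lower()                       # case folding
    # Replace digits, punctuation, and emojis with spaces. Note that this
    # also drops letters outside a-z, such as accented characters.
    text = re.sub(r"[^a-z\s]", " ", text)
    return re.sub(r"\s+", " ", text).strip()  # collapse repeated whitespace


print(clean_and_fold("ChatGPT is GREAT!!! 10/10 👍"))
```

For the sample review, the digits, punctuation, and emoji are removed and the result is `chatgpt is great`.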
Next, the text is tokenised, that is, split into individual words.
Figure 4 Tokenizing a Dataset
The final preprocessing stage removes words that have no effect on sentiment (stopword removal) and strips affixes from the remaining words (stemming).
Figure 5 Stopword Removal and Stemming
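Tokenising and stopword removal could be sketched as below. The stopword list is a tiny hypothetical one chosen for illustration; the study does not name the list it used. Stemming is not shown because it depends on a language-specific stemmer that the study does not identify.

```python
# Hypothetical, deliberately tiny stopword list (an assumption for illustration).
STOPWORDS = {"the", "is", "a", "and", "to", "it"}


def tokenize(text):
    """Split cleaned, lowercased text into word tokens on whitespace."""
    return text.split()


def remove_stopwords(tokens, stopwords=STOPWORDS):
    """Drop tokens that appear in the stopword list."""
    return [token for token in tokens if token not in stopwords]


tokens = tokenize("the app is helpful and the answer is accurate")
print(remove_stopwords(tokens))
```

With this list, the sample sentence is reduced to `['app', 'helpful', 'answer', 'accurate']`.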
The researchers conducted the sentiment analysis using Google Colab and the Python programming language. The dataset contains 2652 reviews, of which 2326 were labelled positive (87.71%) and 326 negative (12.29%). The researchers split the data into 70% training data (1,856 reviews) and 30% testing data (796 reviews).
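The split sizes can be reproduced with a short shuffle-and-slice sketch. Rounding the test size up and the seed value are assumptions; rounding up is chosen because it reproduces the 1,856 and 796 counts reported above.

```python
import math
import random


def split_70_30(items, test_fraction=0.30, seed=42):
    """Shuffle the items and return (train, test) with the test size rounded up."""
    rng = random.Random(seed)  # the seed is an arbitrary assumption
    shuffled = list(items)
    rng.shuffle(shuffled)
    n_test = math.ceil(test_fraction * len(shuffled))
    return shuffled[n_test:], shuffled[:n_test]


# Label counts from the text: 2326 positive and 326 negative reviews.
labels = ["positive"] * 2326 + ["negative"] * 326
train, test = split_70_30(labels)
print(len(train), len(test))
```

This prints `1856 796`, matching the training and testing sizes in the text.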
Figure 6 Sentiment Chart before SMOTE
The large difference between the number of positive and negative reviews makes the dataset imbalanced, which can cause minority-class reviews to be misclassified as the majority class. Therefore, the researchers use oversampling to balance the data by adding samples to the minority class. The oversampling method used is the Synthetic Minority Oversampling Technique (SMOTE), which addresses problems of imbalanced data and overfitting (Utami, 2022).
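The idea behind SMOTE can be illustrated with a small sketch: each synthetic sample is placed on the line segment between a minority sample and one of its nearest minority neighbours. This is a simplified illustration under assumed parameters (`k`, `seed`, two-dimensional toy points), not the implementation used in the study. Libraries such as imbalanced-learn provide a complete version.

```python
import random


def smote_sketch(minority, n_new, k=2, seed=0):
    """Create n_new synthetic points by interpolating between a minority
    sample and one of its k nearest minority neighbours."""
    rng = random.Random(seed)

    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbours of the base point (excluding itself).
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: sq_dist(base, p),
        )[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append(
            tuple(b + gap * (n - b) for b, n in zip(base, neighbour))
        )
    return synthetic


# Toy feature vectors: 4 minority samples against a majority class of 10.
minority = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
majority_count = 10
new_points = smote_sketch(minority, n_new=majority_count - len(minority))
print(len(minority) + len(new_points))
```

After oversampling, the minority class has as many samples as the majority class. Applied to this dataset, matching the 2326 positive reviews would require 2000 synthetic negative samples in addition to the 326 original ones.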
Figure 7 Sentiment Chart after SMOTE
Figure 7 shows the dataset after SMOTE was applied, which yields a balanced dataset of 2326 positive and 2326 negative reviews. The researchers then fed the data into each classification algorithm and obtained the following results.
Random Forest
Table 1
Random Forest through the Confusion Matrix
| Actual \ Predicted | Negative | Positive |
| --- | --- | --- |
| Negative | 31 | 63 |
| Positive | 6 | 696 |
The researchers used the Scikit-Learn library to apply Random Forest classification to the data. Of the 796 test reviews, 696 (87.43%) were positive reviews correctly predicted as positive (True Positive), indicating that the model identifies positive cases reliably. A further 31 reviews (3.89%) were negative reviews correctly predicted as negative (True Negative). On the other hand, 63 reviews (7.91%) were predicted as positive when they were in fact negative (False Positive), so most of the 94 negative reviews were misclassified. Only 6 reviews (0.75%) were predicted as negative when they were in fact positive (False Negative), indicating that the model misses few positive cases (Oktavia, Ramadahan, & Minarto, 2023).
Table 2
Results with Random Forest
| Class | Recall |
| --- | --- |
| Negative | 0.33 |
| Positive | 0.99 |
Accuracy: 0.91
F1-Score: 0.90
Table 2 shows the classification results of the Random Forest algorithm. The accuracy of 91% is the share of all test data that the model classified correctly. The F1-score of 90% combines precision and recall. Precision is 84% for the negative class and 92% for the positive class, describing how often a predicted label is correct for each class. Recall is 33% for the negative class and 99% for the positive class, describing how many of the actual negative and positive reviews the model identifies. Because the recall of the positive class is much higher than that of the negative class, the model is far better at recognising positive sentiment than negative sentiment (Dey, Chakraborty, Biswas, Bose, & Tiwari, 2016).
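The accuracy, precision, and recall values reported for Random Forest can be recomputed from the four counts in Table 1. The sketch below is an illustrative check, reading the table rows as the actual class and the columns as the predicted class; the helper name `per_class_metrics` is an assumption.

```python
def per_class_metrics(tp, fp, fn):
    """Return (precision, recall) for one class treated as the target class."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return precision, recall


# Table 1 counts (rows = actual class, columns = predicted class).
neg_as_neg, neg_as_pos = 31, 63
pos_as_neg, pos_as_pos = 6, 696
total = neg_as_neg + neg_as_pos + pos_as_neg + pos_as_pos

accuracy = (neg_as_neg + pos_as_pos) / total
prec_pos, rec_pos = per_class_metrics(tp=pos_as_pos, fp=neg_as_pos, fn=pos_as_neg)
prec_neg, rec_neg = per_class_metrics(tp=neg_as_neg, fp=pos_as_neg, fn=neg_as_pos)

print(f"accuracy={accuracy:.2f}")
print(f"positive: precision={prec_pos:.2f} recall={rec_pos:.2f}")
print(f"negative: precision={prec_neg:.2f} recall={rec_neg:.2f}")
```

The computed values (accuracy 0.91, positive precision 0.92 and recall 0.99, negative precision 0.84 and recall 0.33) agree with the figures reported for Random Forest.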
Support Vector Machine
Table 3
Support Vector Machine through the Confusion Matrix
| Actual \ Predicted | Negative | Positive |
| --- | --- | --- |
| Negative | 34 | 60 |
| Positive | 11 | 691 |
The researchers used the Scikit-Learn library to apply Support Vector Machine classification to the data. Of the 796 test reviews, 691 (86.80%) were positive reviews correctly predicted as positive (True Positive), and 34 (4.27%) were negative reviews correctly predicted as negative (True Negative). On the other hand, 60 reviews (7.53%) were predicted as positive when they were in fact negative (False Positive), so most of the 94 negative reviews were still misclassified. Only 11 reviews (1.38%) were predicted as negative when they were in fact positive (False Negative), indicating that the model misses few positive cases.
Table 4
Results with Support Vector Machine
| Class | Recall |
| --- | --- |
| Negative | 0.36 |
| Positive | 0.98 |
Accuracy: 0.91
F1-Score: 0.90
Table 4 shows the classification results of the Support Vector Machine algorithm. The accuracy of 91% is the share of all test data that the model classified correctly. The F1-score of 90% combines precision and recall. Precision is 76% for the negative class and 92% for the positive class, describing how often a predicted label is correct for each class. Recall is 36% for the negative class and 98% for the positive class, describing how many of the actual negative and positive reviews the model identifies. The much higher recall of the positive class indicates that the model is dominated by the positive class: it recognises positive sentiment well but misses most negative reviews.
Naïve Bayes Classifier
Table 5
Naïve Bayes through the Confusion Matrix
| Actual \ Predicted | Negative | Positive |
| --- | --- | --- |
| Negative | 16 | 78 |
| Positive | 1 | 701 |
The researchers used the Scikit-Learn library to implement Naïve Bayes classification with the Multinomial Naïve Bayes variant, which is designed for multinomially distributed features such as text data represented as TF-IDF weights. Of the 796 test reviews, 701 (88.06%) were True Positive (TP), showing that the model identifies positive cases correctly. Only 16 reviews (2.01%) were True Negative (TN), so the model correctly classifies few negative cases. In addition, 78 reviews (9.79%) were predicted as positive when they were in fact negative (False Positive), and only 1 review (0.12%) was predicted as negative when it was in fact positive (False Negative).
Table 6
Results with Naïve Bayes
| Class | Recall |
| --- | --- |
| Negative | 0.17 |
| Positive | 1.00 |
Accuracy: 0.90
F1-Score: 0.87
Table 6 shows the classification results of the Naïve Bayes algorithm. The accuracy of 90% is the share of all test data that the model classified correctly. The F1-score of 87% combines precision and recall. Precision is 94% for the negative class and 90% for the positive class, describing how often a predicted label is correct for each class. Recall is 17% for the negative class and 100% (after rounding) for the positive class, describing how many of the actual negative and positive reviews the model identifies. The near-perfect recall of the positive class, together with the low recall of the negative class, shows that the model is dominated by the positive class: it recognises positive sentiment well but misses most negative reviews.
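The three classifiers can be trained on TF-IDF features with Scikit-Learn along the following lines. This is a minimal sketch on a hand-written toy dataset, not the study's data or code. The hyperparameters (`n_estimators`, `random_state`, the linear SVM kernel) are assumptions, since the study does not report them, and the SMOTE step is omitted here.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

# Toy, hand-written reviews (illustrative only).
train_texts = [
    "very helpful and accurate answers",
    "good app really helpful",
    "cool and good thank you",
    "great answers very good",
    "login error cannot open",
    "wrong answer and error",
    "app keeps crashing login failed",
    "wrong and slow please fix",
]
train_labels = ["positive"] * 4 + ["negative"] * 4
test_texts = ["helpful and good", "login error again"]

# Assumed settings; the study does not state its hyperparameters.
models = {
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=42),
    "Support Vector Machine": SVC(kernel="linear"),
    "Naive Bayes": MultinomialNB(),
}

predictions = {}
for name, classifier in models.items():
    # Each pipeline converts raw text to TF-IDF weights, then classifies.
    pipeline = make_pipeline(TfidfVectorizer(), classifier)
    pipeline.fit(train_texts, train_labels)
    predictions[name] = list(pipeline.predict(test_texts))
    print(name, predictions[name])
```

Wrapping the vectoriser and the classifier in one pipeline keeps the TF-IDF vocabulary fitted on the training data only, which avoids leaking information from the test reviews.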
Table 7 combines the results of the three classification algorithms applied in this sentiment analysis.
Table 7
Overall algorithm classification results
| Algorithm | Accuracy | Positive Precision | Positive Recall | Negative Precision | Negative Recall | TP | TN | FP | FN | F1-Score |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Random Forest | 91% | 92% | 99% | 84% | 33% | 696 | 31 | 63 | 6 | 90% |
| Support Vector Machine | 91% | 92% | 98% | 76% | 36% | 691 | 34 | 60 | 11 | 90% |
| Naïve Bayes | 90% | 90% | 100% | 94% | 17% | 701 | 16 | 78 | 1 | 87% |
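The single F1-score reported per algorithm appears consistent with a support-weighted average of the positive-class and negative-class F1 values. The sketch below recomputes it from the confusion-matrix counts in Tables 1, 3, and 5. The weighted averaging is an assumption inferred from the numbers, as the study does not state which average it used.

```python
def f1(tp, fp, fn):
    """F1 for one class: harmonic mean of precision and recall."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def weighted_f1(tp, tn, fp, fn):
    """Support-weighted mean of the positive-class and negative-class F1."""
    f1_pos = f1(tp=tp, fp=fp, fn=fn)
    f1_neg = f1(tp=tn, fp=fn, fn=fp)  # roles swap for the negative class
    n_pos, n_neg = tp + fn, tn + fp   # actual positives and negatives
    return (n_pos * f1_pos + n_neg * f1_neg) / (n_pos + n_neg)


# Counts from the confusion matrices (positive class as the target class).
counts = {
    "Random Forest": dict(tp=696, tn=31, fp=63, fn=6),
    "Support Vector Machine": dict(tp=691, tn=34, fp=60, fn=11),
    "Naive Bayes": dict(tp=701, tn=16, fp=78, fn=1),
}
scores = {name: weighted_f1(**c) for name, c in counts.items()}
for name, score in scores.items():
    print(f"{name}: {score:.2f}")
```

The recomputed values are 0.90, 0.90, and 0.87, matching the reported F1-scores. Because the positive class makes up most of the test set, the weighted average stays high even though the negative-class F1 is below 0.50 for all three models.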
Word Cloud
A word cloud is a visual representation of text, where the font size signifies how often the word appears. Here
is a word cloud that visualises data with their respective sentiment labels. Figure 10 shows a word cloud with a positive
sentiment, while Figure 11 shows a negative sentiment.
Figure 10 Positive Word Cloud
Figure 11 Negative Word Cloud
Figure 10 shows words associated with positive sentiment, such as "help", "good", "cool", "thank you", "steady", and "accurate", which suggests that most positive reviewers welcomed the launch of ChatGPT as something new. Figure 11 shows words associated with negative sentiment, such as "please", "wrong", "error", "login", "accurate", "different", and "answer", which suggests that some reviews express dissatisfaction with ChatGPT (Farid, Enri, & Umaidah, 2021).
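The word frequencies that determine font size in a word cloud can be computed with a simple counter. The sketch below uses made-up token lists for illustration; a rendering package such as `wordcloud` could then draw the cloud from these frequencies.

```python
from collections import Counter


def word_frequencies(token_lists):
    """Count how often each word appears across all tokenised reviews."""
    counter = Counter()
    for tokens in token_lists:
        counter.update(tokens)
    return counter


# Illustrative tokenised positive reviews (not the study's data).
positive_reviews = [
    ["help", "good", "accurate"],
    ["good", "cool"],
    ["good", "help", "thank"],
]
freq = word_frequencies(positive_reviews)
print(freq.most_common(2))
```

Here "good" appears three times and "help" twice, so they would be drawn largest in the cloud.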
4. Conclusion
Based on this study, of 2652 reviews of ChatGPT collected from July 28, 2023, to January 28, 2024, 2326 (87.71%) were positive and 326 (12.29%) were negative. This shows that people tend to respond positively to ChatGPT based on ratings on the Google Play Store. Because the data were imbalanced, the researchers used the f1-score as the main evaluation metric, since it reflects how well a model handles the minority class. The three algorithms were evaluated on testing data of 796 reviews (30% of the 2652 reviews). Random Forest obtained an f1-score of 90%, with 87.43% of the test data correctly classified as positive and 3.89% correctly classified as negative. Support Vector Machine obtained an f1-score of 90%, with 86.80% correctly classified as positive and 4.27% correctly classified as negative. Naïve Bayes obtained an f1-score of 87%, with 88.06% correctly classified as positive and 2.01% correctly classified as negative. The results show that the Naïve Bayes classification algorithm performed below Support Vector Machine and Random Forest, which handled the data more accurately and achieved an equally high f1-score in this sentiment analysis. Overall, the public's response to the ChatGPT application tends to be positive, as reflected in the many positive reviews given to ChatGPT on the Google Play Store.
5. References
Adrian, Muhammad Rivza, Putra, Muhammad Papuandivitama, Rafialdy, Muhammad Hilman, & Rakhmawati,
Nur Aini. (2021). Perbandingan Metode Klasifikasi Random Forest dan SVM Pada Analisis Sentimen
PSBB. Jurnal Informatika Upgris, 7(1). https://doi.org/10.26877/jiu.v7i1.7099
Dey, Lopamudra, Chakraborty, Sanjay, Biswas, Anuraag, Bose, Beepa, & Tiwari, Sweta. (2016). Sentiment
analysis of review datasets using naive Bayes and k-nn classifier. ArXiv Preprint ArXiv:1610.09982.
Farid, Farid, Enri, Ultach, & Umaidah, Yuyun. (2021). Sistem Pendukung Keputusan Rekomendasi Topik Skripsi
Menggunakan Naïve Bayes Classifier. JOINTECS (Journal of Information Technology and Computer
Science), 6(1), 35–42.
Fernández-Gavilanes, Milagros, Álvarez-López, Tamara, Juncal-Martínez, Jonathan, Costa-Montenegro, Enrique,
& González-Castaño, Francisco Javier. (2016). Unsupervised method for sentiment analysis in online
texts. Expert Systems with Applications, 58, 57–75.
Fitri, Evita. (2020). Analisis Sentimen Terhadap Aplikasi Ruangguru Menggunakan Algoritma Naive Bayes,
Random Forest Dan Support Vector Machine. Jurnal Transformatika, 18(1), 71–80.
https://doi.org/10.26623/transformatika.v18i1.2317
Hasibuan, Ernianti, & Heriyanto, Elmo Allistair. (2022). Analisis Sentimen Pada Ulasan Aplikasi Amazon
Shopping Di Google Play Store Menggunakan Naive Bayes Classifier. Jurnal Teknik Dan Science, 1(3), 13–24. https://doi.org/10.56127/jts.v1i3.434
Klyueva, Irina. (2019). Improving the quality of the multiclass SVM classification based on feature engineering.
2019 1st International Conference on Control Systems, Mathematical Modelling, Automation and Energy
Efficiency (SUMMA), 491–494. IEEE.
Muslimin, Muhammad, & Lusiana, Veronica. (2023). Analisis Sentimen Terhadap Kenaikan Harga Bahan Pokok
Menggunakan Metode Naive Bayes Classifier. JURNAL MEDIA INFORMATIKA BUDIDARMA, 7(3), 1200–1209.
Oktavia, Dea, Ramadahan, Yudhi Raymond, & Minarto, Minarto. (2023). Analisis Sentimen Terhadap Penerapan
Sistem E-Tilang Pada Media Sosial Twitter Menggunakan Algoritma Support Vector Machine (SVM).
KLIK: Kajian Ilmiah Informatika Dan Komputer, 4(1), 407–417.
Prayoginingsih, Sila, & Kusumawardani, Renny Pradina. (2018). Klasifikasi Data Twitter Pelanggan
Berdasarkan Kategori myTelkomsel Menggunakan Metode Support Vector Machine (SVM). Jurnal Sisfo,
7(02), 83–98.
Ratnawati, Luthfiana, & Sulistyaningrum, Dwi Ratna. (2020). Penerapan random forest untuk mengukur tingkat
keparahan penyakit pada daun apel. Jurnal Sains Dan Seni ITS, 8(2), A71–A77.
Rifaldi, Muhamad Ilmar, Ramadhan, Yudhi Raymond, & Jaelani, Irsan. (2023). Analisis Sentimen Terhadap
Aplikasi Chatgpt Pada Twitter Menggunakan Algoritma Naïve Bayes. J-SAKTI (Jurnal Sains Komputer Dan
Informatika), 7(2), 802–814. https://doi.org/10.30645/j-sakti.v7i2.687
Setiawan, Adi, & Luthfiyani, Ulfah Khairiyah. (2023). Penggunaan ChatGPT untuk pendidikan di era education
4.0: Usulan inovasi meningkatkan keterampilan menulis. JURNAL PETISI (Pendidikan Teknologi
Informasi), 4(1), 49–58.
Tuhuteru, Hennie, & Iriani, Ade. (2018). Analisis Sentimen Perusahaan Listrik Negara Cabang Ambon
Menggunakan Metode Support Vector Machine dan Naive Bayes Classifier. Jurnal Informatika: Jurnal
Pengembangan IT, 3(3), 394–401.
Utami, Herni. (2022). Analisis Sentimen dari Aplikasi Shopee Indonesia Menggunakan Metode Recurrent Neural
Network. Indonesian Journal of Applied Statistics, 5(1), 31–38.