Image Classification of Room Tidiness Using VGGNet with Data Augmentation



I. INTRODUCTION
Tidiness is the quality of an object, place, or situation being neat, orderly, and clean. It encompasses various aspects, such as a room's cleanliness and layout [1]. Ensuring tidiness involves several important factors, as described in the book "The 5S: Five Keys to a Total Quality Environment." For example, individuals must separate necessary items from unnecessary ones, and tidiness is assessed by how well belongings are organized [2]. Attending to these aspects creates a comfortable environment and enhances productivity.
One fundamental aspect of creating a comfortable environment is the cleanliness and tidiness of a room. An article titled "What a Messy Room Says About You" highlights how a room's condition can affect an individual's health and mental well-being: a cluttered room can cause stress and anxiety and hinder productivity and creativity [3]. Room tidiness should therefore be maintained at all times and in various settings, such as office spaces, hotel rooms, and individual rooms within homes. Maintaining cleanliness and tidiness is particularly important in smart buildings, where technological advancements and intelligent systems enable efficient and sustainable management of a comfortable environment [4].
The issue of maintaining tidiness is relevant not only to individuals but also to businesses. Business owners need to prioritize the tidiness of their establishments. However, they may encounter challenges when managing multiple locations, as they must constantly exercise direct control to ensure each place remains hygienic, orderly, and clean. This challenge is even more complex in the hospitality industry, where housekeeping personnel or the owners themselves must inspect each hotel room individually to ensure it is clean and contributes to customer comfort [5]. In addition, an interview with a team member at Fave Hotel, located at Jl. Cimanuk No. 338, Tarogong Kidul District, Garut Regency, revealed that room conditions are still checked manually: general cleaning or housekeeping staff must ensure that each room is tidy and clean.
Given this problem, particularly in the hotel industry, there is a need for a model or system that can readily classify whether a room is tidy. One viable solution is to leverage a technology that has gained momentum in recent years, namely Deep Learning (DL). Deep learning is a subfield of Machine Learning (ML) inspired by how the human brain processes information patterns. Rather than relying on explicitly designed rules, DL uses large amounts of available data to map inputs to specific labels [6]. With DL, a previously manual task can be automated more effectively; for classifying room tidiness, one can employ a Convolutional Neural Network (CNN), a class of DL models designed specifically for image data.
CNN is a family of algorithms and models within Deep Learning for object recognition. The development of CNNs dates back to the 1990s, when LeCun and his colleagues proposed a novel Neural Network (NN) architecture for classifying handwritten digits from images, published at the Neural Information Processing Systems (NeurIPS) conference [7]. Several CNN architectures have since been developed for processing image data, including LeNet, AlexNet, ResNet, DenseNet, VGGNet, and GoogLeNet [8]. Among these, VGGNet is a widely used architecture for image classification that has proven highly effective [9], and it is adopted in this study to address the challenge described above.
Furthermore, in Deep Learning, the quantity of data plays a crucial role in learning patterns. If the available data is limited, the likelihood of the model overfitting increases significantly. Overfitting occurs when the model performs exceptionally well on the training data but fails to generalize to the testing data [6]. Data augmentation techniques can be employed to address this issue. Data augmentation increases the number of training samples by generating new variations of existing training data [10].
This research builds on previous studies. The first study built a model for clean room detection using Support Vector Machine and Neural Network algorithms [11]. The second study developed a hotel room classification system using a CNN with the VGGNet architecture and achieved an accuracy of 92.92% [5]. The third study performed image classification based on object categories, obtaining an accuracy of 93.57% with 18 layers [12]. The fourth study addressed diabetic retinopathy detection through image classification using a Neural Network with the VGGNet-16 architecture, achieving an accuracy of 93.73% [13]. The fifth study classified Garutan batik images using a CNN with data augmentation, achieving its highest accuracy of 91.30% with the ResNet-50 architecture [14]. The last study addressed medical image classification for disease diagnosis using a Convolutional Neural Network with the VGGNet architecture and achieved an accuracy of 92.3% [15]. This research aims to create a modeling framework that can serve as a basis for further model development. The model can be applied in the hospitality industry to improve job performance, and it is built on the VGGNet architecture, which plays a central role in the image classification model. This research represents the original work of the researcher; its distinguishing features are the use of a Convolutional Neural Network, the application of Deep Learning methods to image processing, and the incorporation of data augmentation. The resulting model can be used to develop more complex and practical models.

II. METHOD
Based on the problem described above, the researcher established a research framework, outlined in Fig. 1. The stages in Fig. 1 are as follows:
1) Initial Identification: In the initial identification phase, a literature review was conducted on relevant studies of image classification using Convolutional Neural Networks. During the literature review, previous research was assessed, the issues addressed in prior studies were identified, and discussions were held with the supervisor to obtain feedback and suggestions on the proposed research. Additionally, an interview was conducted with a hotel industry team member to further strengthen the research.
2) Data Collection: In the data collection phase, the researcher gathered data relevant to the study, focusing on a dataset of room images, and then analyzed the collected dataset to enable classification and to prepare for the subsequent pre-processing stage.
3) Pre-processing: The pre-processing stage applies normalization and data augmentation to the dataset obtained during the data collection phase. Normalization transforms the image data into a format that the model can readily interpret. Once normalization is complete, data augmentation introduces variations into the training samples to help prevent overfitting. The image data was taken from several existing sources through electronic data collection. Some images come from a neat and messy room dataset published on Kaggle by a user named Guanqiao Ding, examples of which are shown in Fig. 2. Additional images were collected from several free image provider websites, namely Unsplash, Pixabay, Flickr, Wikimedia Commons, Rawpixel, and Pexels, which host images of both neat and messy rooms. These sites are available at the following addresses: Unsplash (https://unsplash.com), Pixabay (https://pixabay.com), Flickr (https://flickr.com), Wikimedia Commons (https://commons.wikimedia.org), Rawpixel (https://rawpixel.com), and Pexels (https://pexels.com).
4) Classification: The researcher performed classification on the pre-processed dataset. Subsequently, the VGGNet architecture shown in Fig. 3 was implemented and the training process was initiated. A CNN serves as the fundamental architecture: by applying convolution filters, it automatically and adaptively learns hierarchies of features from input images, with the filters detecting patterns such as edges and textures. VGGNet, a specific CNN architecture, gained prominence for its simplicity and effectiveness. It comprises a series of convolutional layers followed by pooling layers, with depth increasing as the network progresses.

5) Testing: The final stage involves testing and evaluating the constructed model using a confusion matrix and several other metrics: precision, recall, and F1-score.
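Stage 4 describes VGGNet as a stack of convolutional layers followed by pooling layers, with depth increasing through the network. The following Python sketch makes this pattern concrete by walking an input image through the standard VGG-16 convolutional configuration (13 convolutional layers with 3 x 3 kernels and 'same' padding, each block followed by 2 x 2 max pooling with stride 2) and counting the convolutional parameters. The 224 x 224 x 3 input size and the function name `vgg16_feature_shapes` are assumptions for illustration; the source does not state the input resolution or the classifier head used, so the sketch covers only the convolutional base.

```python
# Standard VGG-16 convolutional base: filters per conv layer, grouped by block.
# Each block ends with 2x2 max pooling (stride 2), which halves height and width.
VGG16_BLOCKS = [
    [64, 64],
    [128, 128],
    [256, 256, 256],
    [512, 512, 512],
    [512, 512, 512],
]


def vgg16_feature_shapes(height=224, width=224, channels=3):
    """Return per-block output shapes and the total number of conv parameters.

    Assumes 3x3 kernels with 'same' padding (convolution keeps the spatial
    size) and 2x2 max pooling with stride 2 after every block.
    """
    shapes = []
    total_params = 0
    in_channels = channels
    for block in VGG16_BLOCKS:
        for filters in block:
            # 3x3 weights for every (input channel, filter) pair, plus one bias per filter
            total_params += 3 * 3 * in_channels * filters + filters
            in_channels = filters
        height, width = height // 2, width // 2  # max pooling halves the resolution
        shapes.append((height, width, in_channels))
    return shapes, total_params


if __name__ == "__main__":
    shapes, params = vgg16_feature_shapes()
    for i, shape in enumerate(shapes, start=1):
        print(f"block{i}: {shape}")
    print(f"conv parameters: {params:,}")  # conv parameters: 14,714,688
```

The spatial resolution shrinks from 224 to 7 while the channel depth grows from 64 to 512, which is the increasing-depth pattern described above; the total of 14,714,688 is the parameter count of the VGG-16 convolutional base without its fully connected layers.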

A. Initial Identification
In the initial identification phase, several processes were carried out to gain insight into image classification using Convolutional Neural Networks (CNN). The phase began with identifying the research subject, namely the cleanliness of a room. The focus then shifted to defining the research problem through a literature review, which allowed a more targeted approach to the research topic.
After the literature review, the researcher interviewed Muhammad Bagja Sukriyansah, a team member of Fave Hotel located at Jl. Cimanuk No. 338, Tarogong Kidul sub-district, Garut district. The interview revealed that room cleanliness is still checked manually: general cleaning and housekeeping staff must ensure that each room is tidy and clean. General cleaning staff must also ensure that all objects and items, such as bathroom amenities and bedding accessories in the bedroom, are placed in their designated locations, and they must inspect every corner of the room meticulously. After completing this process, they must inform the front office that the room is ready.

B. Data Collection
1) Image Data Collection: The first process is data collection, carried out according to the research theme of room tidiness. The collected data consists of images of various rooms, such as bedrooms, bathrooms, toilets, living rooms, kitchens, and workspaces, in two classes: tidy rooms and messy rooms. The images were obtained from several sources, including Kaggle, Unsplash, Pixabay, Flickr, Wikimedia Commons, Rawpixel, and Pexels. In total, 1130 images were collected: 565 labeled "Tidy room" and 565 labeled "Messy room."
2) Data Analysis: In the data analysis phase, the collected room images are divided into several subsets to train the image classification model. The collected data covers 2 labels, tidy rooms and messy rooms, with 565 images each, for a combined total of 1130 images.
The data is divided into training, validation, and testing sets, with the ratio chosen to suit the needs of this research. Of the total data, 90% is used for training and validation, and this portion is further divided into 60% for training (610 images) and 40% for validation (407 images). The remaining 10% of the total data, 113 images, is used for testing. This ratio is shown in Fig. 5.
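The split arithmetic can be checked with a short Python sketch: 10% of the 1130 images (113) are held out for testing, and the remaining 1017 are divided 60/40 into 610 training and 407 validation images. The shuffling, the fixed seed, the rounding, and the function name `split_dataset` are illustrative assumptions; the source does not describe how the images were assigned to each subset.

```python
import random


def split_dataset(items, test_ratio=0.10, train_ratio=0.60, seed=42):
    """Hypothetical split: hold out a test set, then divide the rest into train/val.

    `train_ratio` is the share of the *non-test* portion used for training;
    the remainder of that portion becomes the validation set.
    """
    items = list(items)
    random.Random(seed).shuffle(items)  # fixed seed for a reproducible split
    n_test = round(len(items) * test_ratio)
    test, rest = items[:n_test], items[n_test:]
    n_train = round(len(rest) * train_ratio)
    return rest[:n_train], rest[n_train:], test


if __name__ == "__main__":
    # Placeholder file names standing in for the 565 + 565 collected images.
    images = [f"tidy_{i}.jpg" for i in range(565)] + [f"messy_{i}.jpg" for i in range(565)]
    train, val, test = split_dataset(images)
    print(len(train), len(val), len(test))  # 610 407 113
```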

C. Pre-processing
After the data analysis stage, the divided dataset enters pre-processing, which consists of data normalization and data augmentation. Preparing the data before the model training phase is essential. Data normalization transforms the data into a standardized format suitable for training, whereas data augmentation generates variations of the training data to increase its diversity and prevent overfitting. Each pre-processing step is described in detail in Fig. 6.
1) Data Normalization: As depicted in Fig. 6, the pixel values in each part of the image range from 0 to 255. Data normalization rescales each pixel value to a consistent range from 0 to 1, which helps improve model performance. An example of the resulting normalized image data is shown in Fig. 7.
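Rescaling by 1/255 maps the 0 to 255 pixel range onto 0 to 1. A minimal pure-Python sketch on a toy grayscale image stored as a nested list (the function name and the 2 x 2 example are illustrative; for RGB images the same division is applied to every channel, and deep-learning frameworks typically do this with a single array operation or a rescaling layer):

```python
def normalize_pixels(image):
    """Rescale 8-bit pixel values (0-255) to floats in [0, 1] by dividing by 255."""
    return [[value / 255.0 for value in row] for row in image]


if __name__ == "__main__":
    image = [[0, 51], [204, 255]]  # toy 2x2 grayscale image
    print(normalize_pixels(image))  # [[0.0, 0.2], [0.8, 1.0]]
```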
2) Data Augmentation: Data augmentation has a significant impact on the training data. These techniques create variations of the original images to generate new image data, increasing the diversity and quantity of training data available to the model. Examples include rotation range, zoom range, width shift range, height shift range, shear range, and horizontal flip. Data augmentation therefore yields more variants of the training data in various positions and conditions, which in turn helps improve the performance and generalization of the model in image classification tasks.
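To illustrate what two of these parameters do to the pixels, the following pure-Python sketch implements a horizontal flip and a width shift on a small image stored as a nested list, plus a random combination of the two. The parameter names echo those listed above (and resemble the arguments of Keras's `ImageDataGenerator`), but the source does not name the library used; the functions, the edge-replicating fill for uncovered pixels, and the default values are assumptions for illustration only.

```python
import random


def horizontal_flip(image):
    """Mirror each row left-to-right."""
    return [list(reversed(row)) for row in image]


def width_shift(image, shift):
    """Shift the image `shift` pixels horizontally (positive = right).

    Uncovered pixels are filled with the nearest edge pixel (assumed fill rule).
    """
    width = len(image[0])
    return [
        [row[min(max(x - shift, 0), width - 1)] for x in range(width)]
        for row in image
    ]


def random_augment(image, width_shift_range=0.2, flip_prob=0.5, rng=random):
    """Apply a random width shift (as a fraction of the width) and an optional flip."""
    max_shift = int(len(image[0]) * width_shift_range)
    out = width_shift(image, rng.randint(-max_shift, max_shift))
    if rng.random() < flip_prob:
        out = horizontal_flip(out)
    return out


if __name__ == "__main__":
    image = [[1, 2, 3, 4], [5, 6, 7, 8]]
    print(horizontal_flip(image))  # [[4, 3, 2, 1], [8, 7, 6, 5]]
    print(width_shift(image, 1))  # [[1, 1, 2, 3], [5, 5, 6, 7]]
    print(random_augment(image, rng=random.Random(0)))
```

Because each call draws fresh random values, the same training image yields a different variant every time it is augmented, which is how augmentation increases the diversity of the training data without collecting new images.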

Fig. 7 Normalized data results
Rotation range is the first augmentation parameter, in which the images are rotated. An example of the augmentation result from the rotation range is shown in Fig. 8.
Zoom range is the second augmentation parameter, in which the images are enlarged. An example of the augmentation result from the zoom range is shown in Fig. 9. Width shift range is the third augmentation parameter, in which the images are shifted horizontally. An example of the augmentation result from the width shift range is shown in Fig. 10.
Height shift range is the fourth augmentation parameter, in which the images are shifted vertically. An example of the augmentation result from the height shift range is shown in Fig. 11. Shear range is the fifth augmentation parameter, in which the images are sheared, giving them a parallelogram-like shape. An example of the augmentation result from the shear range is shown in Fig. 12. Horizontal flip is the sixth augmentation parameter, in which the images are flipped horizontally. An example of the augmentation result from the horizontal flip is shown in Fig. 13.

D. Classification
As shown in Table I, the model stopped at epoch 29 because it reached the predefined target specified by the callback. In each epoch, the model's accuracy increased and its loss decreased. Higher accuracy indicates better prediction performance, while lower loss indicates that the model has learned the patterns in the data more effectively; minimizing the loss reduces errors during training. However, accuracy and loss values are not the sole criteria for evaluating the training results: the training and validation curves can also be analyzed to assess the model's performance. These training results are depicted in Fig. 16, which shows stable training behavior. In terms of model accuracy, the gap between training accuracy and validation accuracy remains small across epochs, indicating that the model performs consistently on both the training and validation sets. Similarly, the difference between training loss and validation loss remains small and stable throughout training, indicating that the model effectively minimizes errors while maintaining consistent performance.
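The text states only that training stopped at epoch 29 once a predefined target set in a callback was reached; the monitored metric and the target value are not given. The following Python sketch illustrates the general logic of such a target-based stopping rule, using a hypothetical accuracy target of 0.98 and a made-up accuracy history (not the values from Table I). In Keras, the same rule would typically be written as a custom `tf.keras.callbacks.Callback` whose `on_epoch_end` method sets `self.model.stop_training = True`.

```python
def epoch_reaching_target(history, target=0.98):
    """Return the 1-based epoch at which `history` first meets `target`.

    `history` holds one metric value per epoch (e.g. accuracy). A target-based
    callback would stop training at this epoch; None means the target was never
    reached, so training would run for all scheduled epochs.
    """
    for epoch, value in enumerate(history, start=1):
        if value >= target:
            return epoch
    return None


if __name__ == "__main__":
    # Hypothetical per-epoch accuracies (illustrative only, not Table I).
    accuracies = [0.62, 0.75, 0.86, 0.93, 0.981, 0.99]
    print(epoch_reaching_target(accuracies))  # 5
```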

E. Testing
Testing is the final phase of this research, in which the trained model is evaluated using several commonly used metrics to assess its performance. This evaluation determines whether the model is ready for further development or application and whether it can be published and used.
During testing, the model is presented with images from the test set that were not seen during training. The model then generates predictions for these unseen samples, and the predictions are compared against the actual labels (ground truth) of the test data to assess the model's performance.
The first test uses images of tidy rooms; the model's predictions are shown in Fig. 17. The second test uses images of messy rooms; the model's predictions are shown in Fig. 18. After testing directly on the image data, the final step is to examine the model's performance using a confusion matrix, one of the standard ways to evaluate a classification model. The confusion matrix for the model is shown in Fig. 19.
Fig. 19 shows that the model correctly predicted all 113 test instances: all 57 images of tidy rooms were predicted as tidy, and all 56 images of messy rooms were predicted as messy, with no prediction errors. In addition to the confusion matrix, the model's performance is evaluated using precision, recall, and F1-score. The performance metrics, including accuracy, precision, recall, and F1-score, are shown in Fig. 20. In the tidy room category, precision, recall, and F1-score all reached 100%, and the same holds for the messy room category. These metrics were also calculated manually, as summarized below.
Accuracy is calculated as the sum of True Negatives (TN) and True Positives (TP) divided by the total number of samples. Here, 57 (TN) + 56 (TP) divided by 57 (TN) + 0 (FP) + 0 (FN) + 56 (TP) gives 113/113 = 1.00, or 100%. An accuracy of 100% indicates that the model classified every sample in the evaluation set correctly.
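In formula form, with messy rooms as the positive class (TP = 56) and tidy rooms as the negative class (TN = 57), as in the calculation above:

```latex
\[
\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}
= \frac{56 + 57}{56 + 57 + 0 + 0}
= \frac{113}{113} = 1.00
\]
```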
Messy room evaluation. Precision measures the proportion of correctly identified positive cases among all cases classified as positive. Precision is 1.00 (100%), meaning that every sample classified as a messy room is indeed messy. Recall (also known as sensitivity) measures the proportion of actual positive cases that the model correctly identifies. Recall is 1.00 (100%), meaning that the model identifies every messy room in the test set. The F1-score is the harmonic mean of precision and recall, providing a single metric that balances the two. The F1-score is 1.00, indicating a perfect balance between precision and recall in identifying messy rooms.
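Written out with the same counts (messy rooms as the positive class, so TP = 56, FP = 0, FN = 0):

```latex
\[
\text{Precision} = \frac{TP}{TP + FP} = \frac{56}{56 + 0} = 1.00,
\qquad
\text{Recall} = \frac{TP}{TP + FN} = \frac{56}{56 + 0} = 1.00
\]
\[
F_1 = 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}
= 2 \cdot \frac{1.00 \cdot 1.00}{1.00 + 1.00} = 1.00
\]
```

For the tidy room class the roles are swapped: the 57 tidy images become the positive cases, and the same formulas again give 57/57 = 1.00.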
The tidy room evaluation yields the same results as the messy room evaluation. Precision, the proportion of samples classified as tidy that are actually tidy, is 1.00 (100%), indicating that every sample classified as a tidy room is correctly identified. Recall (sensitivity), the model's ability to identify actual positive cases (here, tidy rooms), is 1.00 (100%), meaning that the model identifies all tidy rooms in the test set. The F1-score, the harmonic mean of precision and recall, is 1.00, indicating strong performance in identifying tidy rooms.
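The per-class figures can be reproduced from the label counts behind the confusion matrix with a short Python sketch. Each class is treated in turn as the positive class (one-vs-rest). The function names, the string labels, and the rule of reporting 0.0 for a zero denominator are illustrative assumptions, not the tooling used in the study.

```python
def per_class_metrics(y_true, y_pred, labels):
    """Compute precision, recall, and F1-score for each label (one-vs-rest).

    Returns {label: (precision, recall, f1)}; a ratio with a zero denominator
    is reported as 0.0.
    """
    def safe_div(num, den):
        return num / den if den else 0.0

    report = {}
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        precision = safe_div(tp, tp + fp)
        recall = safe_div(tp, tp + fn)
        f1 = safe_div(2 * precision * recall, precision + recall)
        report[label] = (precision, recall, f1)
    return report


def accuracy(y_true, y_pred):
    """Fraction of samples whose predicted label equals the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


if __name__ == "__main__":
    # Counts from the confusion matrix in Fig. 19: 57 tidy and 56 messy test
    # images, all predicted correctly.
    y_true = ["tidy"] * 57 + ["messy"] * 56
    y_pred = list(y_true)
    print(accuracy(y_true, y_pred))  # 1.0
    print(per_class_metrics(y_true, y_pred, ["tidy", "messy"]))
    # {'tidy': (1.0, 1.0, 1.0), 'messy': (1.0, 1.0, 1.0)}
```

A single error would change the picture: if one messy image were predicted as tidy, messy recall would drop to 55/56 and tidy precision to 57/58, while the other two per-class values would stay at 1.0.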

IV. CONCLUSION
Based on the findings of this research, the Convolutional Neural Network (CNN) method combined with data augmentation can be used to classify room tidiness from images. Data augmentation plays a crucial role here, because the model needs a large and varied set of training data to learn from. By increasing the variation in the image data, data augmentation helps prevent overfitting and improves performance. This research has produced a model capable of recognizing room tidiness using a CNN architecture, specifically VGG-16, with an accuracy of 98.44%. In summary, this research contributes to the development of image classification. The results can be further developed and applied in the hospitality industry, particularly to assist housekeeping or general cleaning personnel in assessing whether rooms are ready for customers. However, this research still has room for improvement. The scope of this study was limited to the testing phase, and additional data is needed to enhance the model's ability to learn a broader range of room tidiness patterns. Future work could include designing and implementing software for direct user interaction or exploring CNN architectures other than VGGNet.

Fig. 1 Stages of the method carried out
Implementing the VGGNet architecture is a crucial component of the Convolutional Neural Network (CNN) approach, as shown in Fig. 4. Once constructed, the architecture was trained, resulting in a model ready for testing and image classification.

Fig. 5 Ratio for the division of image data

Fig. 17 Tidy room model prediction results