Application of Entropy Theory and Principal Component Analysis to Determine Input Variables for Estimating Solar Radiation using Machine Learning Algorithms
Scientific rank: Scientific journal article (Ministry of Science)
Abstract
Solar radiation is a key variable in energy balance models and plant growth simulations. This research investigates the performance of Principal Component Analysis (PCA) and Shannon entropy theory (ENT) in determining the inputs of six machine learning models, namely Random Forest (RF), Linear Regression (LR), Support Vector Regression (SVR), K-Nearest Neighbors (KNN), Decision Tree (DT), and XGBoost (XGB), for estimating solar radiation at the Yazd synoptic station between 2006 and 2023. Daily data for average temperature, minimum temperature, maximum temperature, sunshine hours, relative humidity, and solar radiation were obtained from the Meteorological Organization. Extraterrestrial radiation, the relative Earth-Sun distance, the solar declination angle, and maximum sunshine hours were calculated using existing formulas, and all of these variables were used as inputs to the pre-processing methods. The machine learning algorithms estimated solar radiation with acceptable accuracy. When the dimensionality of the input data was reduced, PCA increased model accuracy. Among the models used, PCA-SVR gave the best result at the Yazd station, with a coefficient of determination of 0.923 and an accuracy of 92.84%. The Shannon entropy method, by contrast, failed to improve the modeling results compared with using the inputs without pre-processing. This analysis shows that dimensionality reduction combined with an appropriate model can lead to greater accuracy and lower computational complexity in prediction problems. However, the pre-processing method for the initial data must be chosen with care.
Extended Abstract
Introduction
The complexity of meteorological and hydrological systems, together with the difficulty of identifying every influential parameter and the scarcity of statistical data, makes complete modeling of these systems impossible. In such conditions, system modeling based on mathematical relationships is of interest. Solar radiation is one of the most important meteorological variables in estimating evapotranspiration and the water requirements of plants, and it is the energy source for all atmospheric and surface processes. Although this variable has been measured in Iran for a relatively long time, the high cost of measuring instruments means that many stations in the country lack a radiometer or pyranometer, or suffer from calibration problems and the accumulation of water and dust on the sensor. Even at weather stations that do measure radiation, there are days when no radiation data are recorded, or when equipment malfunctions and other issues produce unrealistic values outside the expected range. In addition, so many factors affect solar radiation that it is impossible to include all of them in the relevant equations. As a result, only a limited number of these variables can be used to estimate solar radiation with empirical and semi-empirical equations. In recent years, many researchers have turned to data mining methods and mathematical modeling to estimate solar radiation.
 
Methodology
The data used in this research are daily climatic variables measured at the Yazd synoptic station from 2006 to 2023. The station is located at 31.8974° North latitude and 54.3569° East longitude, at an altitude of 1216 meters above sea level. The average solar radiation and average extraterrestrial radiation at the station are 19.35 and 32 megajoules per square meter per day, respectively. The ratio of sunshine hours to maximum possible sunshine hours is 0.75, the average relative humidity is 27%, and the average temperature is 28°C. Data from 2006 to 2014 were used to calibrate the models, and data from 2015 to 2023 were used to evaluate the results. Extraterrestrial radiation and maximum daily sunshine hours, which depend on the geographical latitude and the day number in the Gregorian calendar, were calculated using the relationships presented by Duffie and Beckman (1991). The study compares Principal Component Analysis (PCA) and Shannon entropy (ENT) as methods for determining the input variables of the Random Forest (RF), Linear Regression (LR), Support Vector Regression (SVR), K-Nearest Neighbors (KNN), Decision Tree (DT), and XGBoost (XGB) models. Daily data for mean temperature, minimum temperature, maximum temperature, sunshine hours, relative humidity, and solar radiation were obtained from the Meteorological Organization. Extraterrestrial radiation, relative Earth-Sun distance, solar declination angle, and maximum sunshine hours were calculated using existing relationships, and all of these variables were used as inputs to the pre-processing methods.
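The calculated astronomical inputs can be illustrated with a short sketch. It assumes the widely used FAO-56 forms of the equations for relative Earth-Sun distance, solar declination, sunset hour angle, extraterrestrial radiation, and maximum sunshine hours; these may differ in detail from the Duffie and Beckman (1991) relationships used in the study. The function name `solar_geometry` and the dictionary keys are hypothetical.

```python
import math

# Solar constant in MJ m^-2 min^-1 (FAO-56 value; an assumption of this sketch).
GSC = 0.0820


def solar_geometry(latitude_deg, day_of_year):
    """Return the daily astronomical inputs for a latitude and day number (1-365)."""
    phi = math.radians(latitude_deg)
    angle = 2.0 * math.pi * day_of_year / 365.0

    # Inverse relative Earth-Sun distance (dimensionless).
    dr = 1.0 + 0.033 * math.cos(angle)
    # Solar declination angle (radians).
    delta = 0.409 * math.sin(angle - 1.39)

    # Sunset hour angle (radians); the argument is clamped so that polar
    # latitudes do not raise a math domain error.
    x = -math.tan(phi) * math.tan(delta)
    ws = math.acos(max(-1.0, min(1.0, x)))

    # Extraterrestrial radiation (MJ m^-2 day^-1).
    ra = (24.0 * 60.0 / math.pi) * GSC * dr * (
        ws * math.sin(phi) * math.sin(delta)
        + math.cos(phi) * math.cos(delta) * math.sin(ws)
    )
    # Maximum possible sunshine duration (hours).
    n_max = 24.0 / math.pi * ws

    return {"dr": dr, "delta": delta, "ra": ra, "n_max": n_max}


if __name__ == "__main__":
    # Yazd station latitude from the text, around the June solstice.
    print(solar_geometry(31.8974, 172))
```

Calling the function once per day of the record would yield four of the candidate input columns described above; the measured variables would come from the station data.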
 
Results and discussion
In the training phase, all of the models were well trained and produced acceptable results. In the testing phase, the models also performed satisfactorily on the raw input data (without pre-processing). The coefficient of determination ranged from 0.790 for the KNN model to 0.893 for the SVR model, so in terms of R² all of the algorithms predicted solar radiation well. Considering all evaluation metrics, the Support Vector Regression (SVR) algorithm predicted solar radiation better than the other models, with RMSE = 1.732, MSE = 0.003, MAE = 0.826, R² = 0.893, and an accuracy of 90.75%. When Principal Component Analysis (PCA) was used for dimensionality reduction, the first principal component accounted for approximately 49% of the variance and the second for approximately 36%. Together the first two principal components captured about 85% of the variability in the original data, so these two components were used as inputs to the predictive models. In training, the PCA-DT and ENT-DT models performed best at the Yazd station, with zero mean squared error, zero mean absolute percentage error, and a coefficient of determination of 1.00. In testing, however, the PCA-SVR model outperformed the other methods: with a coefficient of determination of 0.923 and an accuracy of 92.84%, it achieved the best results and the lowest error metrics among the models at the Yazd station. The ENT-DT model, with a coefficient of determination of 0.535 and an accuracy of 79.34%, showed comparatively weak results among the models used at the Yazd station.
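The component-selection step described here can be sketched as follows. This is a minimal illustration, assuming PCA is computed on standardized inputs and that components are retained until their cumulative explained variance reaches a chosen threshold (0.85 here, matching the share reported in the text). The function name `pca_reduce`, its parameters, and the synthetic demonstration data are hypothetical and are not the study's implementation or data.

```python
import numpy as np


def pca_reduce(X, variance_threshold=0.85):
    """Project X onto the fewest principal components reaching the threshold.

    Returns (scores, explained_ratio, k), where k is the number of
    components kept.
    """
    X = np.asarray(X, dtype=float)
    # Standardize each column so that units do not dominate the components.
    Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

    # Eigen-decomposition of the covariance (here, correlation) matrix.
    cov = np.cov(Z, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)

    # eigh returns ascending eigenvalues; reorder to descending.
    order = np.argsort(eigvals)[::-1]
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]

    explained_ratio = eigvals / eigvals.sum()
    cumulative = np.cumsum(explained_ratio)

    # Smallest k whose cumulative share reaches the threshold.
    k = int(np.searchsorted(cumulative, variance_threshold)) + 1
    k = min(k, Z.shape[1])

    scores = Z @ eigvecs[:, :k]
    return scores, explained_ratio, k


if __name__ == "__main__":
    # Synthetic data: five columns driven by two latent factors.
    rng = np.random.default_rng(0)
    a, b = rng.normal(size=200), rng.normal(size=200)
    noise = lambda: 0.01 * rng.normal(size=200)
    X = np.column_stack([a, a + noise(), b, b + noise(), a + b])
    scores, ratio, k = pca_reduce(X)
    print(k, ratio.round(3))
```

The returned `scores` array would then replace the original input columns when fitting a regressor such as SVR.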
 
Conclusion
Given the importance of accurate solar radiation estimation in hydrological phenomena and the need for advanced estimation methods, this research used Principal Component Analysis (PCA) and entropy theory for data pre-processing. The inputs of the estimation models were identified using these two methods. Modeling was performed using the Random Forest (RF), Linear Regression (LR), Support Vector Regression (SVR), K-Nearest Neighbors (KNN), Decision Tree (DT), and XGBoost (XGB) models. The entropy theory results indicated that, at the Yazd station, solar declination angle, minimum temperature, minimum relative humidity, and average relative humidity were the effective variables in estimating solar radiation. PCA, in turn, reduced the input variables to two principal components, and modeling was performed using these two derived inputs. Overall, the PCA-SVR model outperformed the other models in estimating solar radiation, and PCA pre-processing provided better inputs for the estimation models. The Shannon entropy method did not improve the modeling results compared with using the inputs without pre-processing. This analysis shows that dimensionality reduction combined with an appropriate model can lead to higher accuracy and lower computational complexity in prediction problems. However, the pre-processing method for the initial data must be chosen with care. Similar research using new data or different geographical conditions could help to further validate the results.
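The text does not spell out how entropy theory was used to rank the candidate inputs. One common formulation is the Shannon entropy weight method, sketched below as an assumption rather than as the study's procedure: each variable is min-max scaled, converted to proportions, and given a weight proportional to one minus its normalized entropy. The function name `entropy_weights` and the example values are hypothetical.

```python
import math


def entropy_weights(columns):
    """Compute Shannon entropy weights for a dict of name -> list of values.

    All lists must have the same length n > 1. Variables whose values are
    spread more unevenly receive lower entropy and therefore higher weight.
    """
    n = len(next(iter(columns.values())))
    divergence = {}
    for name, values in columns.items():
        lo, hi = min(values), max(values)
        span = hi - lo
        # Min-max scaling; a constant column is treated as uniform.
        scaled = [(v - lo) / span if span else 1.0 for v in values]
        total = sum(scaled)
        proportions = [s / total for s in scaled]
        # Normalized Shannon entropy in [0, 1]; 0 * ln(0) is taken as 0.
        entropy = -sum(p * math.log(p) for p in proportions if p > 0) / math.log(n)
        divergence[name] = 1.0 - entropy

    total_divergence = sum(divergence.values())
    if total_divergence == 0:
        # Every column is uninformative; fall back to equal weights.
        return {name: 1.0 / len(columns) for name in columns}
    return {name: d / total_divergence for name, d in divergence.items()}


if __name__ == "__main__":
    # Illustrative values only, not data from the study.
    demo = {
        "min_temperature": [2.0, 5.0, 11.0, 19.0, 24.0],
        "relative_humidity": [40.0, 38.0, 30.0, 22.0, 15.0],
    }
    print(entropy_weights(demo))
```

Under this formulation, variables would be kept as model inputs when their weight exceeds some cutoff; the cutoff itself is a further modeling choice not stated in the text.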
 
Funding
This research received no funding support.
 
Authors’ Contribution
In this study, the authors' contributions are as follows: Somayeh Soltani-Gardfaramarzi was responsible for the study design, data collection, analysis, writing the initial draft, and final editing of the article, and Mojgan Askarizadeh was responsible for modeling and results.
 
Conflict of Interest
The authors declare no conflict of interest.
 
Acknowledgments
We are grateful to all the scientific consultants of this paper.
                            
                        
                        







