LLM 기반 시계열 예측 하이퍼파라미터 자동 튜닝 프레임워크: 서울시 대기질 데이터 사례연구
LLM-based Hyperparameter Tuning Framework for Time Series Forecasting: A Case Study on Seoul Air Quality Data
Article information
Trans Abstract
Purpose
This study proposes the LLM-based Hyperparameter Tuning for Time Series (LHTT) framework, an automated time series forecasting system utilizing large language models, and validates its effectiveness through Seoul air quality (PM2.5) data analysis. The research aims to automate the entire process from model selection to hyperparameter optimization and result analysis using generative AI.
Methods
Five different time series forecasting models (Exponential Smoothing, ARIMA, Prophet, LSTM, Transformer) were implemented and compared using Seoul PM2.5 data from May 1 to May 31, 2025 (18,600 observations from 25 monitoring stations). Gemma3:27B was utilized for automated hyperparameter tuning through iterative feedback loops. Performance evaluation was conducted using MSE, RMSE, MAE, R², and MAPE metrics.
Results
Among baseline models, Transformer achieved the best performance with RMSE 4.08, R² 0.79, and MAPE 9.53%. However, after LLM-based tuning, the LSTM model achieved superior performance with RMSE 3.70, R² 0.82, and MAPE 9.71%, representing significant improvement over the baseline models. Statistical models also showed dramatic improvements after LLM tuning, with Exponential Smoothing achieving 87.85% reduction in RMSE.
Conclusion
The proposed LHTT framework demonstrates significant potential for improving time series forecasting accuracy while minimizing expert intervention. The automated system successfully generated comprehensive analysis reports and achieved practical prediction accuracy suitable for environmental policy applications, proving the feasibility of end-to-end automation in data science workflows.
1. Introduction
Time series forecasting serves as a core analysis technique across industries including finance, environment, energy, and logistics. Air quality forecasting is an essential component of environmental policy formulation and public health management, making the development of accurate forecasting models socially significant. However, existing time series analysis techniques rely heavily on high-level domain expertise and require substantial manual effort throughout the process, from model selection to hyperparameter configuration.
Recent rapid advances in Large Language Models (LLMs), such as OpenAI's GPT-4, present new possibilities for addressing these limitations. Trained at massive parameter scales, LLMs demonstrate human-level natural language processing and complex problem-solving capabilities, and their applicability is expanding into time series analysis domains that have traditionally depended on statistical methods or machine learning approaches.
Jin et al. (2024) identified the limitations of conventional time series models, which are constrained to specific forecasting tasks and heavily dependent on domain knowledge and sophisticated hyperparameter tuning, arguing that LLMs can introduce a new paradigm to time series forecasting. Against this background, this study proposes a system that comprehensively automates the development process of time series forecasting models through the utilization of generative AI.
The purpose of this study is to propose an automated time series forecasting technique utilizing large language models and to empirically validate its effectiveness through Seoul air quality data. Specifically, we aim to develop a methodology that automates the entire process from data preprocessing, model selection, and hyperparameter optimization to result analysis and report generation, thereby achieving high-level forecasting performance while minimizing expert intervention.
2. Theoretical Background and Prior Research
2.1 Prior Research on Air Quality Prediction
Air quality forecasting research has been conducted extensively across interdisciplinary fields including environmental science, public health, and meteorology. Research on fine particulate matter (PM2.5) has received sustained attention due to the severity of its health impacts and the complexity of prediction. PM2.5 forecasting research in Korea began in earnest in the early 2000s.
From a time series forecasting methodology perspective, Shen et al. (2020) predicted PM2.5 in Seoul using the Prophet model and achieved a mean absolute error of 12.6 μg/m³ in 2019 concentration forecasts through three years of data training. Meanwhile, the COVID-19 pandemic provided new perspectives for air quality research. Han et al. (2020) analyzed PM2.5 concentration changes during Seoul's social distancing period and confirmed a 10.4% decrease in PM2.5 concentrations in a 30-day before-and-after comparison starting February 29, 2020. Internationally, Lee et al. (2020) developed a gradient boosting machine (GBM)-based machine learning model using air quality monitoring data from Taiwan. This model demonstrated superior performance compared to traditional forecasting models, particularly improving the R² value from 0.58 to 0.71 in the Taichung region.
Kim and Jung (2022) compared deep neural network (DNN), random forest (RF), support vector regression (SVR), and long short-term memory (LSTM) models for predicting fine dust concentrations in Seoul, finding that the SVR model showed the best performance with 87.14% prediction accuracy for fine dust grade classification.
Oh and Lim (2023) analyzed the extreme value distribution of PM2.5 concentrations in Seoul by region to identify spatial variability and extreme value characteristics of air quality data. This study provides theoretical background for the high variability and existence of structural change points in the citywide average PM2.5 data addressed in this research by statistically analyzing the uneven distribution of PM2.5 concentrations and extreme value occurrence patterns by region within Seoul. Lee et al. (2023) performed forecasting by integrating fine dust concentration data from Seoul and adjacent regions using the Informer model. Additional utilization of fine dust concentration data from Chungcheongnam-do resulted in improved prediction performance with a mean absolute error (MAE) of 12.13 μg/m³.
Lee et al. (2024) analyzed the causes of high PM2.5 concentration occurrences in Seoul from 2015 to 2021 using machine learning and chemical transport models. They applied meteorological normalization techniques to remove the effects of weather conditions and quantified the impact of policy implementation on PM2.5 changes. Park et al. (2024) developed a convolutional neural network (CNN) model for predicting monthly PM2.5 concentrations in Seoul. This study utilized ERA5 reanalysis data and Korean Polar Prediction System (KPOPS) dynamic forecast data to construct the model with 2008-2018 data and achieved a mean bias error of 0.05 μg/m³, root mean square error of 2.41 μg/m³, and Pearson correlation coefficient of 0.85. Chang et al. (2024) compared various machine learning models using PM2.5 data collected in Bishkek, Kyrgyzstan, and reported that random forest regression models showed higher accuracy (90%) compared to LSTM.
2.2 Time Series Forecasting Models
Time series forecasting utilizes diverse approaches ranging from statistical models to modern machine learning-based models. After Brown (1956) first proposed simple exponential smoothing, Holt (1957) developed this further into double exponential smoothing, which could reflect trends. Subsequently, Winters (1960) added seasonality factors to Holt's method to complete triple exponential smoothing. The Autoregressive Integrated Moving Average (ARIMA) model is a statistical time series forecasting technique systematically established by Box and Jenkins (1970). Box et al. (2015) presented modern interpretations of ARIMA model parameter selection and fitting, proposing methods for selecting appropriate models through parameter estimation using Maximum Likelihood Estimation (MLE) and model comparison based on AIC and BIC. Prophet is an open-source time series forecasting model developed by Taylor and Letham (2018), characterized by its design for easy use by non-experts. The Prophet model combines three elements—trend, seasonal effects, and holiday effects—in additive form as functions of time, utilizing Bayesian prior distributions to prevent overfitting.
In the deep learning field, LSTM has been proven effective for time series prediction by compensating for shortcomings of earlier RNNs such as the vanishing gradient problem (Hochreiter and Schmidhuber, 1997; Jung et al., 2023). In particular, Lee et al. (2020) reported that deep learning techniques such as LSTM and RNN show superior performance compared to traditional statistical models in predictions reflecting time series characteristics, using mining process data for silica concentration prediction. Additionally, Kang et al. (2021) demonstrated that LSTM models applying a Dual Attention structure can significantly improve prediction accuracy over standard LSTM models in crop yield prediction. More recently, the Transformer model proposed by Vaswani et al. (2017) has been utilized for time series prediction, as it can directly model relationships between all time points within a sequence through self-attention mechanisms. Indeed, Kaushik et al. (2021) demonstrated that deep learning models such as LSTM, CNN, and GRU show superior prediction performance compared to traditional statistical models for non-stationary time series data with multiple structural changes and high volatility. This supports the need for models with built-in change point detection, such as Prophet, or deep learning-based models robust to structural changes.
2.3 Factors Affecting Forecasting Model Performance
The performance of machine learning models depends significantly not only on model architecture but also on various hyperparameter configurations. Bergstra and Bengio (2012) theoretically and empirically demonstrated that random search can achieve higher efficiency compared to grid search by exploring more diverse combinations within the same computational resources.
The evolution of hyperparameter tuning has expanded beyond individual model optimization to encompass the concept of automated machine learning (AutoML). Feurer et al. (2015) proposed the auto-sklearn system, which searches for optimal combinations among predefined learning algorithms and preprocessing techniques through Bayesian optimization, while enhancing search efficiency through meta-learning that leverages performance information from similar datasets.
2.4 Prior Research Analysis and Limitations
The convergence field of generative AI and time series analysis has been developing rapidly in recent years, yet considerable limitations still exist in practical applications. Research on the utilization of generative AI in time series analysis has proceeded in three main directions. First, as a direct time series forecasting approach, Gruver et al. (2023) proposed a zero-shot method that performs time series forecasting using pre-trained large language models without additional training. They encoded continuous time series values into character-based strings and converted them into next-token prediction problems, enabling GPT-3 and LLaMA 2 to directly generate future values of time series, achieving performance similar to existing time series-specialized models without separate prediction model training. Second, as LLM-in-the-loop methodology research, Jiang et al. (2025) demonstrated through the TimeXL framework that LLMs can simultaneously improve both performance and interpretability of time series forecasting through prediction, reflection, and improvement stages. This presented a new paradigm that goes beyond simply using LLMs as information generation tools to actively utilize LLMs as core components in model optimization processes. Third, as automated hyperparameter optimization research, Liu et al. (2024) showed through the AgentHPO system that LLMs can automatically perform hyperparameter optimization to obtain efficient and interpretable results without human expert intervention. This presented flexible optimization strategies utilizing LLM domain knowledge and reasoning capabilities, beyond traditional AutoML's fixed algorithm search approaches.
Comprehensive analysis of existing research reveals the following major limitations. First, the scope of model application is limited. While Gruver et al. (2023)'s zero-shot approach is innovative, it does not address fine-grained optimization tailored to individual model characteristics, and most research is confined to specific model groups, lacking integrated comparison and optimization encompassing both statistical and deep learning models. Second, systematic automation is absent. Existing research focuses on hyperparameter optimization or prediction performance improvement, but complete automation of the entire analysis pipeline (from data collection through preprocessing, model training, performance evaluation, and result interpretation to report generation) remains a research gap. Third, empirical validation is limited. Many studies remain at the level of conceptual proposals or experiments on limited benchmark data, lacking performance verification on complex time series data that reflects actual domain characteristics. For example, Lee et al. (2020)'s mining data research and Han and Yu (2023)'s Baltic Dry Index prediction research remain confined to domain-specific empirical analysis, limiting verification of general applicability to diverse time series problems. In particular, as Jin et al. (2024) pointed out, general solutions remain inadequate for the limitations of existing time series models, which are confined to specific prediction tasks and heavily dependent on domain knowledge and sophisticated hyperparameter tuning. Fourth, there are balance issues between interpretability and practicality. There is insufficient verification of the consistency between the logical rationale that LLM-based approaches provide for suggested improvements and the actual performance gains, as well as of the stability and reproducibility of such systems in practical environments.
This study proposes a comprehensive time series forecasting automation framework utilizing an LLM-in-the-loop approach to overcome these limitations. Specifically, it applies intelligent hyperparameter tuning using Gemma3:27B to time series forecasting models ranging from statistical models to state-of-the-art deep learning models, realizing customized optimization that considers model-specific characteristics. Additionally, it constructs an end-to-end system that provides expert-level analysis results while minimizing data scientist intervention by completely automating the analysis process, from data collection and preprocessing through model training, performance evaluation, and result interpretation to report generation. Through this approach, we aim to overcome the limitations of prior research, simultaneously achieving improved time series forecasting accuracy and complete automation of the analysis process, and to present a general-purpose time series analysis solution immediately applicable in practice.
3. Research Methods
3.1 LLM-based Hyperparameter Tuning for Time Series (LHTT)
This study proposes the LLM-based Hyperparameter Tuning for Time Series (LHTT) model for time series forecasting automation utilizing generative AI. LHTT is an integrated framework that automates the entire process from selection of various time series forecasting models through automatic tuning to performance evaluation, utilizing the Gemma3:27B large language model as a core component. LHTT is designed around four layers as shown in Figure 1: data preparation module, forecasting model learning module, LLM prompt module, and evaluation-feedback module.
3.2 Data Preparation Module
The data preparation module handles the complete workflow from external data collection to preprocessing for model-ready data preparation. For this study, fine particulate matter (PM2.5) concentration data from Seoul air quality information was selected as the analysis target. As shown in Table 1, the data collection period was approximately one month, from May 1, 2025, to May 31, 2025, utilizing a total of 18,600 hourly time series observations from 25 monitoring stations across Seoul (25 stations × 31 days × 24 hours).
The collected air quality data underwent multiple preprocessing steps to be processed into a form suitable for model training. Time information was converted to Python's datetime format and then set as an index suitable for time series analysis. Missing values were replaced using time-based interpolation methods, and outliers were detected according to statistical criteria and processed by removing or replacing them as necessary.
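The preprocessing steps above can be sketched in pandas. The column names and the IQR clipping rule below are illustrative assumptions; the paper specifies only "time-based interpolation" and "statistical criteria" for outliers, not the exact implementation.

```python
import pandas as pd
import numpy as np

def preprocess_pm25(df: pd.DataFrame) -> pd.DataFrame:
    """Illustrative preprocessing for hourly PM2.5 readings.
    Assumes columns 'datetime' and 'pm25' (placeholder names)."""
    df = df.copy()
    df["datetime"] = pd.to_datetime(df["datetime"])
    df = df.set_index("datetime").sort_index()

    # Time-based interpolation for missing values, as described in the text.
    df["pm25"] = df["pm25"].interpolate(method="time")

    # A simple IQR rule as one possible "statistical criterion" for outliers;
    # flagged values are replaced by clipping to the IQR bounds.
    q1, q3 = df["pm25"].quantile([0.25, 0.75])
    iqr = q3 - q1
    df["pm25"] = df["pm25"].clip(lower=q1 - 1.5 * iqr, upper=q3 + 1.5 * iqr)
    return df

# Example: a short hourly series with one gap.
raw = pd.DataFrame({
    "datetime": pd.date_range("2025-05-01", periods=5, freq="h"),
    "pm25": [18.0, np.nan, 22.0, 21.0, 19.5],
})
clean = preprocess_pm25(raw)
```

With evenly spaced hourly timestamps, the interpolated gap takes the midpoint of its neighbors.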
To improve the performance of time series forecasting models, data was split into training and test sets while maintaining temporal order, with generally 80% used for training and 20% for performance validation. Additionally, normalization was applied to adjust the scale of all features to ensure model training stability.
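The chronological split and scaling can be expressed as follows. Fitting the scaler on the training portion only is a standard leakage precaution; the study does not state which scaler it used, so the min-max choice here is an assumption.

```python
import numpy as np

def temporal_split_and_scale(values, train_ratio=0.8):
    """Chronological 80/20 split with min-max scaling fit on the
    training portion only (avoids test-set leakage). A sketch of the
    procedure described above."""
    values = np.asarray(values, dtype=float)
    split = int(len(values) * train_ratio)
    train, test = values[:split], values[split:]

    lo, hi = train.min(), train.max()

    def scale(x):
        return (x - lo) / (hi - lo)

    return scale(train), scale(test), (lo, hi)

train_s, test_s, (lo, hi) = temporal_split_and_scale(np.arange(100.0))
```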
3.3 Model Learning Module
The forecasting model learning module performs dynamic generation and training of various time series forecasting algorithms. Through model factory functionality, models such as Exponential Smoothing, ARIMA, Prophet, LSTM, and Transformer can be dynamically generated, and each model is retrained reflecting hyperparameter settings proposed by the LLM and performs forecasting for test intervals. LHTT supports five forecasting models encompassing both statistical-based and deep learning-based models considering the characteristics of time series data (Table 2).
Exponential Smoothing was implemented as triple exponential smoothing reflecting 24-hour periodicity, using the Holt-Winters implementation of the statsmodels library. The ARIMA model utilizes the auto_arima function of the pmdarima library to search for optimal orders, with an ARIMA(1,0,1) structure as the default setting due to data period constraints. The Prophet model was implemented using Facebook's fbprophet library, activating only daily seasonality and setting changepoint_prior_scale to 0.1 to effectively capture abrupt changes. The LSTM model was designed as a single-layer structure with 24-hour input sequences and a dropout of 0.2, based on TensorFlow-Keras. The Transformer model was implemented as an encoder-centric architecture based on PyTorch, configured with 2 encoder blocks, 4-head multi-head attention, and a model dimension of 64.
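The model-factory mechanism described above can be sketched as follows. The stand-in classes below are illustrative placeholders, not the actual statsmodels/pmdarima/Prophet/Keras/PyTorch constructors used in LHTT; the point is the dynamic dispatch that lets LLM-proposed hyperparameters flow into any registered model.

```python
class BaseForecaster:
    """Minimal common interface; real implementations wrap library models."""
    def __init__(self, **hyperparams):
        self.hyperparams = hyperparams

    def fit(self, series):              # placeholder training hook
        self.last_ = series[-1]
        return self

    def predict(self, horizon):         # naive persistence forecast
        return [self.last_] * horizon

class ExponentialSmoothingModel(BaseForecaster): pass
class ARIMAModel(BaseForecaster): pass
class ProphetModel(BaseForecaster): pass
class LSTMModel(BaseForecaster): pass
class TransformerModel(BaseForecaster): pass

MODEL_REGISTRY = {
    "exponential_smoothing": ExponentialSmoothingModel,
    "arima": ARIMAModel,
    "prophet": ProphetModel,
    "lstm": LSTMModel,
    "transformer": TransformerModel,
}

def create_model(name: str, **hyperparams) -> BaseForecaster:
    """Dynamically instantiate a forecaster by name, passing through
    the hyperparameters proposed by the LLM."""
    try:
        return MODEL_REGISTRY[name](**hyperparams)
    except KeyError:
        raise ValueError(f"unknown model: {name}")

model = create_model("lstm", units=[64, 32], dropout=0.2)
```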
3.4 LLM Prompt Module
The LLM prompt module manages interactions with the Gemma3:27B model. It automatically generates query strings for the LLM by synthesizing the current model settings, performance, and results; the generated prompts are delivered through the Ollama API to the Gemma3:27B model running in the local environment, which returns model structure and hyperparameter improvement suggestions in JSON format. Automatic hyperparameter tuning, the core component of LHTT, is implemented as an iterative feedback loop around this model, with prompt delivery and response reception handled through OpenAI-compatible client libraries against the local Ollama server.
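As an illustration, a request against Ollama's OpenAI-compatible chat endpoint might be assembled as below. The endpoint path follows Ollama's documented OpenAI compatibility layer; the system message wording and the `query_llm` helper are assumptions about a typical local setup, not the study's exact code.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/v1/chat/completions"  # Ollama's OpenAI-compatible endpoint

def build_tuning_request(prompt: str, model: str = "gemma3:27b",
                         temperature: float = 0.2) -> dict:
    """Assemble an OpenAI-compatible chat request for the local Gemma3
    model. The low temperature and the JSON-only instruction mirror the
    paper's measures for stable, structured responses."""
    return {
        "model": model,
        "temperature": temperature,
        "messages": [
            {"role": "system",
             "content": "You are a time series tuning assistant. "
                        "Reply with a JSON object of hyperparameter changes only."},
            {"role": "user", "content": prompt},
        ],
    }

def query_llm(prompt: str) -> dict:
    """POST the request to the local Ollama server and parse the JSON
    content of the first choice (requires a running Ollama instance)."""
    payload = json.dumps(build_tuning_request(prompt)).encode()
    req = urllib.request.Request(OLLAMA_URL, data=payload,
                                 headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    return json.loads(body["choices"][0]["message"]["content"])

request = build_tuning_request("RMSE is 4.51; suggest LSTM changes.")
```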
3.4.1 Prompt Design and Structuring
Systematic prompt engineering was performed for effective interaction with LLM. As presented in Table 3, prompts for hyperparameter tuning include current model configuration, training data characteristics, and performance evaluation results structured in JSON format.
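A prompt of the kind summarized in Table 3 can be composed programmatically. The field names in the JSON context below are illustrative; the paper's exact schema may differ.

```python
import json

def build_tuning_prompt(model_name, config, data_stats, metrics):
    """Compose a structured tuning prompt combining the current model
    configuration, data characteristics, and performance results."""
    context = {
        "model": model_name,
        "current_hyperparameters": config,
        "data_characteristics": data_stats,
        "performance": metrics,
    }
    return (
        "Given the following forecasting setup, propose improved "
        "hyperparameters as a JSON object:\n"
        + json.dumps(context, indent=2)
    )

prompt = build_tuning_prompt(
    "LSTM",
    {"units": [32], "window": 24, "dropout": 0.2},
    {"n_obs": 18600, "freq": "hourly", "seasonality": 24},
    {"RMSE": 4.51, "R2": 0.75, "MAPE": 11.97},
)
```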
3.4.2 Iterative Feedback Loop Implementation
The automatic tuning process of LHTT proceeds in the following sequence. First, initial hyperparameters are set based on domain expert knowledge and literature, and forecasting models are trained. Second, structured prompts are generated by combining model performance indicators and hyperparameter information, and improvement plans are queried to the LLM. Third, parameter changes proposed by the LLM are verified and applied to the model for retraining. Fourth, changes are maintained if performance improves, otherwise previous settings are restored. This process is repeated for a user-specified number of times, and all steps are automatically logged to ensure reproducibility.
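The four-step loop above can be sketched as follows. The `train_fn`, `propose_fn`, and `validate_fn` callables are abstractions standing in for model retraining, the LLM query, and the suggestion validity check; the mock components in the usage example are purely illustrative.

```python
import copy

def tuning_loop(train_fn, propose_fn, validate_fn, init_params, n_iters=5):
    """Iterative LLM feedback loop (sketch).

    train_fn(params) -> rmse          # trains a model, returns test RMSE
    propose_fn(params, rmse) -> dict  # LLM-proposed parameter changes
    validate_fn(params) -> bool       # sanity-checks the proposal

    Changes are kept only when RMSE improves; otherwise the previous
    settings are restored, as the text describes. All steps are
    collected in `history` for logging.
    """
    best_params = copy.deepcopy(init_params)
    best_rmse = train_fn(best_params)
    history = [(best_params, best_rmse)]

    for _ in range(n_iters):
        candidate = propose_fn(copy.deepcopy(best_params), best_rmse)
        if not validate_fn(candidate):
            continue                      # reject unrealistic suggestions
        rmse = train_fn(candidate)
        history.append((candidate, rmse))
        if rmse < best_rmse:              # keep the change
            best_params, best_rmse = candidate, rmse
    return best_params, best_rmse, history

# Mock components: the "LLM" here simply doubles the unit count.
train = lambda p: 5.0 - 0.01 * min(p["units"], 100)   # fake RMSE curve
propose = lambda p, r: {"units": p["units"] * 2}
valid = lambda p: 0 < p["units"] <= 256
params, rmse, hist = tuning_loop(train, propose, valid, {"units": 32}, n_iters=3)
```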
3.4.3 Reproducibility and Reliability Measures
The following technical measures were applied to ensure reliability and reproducibility of the LHTT model. Random seeds were fixed at 42 in all experiments to guarantee consistency in model weight initialization and batch sampling. To reduce probabilistic variability in LLM responses, the temperature parameter was set to 0.2 and structured output in JSON format was required. Detailed logs including original prompts, LLM responses, applied hyperparameter changes, and performance metric changes were automatically recorded at each feedback loop stage. Additionally, validity verification steps for LLM suggestions were implemented to filter out unrealistic parameter values or technical errors in advance.
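Two of these measures (seed fixing and suggestion validation) can be sketched as below. The study also seeds TensorFlow and PyTorch, which is omitted here, and the concrete parameter bounds are illustrative, not the paper's exact filtering rules.

```python
import random
import numpy as np

def fix_seeds(seed: int = 42) -> None:
    """Fix the random seeds used across experiments (TensorFlow/PyTorch
    seeding calls omitted in this sketch)."""
    random.seed(seed)
    np.random.seed(seed)

# Plausible bounds for filtering out unrealistic LLM suggestions
# (illustrative values, not the paper's exact rules).
PARAM_BOUNDS = {
    "learning_rate": (1e-5, 1e-1),
    "dropout": (0.0, 0.8),
    "window_size": (1, 168),
}

def validate_suggestion(params: dict) -> bool:
    """Reject suggestions containing out-of-range or non-numeric values."""
    for key, (lo, hi) in PARAM_BOUNDS.items():
        if key in params:
            value = params[key]
            if not isinstance(value, (int, float)) or not lo <= value <= hi:
                return False
    return True

fix_seeds(42)
```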
3.5 Evaluation and Feedback Module
Finally, the evaluation and feedback module handles model performance measurement and the iterative improvement process. In the performance evaluation stage, metrics such as MAE, RMSE, and MAPE are calculated to quantitatively measure forecasting performance, and the evaluation results are transmitted back to the LLM prompt module through the feedback loop to trigger the next tuning cycle.
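The metric set used throughout Section 4 has the standard definitions; a minimal implementation (MAPE reported in percent, matching the tables) is:

```python
import numpy as np

def evaluate(y_true, y_pred) -> dict:
    """Compute the evaluation metrics used by the feedback module:
    MSE, RMSE, MAE, R-squared, and MAPE (in percent)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    err = y_true - y_pred
    mse = np.mean(err ** 2)
    ss_res = np.sum(err ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return {
        "MSE": mse,
        "RMSE": np.sqrt(mse),
        "MAE": np.mean(np.abs(err)),
        "R2": 1 - ss_res / ss_tot,
        "MAPE": 100 * np.mean(np.abs(err / y_true)),
    }

metrics = evaluate([20.0, 25.0, 30.0, 35.0], [22.0, 24.0, 29.0, 36.0])
```

Note that MAPE assumes strictly positive actual values, which holds for PM2.5 concentrations.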
4. Experimental Results and Analysis
4.1 Data Characteristics Analysis
Basic statistical analysis of the collected PM2.5 data revealed a total of 18,600 data points with a mean of 19.32 μg/m³ and standard deviation of 10.09 μg/m³. The minimum value was 1.8 μg/m³ and the maximum 52.64 μg/m³, a wide range with a distribution slightly skewed to the right relative to the mean (Table 4). Notably, numerous values substantially exceeded WHO's 24-hour average PM2.5 recommended level of 15 μg/m³. The high variability of PM2.5 concentrations observed in this study (standard deviation 10.09 μg/m³) and the maximum value of 52.64 μg/m³ are consistent with the extreme value characteristics of Seoul PM2.5 reported by Oh and Lim (2023), reflecting the inherent complexity of Seoul air quality data.
Time series decomposition based on STL was performed for time series characteristics analysis, revealing that PM2.5 concentration time series data showed distinct 24-hour periodic seasonality. Decomposition results showed that the trend component varied between 5.98 and 41.99 with a range of 36, the seasonal component varied between -1.37 and 1.30 with a range of 2.67, and the residual component varied between -14.68 and 13.35 with a range of 28.04. Appropriate decomposition processes are important for effective analysis of time series data. Han and Yu (2023) achieved improved performance compared to existing models in Baltic Dry Index prediction by combining time series decomposition techniques with data augmentation techniques, supporting the validity of the time series decomposition analysis applied in this study.
ADF and KPSS tests were performed to verify the stationarity of the time series, with results summarized in Table 5. The ADF test results showed a test statistic of -4.0979, which is smaller than the 1% critical value of -3.4393, and a p-value of 0.0010, which is smaller than 0.05, rejecting the null hypothesis (unit root exists) and indicating that the time series satisfies stationarity. Conversely, the KPSS test results showed a test statistic of 0.6468, which is larger than the 5% critical value of 0.463, and a p-value of 0.0184, which is smaller than 0.05, rejecting the null hypothesis (time series is stationary) and indicating non-stationarity.
Such contradictory results from the two tests suggest the possibility that the time series contains a deterministic trend. This is because the ADF test examines the existence of stochastic trends, while the KPSS test examines the existence of deterministic trends. However, as shown in Figure 2, the ACF (autocorrelation function) analysis showed a pattern of slowly decreasing correlation coefficients, and stationarity was confirmed in the ADF test, so it was decided to use the original time series data directly for modeling without additional differencing transformations. This is to prevent information loss due to excessive differencing and to enable optimization tailored to individual model characteristics through LLM-based hyperparameter tuning in subsequent modeling.
For change point analysis, the L2 method was used with a minimum segment size of 30 to detect change points, resulting in the identification of 19 structural change points. The most dramatic change occurred at 2025-05-10 20:00 with a 292.9% increase (25.26 μg/m³ increase), while the largest decrease was observed at 2025-05-16 19:00 with a 69.1% decrease (18.86 μg/m³ decrease). Major change points included 109.6% increase at 2025-05-18 01:00, 102.1% increase at 2025-05-26 19:00, 102.7% increase at 2025-05-22 15:00, and 97.7% increase at 2025-05-25 13:00, with multiple points showing abrupt changes exceeding 100%. Additionally, significant change points exceeding 50% were observed, including 62.8% increase at 2025-05-06 20:00, 56.4% increase at 2025-05-08 07:00, and 63.9% decrease at 2025-05-09 13:00 (Figure 3).
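The L2 criterion used above scores a candidate segmentation by the sum of squared deviations from each segment's mean. As a hedged illustration of the idea (libraries such as `ruptures` implement this far more efficiently, and the study's exact tooling is not stated), a minimal binary segmentation with the L2 cost and a minimum segment size looks like this:

```python
import numpy as np

def l2_cost(x):
    """L2 segment cost: sum of squared deviations from the segment mean."""
    return float(np.sum((x - x.mean()) ** 2))

def binseg(x, n_bkps, min_size=30):
    """Minimal binary segmentation: repeatedly split the segment whose
    split yields the largest reduction in total L2 cost, respecting the
    minimum segment size."""
    x = np.asarray(x, dtype=float)
    breakpoints = []
    segments = [(0, len(x))]
    for _ in range(n_bkps):
        best = None
        for (s, e) in segments:
            base = l2_cost(x[s:e])
            for b in range(s + min_size, e - min_size + 1):
                gain = base - l2_cost(x[s:b]) - l2_cost(x[b:e])
                if best is None or gain > best[0]:
                    best = (gain, b, (s, e))
        if best is None:
            break
        _, b, seg = best
        segments.remove(seg)
        segments += [(seg[0], b), (b, seg[1])]
        breakpoints.append(b)
    return sorted(breakpoints)

# Step signal: mean jumps from 20 to 45 at index 100.
signal = np.concatenate([np.full(100, 20.0), np.full(100, 45.0)])
bkps = binseg(signal, n_bkps=1, min_size=30)
```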
These frequent and abrupt change points indicate that PM2.5 concentrations exhibit highly dynamic characteristics and may have been influenced by complex external factors such as rapid environmental condition changes, meteorological pattern variations, or fluctuations in anthropogenic emission sources. In particular, the extreme changes exceeding 200% observed at some change points may be associated with episodic pollution events or sudden meteorological shifts. These numerous structural change points and high variability suggest that flexible models capable of dynamically learning temporal pattern changes (such as LSTM and Transformer) are more suitable than simple linear models or statistical models with static parameters. This explains why deep learning models achieved R² values of 0.75-0.79 while statistical models showed negative R² values: the frequent regime shifts fundamentally violate the stationarity assumptions of traditional time series models. It also justifies the need for adaptive architectures and LLM-guided hyperparameter tuning that balance sensitivity to structural changes with prediction stability.
4.2 Baseline Model Performance Comparison
Baseline performance comparison of the five models revealed significant performance gaps between deep learning models and statistical-based models (Table 6). Exponential Smoothing showed the lowest performance across all evaluation metrics with MSE 9289.78, RMSE 96.38, MAE 79.99, R² -114.59, and MAPE 313.59%. Particularly, the extremely low R² value of -114.59 indicates that this model failed to explain any variability in the PM2.5 data.
The ARIMA model showed considerably superior performance compared to exponential smoothing with MSE 191.14, RMSE 13.83, MAE 11.03, R² -1.38, and MAPE 32.92%, yet still demonstrated significantly lower accuracy than deep learning models. The Prophet model showed even poorer performance than ARIMA with MSE 690.13, RMSE 26.27, MAE 23.77, R² -7.59, and MAPE 78.08%.
In contrast, deep learning models demonstrated superior performance. The LSTM model showed significantly better metrics than statistical-based models across all indicators with MSE 20.34, RMSE 4.51, MAE 3.53, R² 0.75, and MAPE 11.97%. Particularly, R² 0.75 indicates that the model can explain 75% of data variability. The Transformer model recorded MSE 16.65, RMSE 4.08, MAE 2.95, R² 0.79, and MAPE 9.53%, achieving the best performance among all evaluated models and demonstrating its ability to effectively capture complex patterns and long-range dependencies in PM2.5 time series data.
4.3 LLM-based Hyperparameter Tuning Results
The LLM analyzed the current performance and structure of each model to propose various hyperparameter adjustment strategies. For Exponential Smoothing, it proposed changing the existing additive trend and seasonality to a multiplicative model and activating damped trend. For the ARIMA model, it proposed increasing both AR and MA orders from 1 to 2, considering the stronger autocorrelation patterns observed.
For the Prophet model, it increased changepoint_prior_scale from 0.01 to 0.1 to enable more sensitive detection of change points in the data. For the LSTM model, it changed the LSTM unit configuration from a single layer of 32 units to a two-layer structure of [64, 32] to improve model expressiveness, and increased the input sequence length from 24 to 48 to learn longer temporal dependencies. For the Transformer model, it increased window_size from 24 to 48 to utilize a broader range of historical information, increased embedding dimensions from 32 to 64, and increased the number of attention heads from 2 to 4 to enhance model expressiveness. After retraining each model with the hyperparameters proposed by the LLM, statistical-based models showed dramatic performance improvements (Figure 4).
Exponential Smoothing showed dramatic improvement with MSE decreasing by 98.52% and RMSE decreasing by 87.85%. The Prophet model also showed significant improvements with MSE decreasing by 53.39% and RMSE decreasing by 31.72%. The LSTM model showed further enhanced performance through LLM tuning. MSE decreased by 32.54%, and RMSE decreased from 4.51 to 3.70, representing a 17.86% reduction. The R² value improved from 0.75 to 0.82, and MAPE improved from 11.97% to 9.71%, representing an 18.89% improvement (Table 7). Interestingly, the Transformer model, which showed the best performance before tuning, experienced performance degradation after the parameter changes proposed by the LLM. MSE increased by 101.17%, and RMSE increased from 4.08 to 5.79, representing a 41.83% increase. This suggests that increased model complexity may have caused overfitting in the limited dataset.
4.4 Optimal Model Selection and Performance Analysis
After LLM tuning, the LSTM model showed the best performance across all evaluation metrics and was selected as the optimal model. The LLM-tuned LSTM model achieved RMSE 3.70, R² 0.82, and MAPE 9.71%, which was superior to the Transformer model that had the best performance before tuning. As shown in Table 8, the main architecture of the LLM-tuned LSTM selected as the optimal model consisted of two LSTM layers ([64, 32] units), prediction based on past 48 hours of data, dropout rate of 0.2, batch size of 32, learning rate of 0.001, and maximum training iterations of 100. This configuration enables the capture of longer-term patterns by considering PM2.5 data's daily periodicity on a 2-day basis and can learn complex nonlinear relationships while preventing overfitting through the two-layer structure and appropriate dropout rate.
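The 48-hour windowing step behind this architecture can be sketched as follows; the full Keras model (two LSTM layers of [64, 32] units, dropout 0.2) is omitted here, and `make_windows` is an illustrative helper rather than the study's code.

```python
import numpy as np

def make_windows(series, window=48, horizon=1):
    """Build supervised (input, target) pairs for the tuned LSTM:
    each sample uses the past `window` hours to predict the value
    `horizon` steps ahead."""
    series = np.asarray(series, dtype=float)
    X, y = [], []
    for i in range(len(series) - window - horizon + 1):
        X.append(series[i:i + window])
        y.append(series[i + window + horizon - 1])
    return np.array(X), np.array(y)

X, y = make_windows(np.arange(100.0), window=48)
```

With a 48-hour window, each sample spans two full daily cycles, which is what lets the model capture the 2-day patterns described above.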
R² 0.82 indicates that the model explains 82% of the variability in PM2.5 data, which is a high level demonstrating effective capture of complex patterns and variability in time series data. MAPE 9.71% means prediction with approximately 9% error relative to actual values, which represents a considerably reliable level from a practical perspective, enabling timely public health warnings and supporting evidence-based air quality management policies such as traffic restrictions or industrial emission controls during predicted high-pollution episodes.
5. Discussion and Conclusion
The comparative analysis between baseline and LLM-tuned models reveals significant insights into the nature of time series forecasting automation. The dramatic performance improvements in statistical models—Exponential Smoothing achieving 87.85% RMSE reduction—demonstrate that traditional models often operate far below their potential due to suboptimal hyperparameter settings. This finding challenges the conventional wisdom that deep learning models inherently outperform statistical approaches, suggesting instead that the performance gap often stems from configuration complexity rather than fundamental limitations.
The unexpected performance degradation of the Transformer model after LLM tuning provides crucial insights into the limits of automated optimization. While the LLM correctly identified the model's capacity for handling longer sequences, the increased complexity led to overfitting in our limited dataset. This highlights the importance of the LLM's ability to balance model complexity with data availability—a nuance successfully captured for the LSTM model, where the two-layer architecture with 48-hour sequences achieved optimal performance.
From an environmental management perspective, the achieved MAPE of 9.71% enables practical implementation of graduated response systems. Environmental agencies can establish threshold-based alert levels: green (predicted PM2.5 < 15 μg/m³), yellow (15-35 μg/m³), orange (35-50 μg/m³), and red (> 50 μg/m³), with corresponding policy interventions. The 24-48 hour prediction window provided by our framework aligns with the operational requirements for implementing traffic restrictions, adjusting industrial operations, and issuing public health advisories. Moreover, the complete automation from data collection to report generation reduces the technical burden on environmental agencies, enabling smaller municipalities to implement sophisticated air quality management systems without extensive data science teams.
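The graduated alert bands described above can be sketched as a simple threshold mapping. Boundary values are assigned to the higher band here, which is an assumption on our part since the text gives only the nominal ranges:

```python
def alert_level(pm25):
    """Map a predicted PM2.5 value (ug/m3) to the graduated alert
    bands described in the text. Boundary values fall into the
    higher band (an assumption; the source lists only ranges)."""
    if pm25 < 15:
        return "green"
    if pm25 < 35:
        return "yellow"
    if pm25 < 50:
        return "orange"
    return "red"

# Example: classify a set of forecast values
for value in (10, 20, 40, 60):
    print(value, alert_level(value))
```

In an operational system, each band would trigger the corresponding policy intervention (e.g. traffic restrictions at "orange" or "red") within the 24-48 hour prediction window.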
In conclusion, this study proposed LHTT, a time series forecasting automation framework utilizing generative AI, and empirically validated its effectiveness on Seoul air quality data. Applying automatic hyperparameter tuning with Gemma3:27B to five time series forecasting models (Exponential Smoothing, ARIMA, Prophet, LSTM, and Transformer) resulted in the LLM-tuned LSTM achieving the best forecasting performance, with RMSE 3.70, R² 0.82, and MAPE 9.71%. This represents an improvement of approximately 18-33% over the baseline LSTM, while the statistical models also improved after LLM tuning.
This study advances the LLM-in-the-loop methodology by demonstrating that performance gaps between model types often reflect configuration challenges rather than algorithmic limitations, and it provides a practical framework through which environmental agencies can implement AI-driven air quality management with minimal technical expertise. Unlike the fixed algorithm-search approaches of existing AutoML systems, LHTT exploits the LLM's domain knowledge and reasoning capabilities to devise optimization strategies tailored to the problem at hand, extending the traditional hyperparameter tuning paradigm. From a practical perspective, its significance lies in a fully automated analysis system that achieves high forecasting performance while minimizing expert intervention. In particular, the MAPE of 9.71% is accurate enough for actual environmental policy formulation, and the LLM-generated analysis reports ensure transparency and interpretability of the analysis process.
This study has four main limitations. First, the one-month data period constrains long-term pattern analysis, although the high-frequency hourly observations (18,500 data points) and the structural variations they capture provide sufficient validation of the LHTT framework's core capabilities, and its domain-agnostic LLM-based approach suggests applicability to other time series domains pending empirical validation. Second, the univariate analysis leaves multivariate characteristics of the data underexploited. Third, the probabilistic variability of LLM responses raises reproducibility issues. Fourth, stability in real-time operational environments has not been verified.
Future research should introduce multivariate time series modeling, apply ensemble techniques, extend the framework to various domains, and address the technical requirements of real-time deployment. The LHTT framework proposed in this study demonstrates the potential of generative AI to transform traditional data science tasks and presents a new paradigm for using LLMs as intelligent analytical partners in time series forecasting. We expect that broader application across industrial sectors will improve both data scientists' work efficiency and time series forecasting accuracy.