Unraveling the Trajectory of Data Science Salaries in the United States: A Comprehensive Analysis from 2020 to 2023 with Future Salary Projections

Abstract: The purpose of this study is to analyze the impact of position, level of expertise, and company size on salary levels in the field of data science, and to project salaries for data science professionals in 2024. The study can help data science professionals understand the main factors that affect salary levels, as well as the salary environment and trends in the field in 2024. With the explosive growth of big data, the oversupply of data science jobs and increasingly precise hiring needs have led to significant year-to-year changes in the salaries of data science careers. The data were thoroughly cleaned and preprocessed to maintain quality and consistency, including handling missing values and removing outliers. Descriptive analysis techniques were then used to understand the current state of data science salaries, calculating statistics such as the mean, median, and standard deviation. Time series modeling was used to determine how key factors affect pay levels over time. To further investigate salary trends, ARIMA was applied to visualize the evolution of data science salaries from 2020 to 2023 and then to forecast average salary levels for different positions in the field in 2024. In summary, the study analyzes the important factors affecting data science salaries and the 2024 salary trends for different data science careers, and provides recommendations for data science stakeholders.


Introduction
Data science has become a transformative field that harnesses the power of data to discover valuable insights, make informed decisions, and drive innovation across industries. As companies become increasingly aware of the importance of data science, the demand for data science professionals has skyrocketed. At the same time, data science employment opportunities are growing, and data science talent is clustering in the U.S., a global center for technology and innovation.
Applying big data principles and methods in this field to improve business efficiency and make wiser decisions has become a frequent topic of discussion among scholars [1].
This study focuses on the evolution of data science salaries in the US from 2020 to 2023 and reveals the trends and patterns that emerged during this period. Understanding the evolution of salaries provides insights into the dynamics of the job market, the state of the economy, and the factors that influence the salary levels of data science professionals. The job market for data science professionals is dynamic and diverse, and we have grouped its careers into four categories: engineers, analysts, scientists, and architects. These professionals are sought after by major companies for their expertise in working with big data, implementing advanced analytics, and building machine learning models to drive decision making.
Understanding salary trends is critical not only for job seekers and employees looking for fair compensation packages, but also for organizations and businesses that need this data to offer fair compensation and retain data science talent. Analyzing salary trends can provide insights into the growth and maturity of the field, help those working in data science make better choices about the careers they want to pursue, and give others, including organizations and companies that need data science talent, an understanding of the current state and future of data science as a discipline.
In this study, we collected data from job postings and salary reports for different data science positions in the United States from 2020 to 2023. We first analyze the impact of experience level (e.g., entry level, mid-level, senior level), company size (small to large), and job title on salary. Data analyst, data engineer, machine learning engineer, and data scientist are all job titles related to data science [2]. Linear regression was used to conduct exploratory analyses of the data to test the fit between different variables and salary, and control variables were used to test that fit under different combinations of variables. We use these analyses to disentangle the interactions among the variables and to distinguish the relationships of experience level, company size, and occupation with salary trends. In addition, to forecast future salary trends, we used time series modeling on historical salary data to produce a forecasting model that captures the dynamic nature of salaries over time. In this way, we were able to make a basic forecast of data science salaries in the United States in 2024. Through multiple linear regression and time series modeling, we hope to provide comprehensive insights for job seekers, employees, and employers in the data science field.

Factors Affecting Data Science Salaries
One reason data scientists are in high demand lies in the role itself, which is broad and general while being essential to every operational process within a business [3]. In their article on data science salary prediction, Tee Zhen Quan and Mafas Raheem studied the top 10 most in-demand data science jobs and found that broader job tasks and a less job-specific title result in higher demand, which in turn suggests that the number or range of tasks plays a role in determining the salaries of data scientists. This is expected, since in general a higher job status results in higher salary and demand. The range of skills required by each job title and their relationship to salary were also examined. Total demand in 2021 was documented, and Figure 1 shows that Python, SQL, and R were the most in-demand skill sets, while other programming languages shared a similarly low level of demand. Figure 2 shows that the rate of demand growth for each skill set was also considered, which could play an important role in influencing the different titles within data science and the data science industry as a whole.

Models to Predict Data Science Salary
Unfortunately, the salary prediction model of Tee Zhen Quan and Mafas Raheem [4] was developed on the SAS Enterprise Miner platform, which lacks transparency and could generate biased results. Moreover, few details were given about the data preparation beyond mentioning a few nodes containing "high-performance models and algorithms". As for the results, salary increases along the "job hierarchy ladder", from the lowest junior level to the highest principal level, as expected. Similarly, in our research, we also analyzed the effect of "expertise level" on salary.
Most research papers on data scientists' salaries focus on skills but do not consider one of the key determinants of salary: the company. To address this, we decided to include "company size", which can indirectly reflect the power or status of the company. With this information, future studies of data scientists' influence on a company's status or progression within its industry could be conducted once the data science field has developed sufficiently and that influence has become noticeable.

Methods for Time-series Prediction
In time series analysis, ARIMA models are flexible and popular because they offer accurate short-run predictions for quantity-supported datasets [5]. Mohamed Reda Abonazel and Ahmed Ibrahim Abd-Elftah used the ARIMA model in depth to forecast Egyptian GDP and valued its "long-term memory" in using data from the previous year. In our research on forecasting data scientists' salaries by job title, we also noticed how this "long-term memory" helped us find complicated details that converged into a long-term trend, which proved essential for accurate predictions.
Computer science is a well-developed field, so algorithms for salary prediction with various goals are already a popular subject. Unlike our goal of predicting the general salary trend for data science jobs, the research paper in [6] aims to predict a precise salary for a specific CS position, owing to its competitive nature and clear job performance measures. It employs three algorithms for salary prediction: Naive Bayes, Support Vector Machine (SVM), and Random Forest. These algorithms are good at real-time prediction, which suits their setting, while our model aims to give a general but accurate idea of the future development of data science.

Methodology
This dataset was chosen for this study because it comes from a reliable data source and spans a long enough period to cover the salaries of data science professionals from 2020 to 2023. It also includes job titles, levels of expertise, company size, and experience levels, as well as company location and employee nationality. First, the data were cleaned and preprocessed. Because all salaries were already standardized to US dollars in the column "Salary in USD", we deleted the redundant columns "Salary" and "Salary Currency", and then handled missing values and outliers. The data were then subjected to exploratory analyses to understand the distributions of, and relationships between, the different variables. From this step, important factors were identified to understand their impact on the salaries of data science professionals. Linear regression is a machine learning algorithm based on supervised learning. It predicts the value of a dependent variable (y) from a given independent variable (x), and therefore looks for a linear relationship between input x and output y [7]. Linear regression models were then used to establish the relationship between experience level, job title, company size, and salary; these factors were used to build models to determine which factors influenced salary levels and salary trends in 2020-2023, and diagnostic checks were used to assess the validity of the regression models. We then applied time series models such as ARIMA (AutoRegressive Integrated Moving Average) to historical salary data to see how salaries for different occupations moved between 2020 and 2023 and to predict the salary trends for data science jobs in 2024.
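The cleaning and descriptive steps described above can be sketched as follows. This is a minimal illustration and not the authors' code: the field name `salary_in_usd` and the 1.5 x IQR outlier rule are assumptions, since the text does not state which outlier criterion was used.

```python
import statistics

def clean_salaries(rows):
    """Drop rows with a missing salary, then remove outliers.

    Assumption: outliers are values outside 1.5 * IQR of the quartiles;
    the paper does not say which rule it applied.
    """
    salaries = [r["salary_in_usd"] for r in rows
                if r.get("salary_in_usd") is not None]
    q1, _, q3 = statistics.quantiles(salaries, n=4)
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [s for s in salaries if low <= s <= high]

def describe(salaries):
    """Mean, median, and sample standard deviation of the cleaned salaries."""
    return {
        "mean": statistics.mean(salaries),
        "median": statistics.median(salaries),
        "stdev": statistics.stdev(salaries),
    }
```

Calling `describe(clean_salaries(rows))` on a list of dictionaries yields the three summary statistics named in the abstract.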

Correlation Analysis
The data were analyzed using a linear regression model based on three key factors, "Experience Level", "Company Size", and "Job Title", and the coefficients and statistical significance of the predictor variables were assessed. The results show a significant association between data science salaries and the predictor variables. Figure 3 shows a positive salary trend with "level of experience": salary tends to increase as professionals progress in their careers. This result is consistent with the conventional wisdom that experience brings additional expertise, making experienced data scientists more sought after by employers. Figure 3 also shows that, over the study period, data science professionals hired by midsize companies experienced higher salary growth than their counterparts at smaller companies.
Interestingly, Figure 3 also shows that some occupations have a strong correlation with salary levels, for example Business Intelligence Developer, Data Analyst, Data Manager, Data Science Consultant, and Data Strategist. When choosing a career in data science, job seekers can take the salary levels of different occupations into account. These findings provide data science professionals, as well as employers, with insights into the factors that influence salaries in the U.S. data science field and help organizations make informed decisions about compensation strategy and planning.
Overall, this analysis enriches the existing knowledge of data science salaries and provides a basis for further investigation of salary trends and of the data science job market. As the field of data science continues to evolve, understanding the significant factors that affect salaries is important for organizations to retain talent and make competitive compensation decisions. Among these factors, occupation, company size, and experience level matter both to people looking for jobs and to organizations making related decisions.
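A regression of salary on categorical factors, as in this section, is usually fitted by turning each factor into dummy (0/1) columns. The sketch below shows that idea in plain Python under stated assumptions: the column names are hypothetical, the first level of each factor (in sorted order) is used as the baseline, and the coefficients come from the normal equations. It illustrates the technique only and does not reproduce the model behind Figure 3.

```python
def one_hot(rows, columns):
    """Design matrix with an intercept and drop-first dummy columns."""
    names, levels = ["intercept"], {}
    for col in columns:
        # The first sorted level is the baseline and gets no column.
        levels[col] = sorted({r[col] for r in rows})[1:]
        names += [f"{col}={lv}" for lv in levels[col]]
    X = [[1.0] + [1.0 if r[col] == lv else 0.0
                  for col in columns for lv in levels[col]]
         for r in rows]
    return names, X

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= f * M[c][k]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (M[i][n] - sum(M[i][k] * x[k] for k in range(i + 1, n))) / M[i][i]
    return x

def fit_ols(X, y):
    """Least-squares coefficients from the normal equations (X'X) b = X'y."""
    p = len(X[0])
    XtX = [[sum(row[i] * row[j] for row in X) for j in range(p)] for i in range(p)]
    Xty = [sum(row[i] * yi for row, yi in zip(X, y)) for i in range(p)]
    return solve(XtX, Xty)
```

Each coefficient is then read as the average salary difference from the baseline level, holding the other factors fixed. Solving the normal equations directly is adequate for a small illustration; a statistics package would normally use a more stable decomposition.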

Model Diagnosis
However, in diagnosing the multiple linear regression model, we found that the model is not suitable for this data set, for the following four reasons. Firstly, when analyzing the linearity of relationships, the significance of variables, and multicollinearity, we examined the model output, which includes the β coefficients, the R-squared, and the F-statistic. It shows that experience level and expertise level in the data set have a perfectly linear relationship, as shown in Figure 4, so these two independent variables cannot both be used as predictors of salary at the same time. To quantify the performance of a linear model, we consider the coefficient of determination (R²), which measures the share of the output's variability that the linear model is able to capture and depends on the variance of the data and the sum of the squared errors of the model [8]. Although our selected factors, such as experience level, company size, and occupation, are significant, the adjusted R-squared obtained by the model is 0.3061, which indicates that it is not a well-fitting regression model.
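For reference, the coefficient of determination and its adjusted form can be computed as in the sketch below. These are the standard textbook definitions, not code from the study, and the function names are illustrative.

```python
def r_squared(y, fitted):
    """R^2 = 1 - SS_res / SS_tot."""
    mean_y = sum(y) / len(y)
    ss_res = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return 1.0 - ss_res / ss_tot

def adjusted_r_squared(r2, n_obs, n_predictors):
    """Adjusted R^2 penalizes R^2 for the number of predictors used."""
    return 1.0 - (1.0 - r2) * (n_obs - 1) / (n_obs - n_predictors - 1)
```

The adjustment matters here because each job title adds a dummy predictor, so the plain R-squared would overstate the fit.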

Figure 4: Summary of Linear Regression Results
Secondly, when we analyzed whether the variance of the errors is constant across predicted values, which is also referred to as homoscedasticity, we ran a quick test of the correlation between the absolute value of the residuals and the independent variable X. The null hypothesis is that the residuals do not vary with X. Because the p-value obtained is less than 0.05, we can reject the null hypothesis: the absolute value of the residuals is significantly correlated with the independent variable, so the residuals vary with X. This violates the assumptions of the model, which shows that it is not a well-fitting linear regression model. Thirdly, when checking the normality of the residuals, we used the following methods: a histogram, as shown in Figure 6, and summary statistics, as shown in Figure 7, to visually check for skew; a normal quantile plot of the residuals; and shapiro.test(residuals(regressedModel)) to test for normality. The results indicate that the data have many outliers, as shown in Figure 5. For normality, the points on the plot should lie close to the normal line; however, regSalary is not normal because of its long tails. Based on the p-value obtained from the Shapiro-Wilk test, we can reject the null hypothesis, which means the residuals do not conform to the normal distribution. Therefore, the model is not suitable for analysis of this data set.
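The quick homoscedasticity check described above amounts to correlating the absolute residuals with the predictor. A minimal sketch, assuming a single numeric predictor `x` and a list of residuals (both hypothetical names), is given below. It returns the Pearson correlation and its t statistic; converting the statistic to a p-value requires a t distribution with n - 2 degrees of freedom, which is left out here.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def abs_residual_check(x, residuals):
    """Correlate |residuals| with x; a large |t| suggests heteroscedasticity.

    Under the null hypothesis of no correlation, t follows a t distribution
    with n - 2 degrees of freedom. Not defined when |r| equals 1.
    """
    r = pearson(x, [abs(e) for e in residuals])
    n = len(x)
    t = r * math.sqrt((n - 2) / (1.0 - r * r))
    return r, t
```

A correlation near zero is what constant error variance would produce; a clearly positive value means the spread of the residuals grows with the predictor.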

Predictions
Previously reported results revealed that the ARIMA model has strong potential for short-term prediction [9]. Based on the time series (ARIMA) analysis and predictions, several key results were obtained. From 2020 to 2021, the average salary of "Scientist" related professions declined, but from 2021 onwards it rebounded and rose beyond the 2020 average, showing a strong upward trend. Despite this, the average salary of "Scientist" related occupations, as shown in Figure 9, is expected to decline from its 2023 peak to $162,415.60 in the 2024 forecast. The average salary of "Architect" related occupations began to decline in 2021 and reached its lowest point in 2022. From that low, it recovered about a third of the total decline by 2023. For 2024, the average salary for "Architect" related data science careers is forecast to increase slightly, to $179,248.70 (Figure 10).
Salaries for "Engineer" and "Analyst" related occupations show a continuous upward trend from 2020 to 2023. The average salary of engineers is forecast to decrease slightly in 2024, to $173,616.20 (Figure 11), while the average salary of analysts is forecast to decrease significantly, to $106,192.80 (Figure 12). These results highlight the dynamic nature of wage trends across occupations. Although scientists' average salaries are expected to face a temporary setback in 2024, they remain relatively high. "Architect" related occupations are expected to emerge from their downturn. Although the average salary of "Engineer" related occupations is forecast to decrease slightly, it remains the most stable occupation, which helps it maintain its advantage. The decline is most pronounced in "Analyst" related occupations, which also have a lower average salary, indicating potential challenges or changes in this job market that may cause those intending to enter this field to reconsider.
In summary, the salary forecasts and analysis for "Scientist," "Architect," "Engineer," and "Analyst" occupations provide insights into the dynamics of the job market in these fields. The predicted trends indicate both opportunities and challenges, urging professionals and organizations to adapt constantly to change and to use the results of the analysis to inform their own decisions.
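The text does not state which ARIMA order was fitted. As an illustration of how a one-step forecast arises from a short annual series, the sketch below uses the simplest member of the family, ARIMA(0,1,0) with drift (a random walk with drift), which is an assumption and not necessarily the model behind Figures 9 to 12. The forecast is the last observation plus the mean yearly change.

```python
def random_walk_with_drift_forecast(series, steps=1):
    """Forecast ARIMA(0,1,0) with drift: y_t = y_(t-1) + c + e_t.

    The drift c is estimated as the mean of the first differences.
    """
    diffs = [b - a for a, b in zip(series, series[1:])]
    drift = sum(diffs) / len(diffs)
    return [series[-1] + drift * h for h in range(1, steps + 1)]

# Hypothetical yearly averages for 2020-2023 (not the paper's data).
example = [100000.0, 90000.0, 120000.0, 130000.0]
forecast_2024 = random_walk_with_drift_forecast(example)[0]
```

Because this model always extends the average past change, it cannot produce a forecast that falls after a rising series; the declines forecast in this section therefore imply a model with autoregressive or moving-average terms.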

Conclusion
Predicting salary trends in data science is a very complex task, considering that some decisive factors are not accessible and some latent yet crucial variables are difficult to capture. Overall, our model confirms the significance of experience level, company size, and job title for salary trends and has reasonably decent predictive ability. However, having been trained and tested on data covering various experience levels, company sizes, and job titles, our model proved not to be a well-fitting regression model. The main problem is that a few job titles in data science are not well fitted; in other words, some job titles do not play a crucial role in influencing salary in data science. Therefore, our subsequent research will focus on finding a more appropriate regression model for the relationship between these factors and salary trends in data science. Besides the multiple linear regression model, we can explore other nonlinear or machine learning models and see whether higher accuracy can be achieved. In addition, the research can add more objective measures and indexes to better evaluate salary trends, for example working location, the tools used, education and academic background, and specializations and skills.

Figure 1: Top 10 Data Science Skills in 2021

Figure 2: Top 10 Growing Data Science Skills between 2019 and 2021

Figure 3: Results of a linear regression analysis of dollar salary level versus experience level, company size, and job title.

Figure 7: Linear Regression Model Histogram and summary stats

Last but not least, when we checked the independence of observations, which means that the residuals are not autocorrelated, the p-value and D-W value obtained from the Durbin-Watson test, as shown in Figure 8, indicated that the residuals are positively autocorrelated. That is to say, the model is not suitable for the data set.
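The Durbin-Watson statistic referred to above has a simple closed form, sketched here for reference. This is the standard definition and not code from the study; the associated p-value is not computed.

```python
def durbin_watson(residuals):
    """Durbin-Watson statistic: sum of squared successive differences
    of the residuals divided by their sum of squares.

    Values near 2 suggest no first-order autocorrelation, values well
    below 2 suggest positive autocorrelation, and values well above 2
    suggest negative autocorrelation.
    """
    num = sum((b - a) ** 2 for a, b in zip(residuals, residuals[1:]))
    den = sum(e * e for e in residuals)
    return num / den
```

Residuals that stay on the same side of zero for several observations in a row pull the statistic toward 0, which is the pattern the paragraph above reports.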