Keywords
machine learning, synthetic data, smoking, tobacco, adolescent girl, algorithm, behavioral science
Tobacco use trends among adolescents in low- and middle-income countries, and in particular narrowing gender gaps, highlight the need for interventions to prevent and/or reduce tobacco use among adolescent girls. We evaluated a social marketing program in Ghana discouraging tobacco use among adolescent girls and additionally investigated the pathways influencing smoking behaviors to identify programmatic opportunities for impact. Leveraging the data collected through the stepped wedge cluster randomized trial and panel survey of 9000 girls aged 13–19, we applied machine learning (ML) techniques to identify the most important variables for predicting initiation of smoking.
To identify predictors of smoking initiation we sought to develop a model which could accurately differentiate smokers from non-smokers and evaluated various ML approaches for training classifier algorithms to achieve this. We selected a Synthetic Minority Over-sampling Technique (SMOTE) because it optimized the recall and precision of the model. We then utilized the technique of feature importance for greater insight into how the model arrived at its decisions and to rank the most important variables for predicting smokers. To explore different dimensions of smoking behavior, including initiation and continuation, we trained our model by using several combinations of target outcomes and input variables from the panel survey.
The resulting features of smokers highlight the importance of girls’ independence and connectivity, social environment, and peer influence on likelihood of smoking, and in particular subsequent initiation. These results were largely consistent with our formative research findings based on qualitative interviews informed by behavioral science.
This novel application of ML techniques demonstrates how data science approaches can generate new programmatic insights from rigorous evaluation data, especially when data collection is informed by behavioral theory. Such insights about the relative importance of different features can be valuable input for program planning and outreach.
This version is a response to reviewer comments. Additional details and references have been added to the Introduction section. Additionally, we have clarified the secondary analyses performed, added the notable results, and expanded on a few points in the Discussion section.
Machine learning (ML) is a discipline at the intersection of data science and artificial intelligence with a focus on building algorithms to make predictions without requiring explicit programming to do so1. The approach involves building a model using sample training data. These methods are increasingly being applied to support evidence-based decision making across a wide range of fields, including global health. Here we report our experience applying ML to generate programmatic insight from data collected for a rigorous impact evaluation.
Rising tobacco use among adolescents in low and middle-income countries (LMICs)i and increasing tobacco use among girls relative to boysii highlight the need for interventions to prevent and reduce tobacco use among adolescent girls. Non-cigarette tobacco products are contributing to the narrowing gender gap,iii and surveys suggest shisha use is now more prevalent among girls than boys in Ghana, although with significant regional variation.iv As external evaluators of a social marketing program in Ghana focused on discouraging tobacco use among adolescent girls, we conducted a stepped wedge cluster randomized trial and panel survey of 9000 girls aged 13–19 in select neighborhoods of Accra and Kumasi over twelve months of 2021–2022. A secondary objective was to provide additional behavioral insights on the pathways influencing smoking behavior among teenage girls in Ghana and to identify programmatic opportunities for impact. Although a formative research phase with qualitative interviews informed by the behavioral science literature was conducted to support this objective2, we were interested to explore how recent advances in ML techniques might be applied to our panel survey data to generate additional insights related to the predictors of smoking initiation and to compare and contrast findings from these two approaches.
Although our study had a relatively large sample, smoking behavior was quite rare. Only 1.6% of girls reported having ever tried smoking (defined as cigarette or shisha use) at baseline, and 0.1% reported having smoked in the past 30 days. We wanted to think more creatively about ways to profile potential smokers in this population beyond basic demographic slices of the sample. Our objective was first to build an algorithm that would predict which girls are more likely to be smokers and then to identify which factors are most important for making those predictions. In this research note we describe how we explored several ML methods to build a classifier model and then applied an ML explainability method to profile the most important predictors of smokers using different smoking definitions, as well as different data subsamples.
Data collection for the stepped wedge evaluation included a survey of all 9000 girls over four rounds: Baseline (before any implementation), Midline 1 (after the first period of implementation), Midline 2 (after the second), and Endline (after all areas had been activated). Participants aged 13–19 were recruited through multi-stage sampling in which neighborhood clusters in Accra and Kumasi were first randomly selected for the study, and then households within each cluster were systematically approached through a community mapping process until the target number of adolescent girls willing to participate in the study was identified. To be considered eligible for the study, girls had to have access to a phone and had to intend to remain at their residence over the subsequent year, to enable enumerators to reach them at future rounds of the panel survey. The questionnaire was informed by the formative research and a smoking-oriented theory of change and comprised about 100 questions covering: background and demographic information; social context, confidence, and self-efficacy; sources of influence on girls’ decisions and actions; tobacco perceptions and norms; tobacco use, opportunity, and refusal; and program exposure and perceptions. Tobacco use questions were adapted from the Global Youth Tobacco Survey.v The questionnaire was reviewed for face validity with the program team and portions were piloted with 396 adolescent girls during the formative research phase2. Most responses were binary or Likert scales of 1 to 4 that were recoded to binary (agree vs. disagree, likely vs. not likely) for this exploratory exercise. Several indices of multiple survey items that were confirmed to have high internal consistency served as indicators of intermediate outcomes in the impact evaluation; however, for this application of ML techniques survey items were maintained as individual variables.
Ethical review and approval for the evaluation study were provided by Innovations for Poverty Action IRB (protocol #15798) and the Ghana Health Service Ethics Review Committee (GHSERC: 003/11/20). Written informed consent to participate in the study was obtained from participants aged 18 years or older. For younger participants, written parental consent was first obtained followed by written assent of the adolescent.
Initially, we trained a basic Random Forest classifier using the baseline data to identify girls who have ever smoked, which performed well in terms of accuracy (Table 1). Accuracy is defined as the proportion of all girls who are correctly classified as either smokers or non-smokers (Figure 1).
To optimize our baseline model, we chose to focus on increasing the recall metric, rather than just accuracy. Recall is the proportion of actual smokers that the model correctly identifies; a high recall value indicates that the model identifies most girls who smoke, while a low recall value suggests that the model is missing a significant number of smokers. Because smoking prevalence in our data is so low, optimizing for accuracy alone can bias the model towards classifying most individuals as non-smokers. We were therefore more concerned with correctly identifying those who did smoke, to ensure effective programmatic targeting towards likely smokers, even if it meant that some non-smokers were mistakenly identified as smokers.
Additionally, we focused on improving the model’s precision. If the precision of the model is low, it means that the model is classifying many individuals as smokers even though they are not, which could result in potentially unnecessary or less well-targeted interventions. Therefore, improving precision helps to reduce targeting of non-smokers and ensures that the resources allocated to smoking interventions are still used effectively and efficiently.
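The trade-off between accuracy, recall, and precision can be made concrete with a minimal sketch. The counts below are illustrative assumptions chosen to mirror the 1.6% baseline prevalence; they are not results from the study, and the function name is hypothetical.

```python
def classification_metrics(y_true, y_pred):
    """Return (accuracy, recall, precision) for binary labels, 1 = smoker."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # smokers flagged
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # non-smokers left alone
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # non-smokers flagged
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # smokers missed
    accuracy = (tp + tn) / len(pairs)
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    return accuracy, recall, precision


# Illustrative sample: 1000 girls, 16 of whom have ever smoked (1.6%).
y_true = [1] * 16 + [0] * 984

# A classifier that labels everyone a non-smoker.
always_negative = [0] * 1000
acc_a, rec_a, prec_a = classification_metrics(y_true, always_negative)

# A classifier that finds 12 of the 16 smokers but also flags 24 non-smokers.
targeted = [1] * 12 + [0] * 4 + [1] * 24 + [0] * 960
acc_b, rec_b, prec_b = classification_metrics(y_true, targeted)

print(f"always negative: accuracy={acc_a:.3f} recall={rec_a:.2f} precision={prec_a:.2f}")
print(f"targeted:        accuracy={acc_b:.3f} recall={rec_b:.2f} precision={prec_b:.2f}")
```

In this toy case the classifier that never identifies a smoker has the higher accuracy, which is why accuracy alone is a poor guide when the outcome is rare.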
With the objective of optimizing the recall and precision, we evaluated the following ML approaches using the “imbalanced-learn” Python package3 (which is built upon the “scikit-learn” ML library) to determine which method produced the optimal outcome:
RandomOverSampler: This technique works by randomly oversampling the minority class (smokers) in the dataset, i.e., it generates additional random samples of the minority class to balance the class distribution.
ADASYN (Adaptive Synthetic Sampling)4: ADASYN works by generating synthetic samples of the minority class based on the density of the minority class in the feature space and the distance between minority samples. The goal is to increase the number of samples of the minority class in the training data while maintaining a balanced representation of the classes in the feature space.
SMOTE (Synthetic Minority Over-sampling Technique)5: Like ADASYN, SMOTE works by generating synthetic samples of the minority class, but instead of generating samples based on the density and distance of the minority class, it generates samples based on the k-nearest neighbors of the minority samples. In this study, five nearest neighbors (k = 5) were used.
Class weighting: This method assigns higher weights to the minority class, encouraging the algorithm to pay more attention to the minority class and make better predictions.
Hellinger Distance as a Tree Split Criterion6: Hellinger distance is a measure of the difference between two probability distributions. In decision tree algorithms, Hellinger distance can be used as a criterion for splitting nodes in the tree, to determine the most informative split that maximizes the difference between the classes in the feature space. Using Hellinger distance as a split criterion can improve the accuracy of decision tree algorithms in imbalanced classification tasks by giving more attention to the minority class.
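As a rough illustration of the last approach in the list above, the sketch below scores a candidate split on a binary feature by the Hellinger distance between the feature's distribution among smokers and among non-smokers. It is a simplified, assumption-laden sketch (function name and toy data are hypothetical), not the implementation evaluated in the study.

```python
import math


def hellinger_split_score(feature, labels):
    """Hellinger distance between the class-conditional distributions of a
    categorical feature. Larger values mean the split separates classes better."""
    pos = [f for f, y in zip(feature, labels) if y == 1]  # minority class values
    neg = [f for f, y in zip(feature, labels) if y == 0]  # majority class values
    total = 0.0
    for value in set(feature):
        p = pos.count(value) / len(pos)  # share of smokers in this branch
        q = neg.count(value) / len(neg)  # share of non-smokers in this branch
        total += (math.sqrt(p) - math.sqrt(q)) ** 2
    return math.sqrt(total)


# Toy data: 10 smokers followed by 990 non-smokers.
labels = [1] * 10 + [0] * 990
# Feature A is present for 8/10 smokers but only 99/990 non-smokers.
feature_a = [1] * 8 + [0] * 2 + [1] * 99 + [0] * 891
# Feature B is present for half of each class, so it carries no signal.
feature_b = [1] * 5 + [0] * 5 + [1] * 495 + [0] * 495

score_a = hellinger_split_score(feature_a, labels)
score_b = hellinger_split_score(feature_b, labels)
print(f"feature A: {score_a:.3f}, feature B: {score_b:.3f}")
```

Because the score is built only from within-class proportions, it does not change when the ratio of smokers to non-smokers changes, which is the property that makes it attractive for imbalanced data.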
We ultimately selected the SMOTE technique of generating synthetic data for training the classification algorithm because it optimized its performance in terms of recall and precision (Table 1).
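The core idea of SMOTE, interpolating between a minority sample and one of its nearest minority neighbors, can be sketched in a few lines of standard-library Python. This is a simplified illustration under stated assumptions (hypothetical function name, made-up two-dimensional points), not the imbalanced-learn implementation used in the study.

```python
import math
import random


def smote_sketch(minority, n_new, k=5, seed=0):
    """Create n_new synthetic points, each lying on the segment between a
    randomly chosen minority sample and one of its k nearest minority neighbors."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Rank the other minority samples by Euclidean distance to the base point.
        others = [p for p in minority if p is not base]
        neighbors = sorted(others, key=lambda p: math.dist(base, p))[:k]
        neighbor = rng.choice(neighbors)
        gap = rng.random()  # position along the segment, in [0, 1)
        synthetic.append(tuple(b + gap * (n - b) for b, n in zip(base, neighbor)))
    return synthetic


# Hypothetical minority-class samples with two numeric features each.
minority = [(0.0, 1.0), (0.2, 0.9), (0.1, 0.7), (0.4, 1.0), (0.3, 0.8), (0.5, 0.6), (0.6, 0.9)]
synthetic = smote_sketch(minority, n_new=20, k=5)
print(len(synthetic), "synthetic minority samples, e.g.", synthetic[0])
```

Synthetic points are added to the training data only; the held-out data used to measure recall and precision should contain real observations. Note also that interpolating binary-recoded survey items can yield values strictly between 0 and 1.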
After successfully building a high-performing classification model, our goal was to build a profile of girls who smoke. However, ML models are often referred to as "black boxes" because they can be difficult to interpret and comprehend. This means that the model might make accurate predictions, but it's not clear how it arrived at those predictions or why a particular prediction was made.
To gain more transparency, we utilized the technique of feature importance. This method explains predictions made by the model by identifying which features or input data had the most significant impact on those predictions. Essentially, it helps us understand which characteristics have the most influence in predicting whether a girl is a smoker. By calculating feature importance we aimed to generate greater insight into how the model arrived at its decisions and to rank the most important variables. We determined whether an important feature is a risk for or protective against smoking based on the direction of that variable’s association with the smoking outcome in our data.
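The two steps described above, ranking features and then reading off the direction of association, can be sketched as follows. This is a simplified illustration: it scores a single split per binary feature by its Gini impurity decrease, whereas a random forest aggregates such decreases over many splits and trees. The item names and counts are hypothetical and are not study data.

```python
def gini(labels):
    """Gini impurity of a list of binary labels."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2 * p * (1 - p)


def gini_decrease(feature, labels):
    """Impurity reduction from splitting the sample on a binary feature."""
    left = [y for f, y in zip(feature, labels) if f == 0]
    right = [y for f, y in zip(feature, labels) if f == 1]
    n = len(labels)
    weighted = len(left) / n * gini(left) + len(right) / n * gini(right)
    return gini(labels) - weighted


def direction(feature, labels):
    """Outcome prevalence with the feature minus prevalence without it.
    Positive suggests a risk factor, negative a protective factor."""
    with_f = [y for f, y in zip(feature, labels) if f == 1]
    without_f = [y for f, y in zip(feature, labels) if f == 0]
    return sum(with_f) / len(with_f) - sum(without_f) / len(without_f)


# Toy sample: 20 smokers followed by 180 non-smokers.
labels = [1] * 20 + [0] * 180
features = {
    "friends_tried_shisha": [1] * 15 + [0] * 5 + [1] * 18 + [0] * 162,
    "parents_influence": [1] * 5 + [0] * 15 + [1] * 135 + [0] * 45,
    "unrelated_item": [1] * 10 + [0] * 10 + [1] * 90 + [0] * 90,
}

ranked = sorted(features, key=lambda name: gini_decrease(features[name], labels), reverse=True)
for name in ranked:
    print(name, round(gini_decrease(features[name], labels), 4),
          round(direction(features[name], labels), 3))
```

Importance alone says nothing about sign, which is why the direction is checked separately against the observed association.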
To determine the profiles of girls most likely to smoke, we trained our model by using the following combinations of target outcomes and input variables from our survey data:
Model #1: Predicted outcome: reported ever tried smoking at baseline.
Model #2: Predicted outcome: reported having smoked during the previous month at endline.
Model #3: Predicted outcome: among non-smokers at baseline, reported ever tried smoking at endline.
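The three outcome definitions above differ in both the label and the set of girls included. A minimal sketch of how such targets might be derived from panel records is shown below; the record layout and field names are assumptions made for illustration and do not reflect the actual survey variable names.

```python
# Hypothetical panel records, one per girl. Field names are illustrative only.
panel = [
    {"id": 1, "ever_smoked_baseline": 0, "ever_smoked_endline": 0, "smoked_30d_endline": 0},
    {"id": 2, "ever_smoked_baseline": 0, "ever_smoked_endline": 1, "smoked_30d_endline": 1},
    {"id": 3, "ever_smoked_baseline": 1, "ever_smoked_endline": 1, "smoked_30d_endline": 0},
    {"id": 4, "ever_smoked_baseline": 0, "ever_smoked_endline": 0, "smoked_30d_endline": 0},
]


def model1_target(rows):
    """Model #1: ever tried smoking at baseline, all respondents."""
    return {r["id"]: r["ever_smoked_baseline"] for r in rows}


def model2_target(rows):
    """Model #2: smoked during the previous month at endline, all respondents."""
    return {r["id"]: r["smoked_30d_endline"] for r in rows}


def model3_target(rows):
    """Model #3: ever tried smoking at endline, restricted to baseline non-smokers."""
    return {r["id"]: r["ever_smoked_endline"] for r in rows if r["ever_smoked_baseline"] == 0}


targets = {
    "model1": model1_target(panel),
    "model2": model2_target(panel),
    "model3": model3_target(panel),
}
print(targets)
```

The restriction in the third definition is what turns the task into one of predicting initiation: girls who had already tried smoking at baseline are excluded from both classes.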
As a secondary analysis we looked at these same combinations within sub-groups of respondents, including younger (13–15) vs. older girls (16–19) and Accra vs. Kumasi, to support interpretation of findings and identify heterogeneity in predictors. We also analyzed additional combinations of data inputs on these predicted outcomes at different time points for robustness.
Model #1 (Table 2) differentiates between girls based on smoking history prior to the study. The results highlight that girls who had ever tried smoking report different social settings and experiences at present than non-smokers. They are more likely to have regular phone access, recently attended a party, and/or been recently offered alcohol, which suggest more independence and social activity. They are also more likely to think their close friends have tried shisha, that their friends would approve of them smoking shisha, and to say that they would not likely refuse an offer of shisha from friends. Several indicators included in the evaluated program’s theory of change, particularly those related to girls’ confidence in expressing preferences to friends, were not important to classifying likely smokers.
Given that smoking in this population is rare and infrequent, Model #2 (Table 3) identifies which factors are most important to differentiate recent smokers from recent non-smokers. Endline responses were used given very few recent smokers at baseline. Recent smokers are more likely to be older and live in the capital Accra and report lower parental influence than non-smokers, another sign of greater independence. They are more likely to report that smoking is common among their peers and that they would accept an offer to smoke from friends, and less likely to perceive shisha as harmful. The belief that people who say no are admired by others is one of the top factors predictive of being a recent smoker; this may suggest positive views of smoking as demonstrating independence or rebelliousness against norms.
From Model #3 (Table 4) we profiled the small number of girls who were non-smokers at baseline but reported they had tried smoking by endline, in order to identify which factors are most important to set them apart from girls who remained non-smokers during the study period. We see that age and phone access were most important to subsequent smoking initiation, which suggests that these girls may already have had more independence and connectedness than other non-smoking girls. The girls who started smoking within the study period were also more likely to perceive that most social activities for girls their age involve shisha and that most girls their age have tried shisha, which suggests that they were also experiencing or perceiving a different social environment at baseline than other non-smokers. These girls were also more likely to report that their friends have influence on their decisions, and less likely to report parental influence than other non-smokers, which suggests they might be more susceptible to social opportunity and persuasion.
In secondary analyses of Model 3, living in Accra and thinking other girls smoke shisha were the most important factors for smoking initiation among younger girls (13–15) who were non-smokers at baseline. Having friends that influence decisions was less important, and school status was one of the least important, likely because very few younger girls are not in school. Among older girls, having friends or parents who influence decisions were the most important factors. This suggests that the spaces and behaviors girls are exposed to may be more relevant for smoking initiation among younger girls, whereas interpersonal influence matters more among older girls.
We report here a novel application of ML techniques to generate insights around a behavior that is still relatively rare among urban teen girls in Ghana. Creating synthetic data through the SMOTE technique created a more balanced training data set for the classifier, and by opening up the black box of the algorithm we learned which features are most important towards predicting girls likely to be smokers.
While very few girls in Ghana have tried smoking, even fewer report having smoked in the previous 30 days. Models 1 and 2 suggest both groups are more likely to be older and have the freedom to attend social events where they may be exposed to smoking opportunities or behaviors that shape their perceptions of norms among peers. Living in Accra, the more cosmopolitan and diverse of the two study cities, was also common among both groups. However, a perceived higher prevalence of cigarette smoking and willingness to accept cigarettes from friends are distinctively strong predictors of recent smokers. This may reflect that shisha is more commonly experimented with among girls this age and is more likely to be smoked on an infrequent basis (less than monthly). For that reason, most strong predictors of ever having smoked reflect beliefs about shisha, whereas recent smokers who smoke cigarettes, while still the minority, may have particularly strong and distinctive views on cigarette smoking. Other distinctively strong predictors of recent smoking include reporting low influence of parents and expressing admiration for people who are willing to go against trends. Parents in Ghana are generally highly protective of their teenage girls and there is a strong social norm against smoking2; these findings may reflect the absence of such parental figures in smokers’ lives, or their willful rebellion against them.
Model 3 supports that changes in independence and the social environments shaping girls’ opportunities to smoke do in fact precede initiation and experimentation. Cross-sectional surveys can support associations between smoking behavior and social environment; the Model 3 classifier leverages survey data from two timepoints to highlight the pre-existing differences among non-smoking girls that are most predictive of subsequent smoking initiation. Among younger girls, this may reflect different home environments, where girls have more mobility and less parental oversight, putting some girls at more risk of smoking opportunities. These differences may also explain why older age is such a strong predictor of smoking, as adolescent girls are often given more freedom to attend social events and activities as they age. Furthermore, many girls in urban Ghana enroll in a boarding school for their Senior High School (SHS) education where they may gain exposure to girls from less sheltered backgrounds who may influence subsequent opportunities for smoking2.
Such insights about the relative importance of different features to a target behavior can be valuable input for program planning and outreach that is responsive to a specific population and context. This is especially true in the case of programs aimed at smoking prevention, given that smoking in teens is addictive and the literature suggests that people who avoid smoking in adolescence are unlikely to ever startvi. Better understanding of the risk factors for recent and future smoking behavior, and in particular the role of the social environment, can suggest programmatic opportunities for targeting adolescents most at risk or exploring more promising programmatic directions, such as reshaping the environments they are exposed to.
Although the smoker profiles generated through this approach are limited by the inputs we provide the machine, i.e., our survey data, they were largely consistent with our formative qualitative research findings around influences on girls’ behavior in urban Ghana2. The large sample size of this study obtained through cluster randomized sampling is one of its strengths, although there are potential limitations related to collecting data on sensitive topics from adolescent girls, as discussed in that study2. Data collection tools in both cases were informed by insights from the tobacco literature and behavioral science; we believe grounding survey data collection in strong behavioral science theory strengthened the utility of the ML outputs and resulted in better alignment of findings with more in-depth qualitative research techniques. This novel application of ML techniques demonstrates the potential synergism between data science and behavioral science to generate insights about predictors of behavior and highlights the importance of basing quantitative data collection in behavioral theory, especially if opportunities for rich qualitative investigation are limited in other settings.
The data that support the findings of this study are not publicly available due to privacy and ethical restrictions but are available from the corresponding author on reasonable request. A reasonable request would be from a legitimate party with a specific research objective, would not violate the privacy or ethical protections of participants, and would not require additional reformatting or repackaging of the data by the authors.
The baseline survey questions are available here: https://doi.org/10.6084/m9.figshare.24581616 (reference 7)
This project contains the following extended data:
Data are available under the terms of the Creative Commons Attribution 4.0 International license (CC-BY 4.0).
The RandomForestClassifier that was used for classification tasks belongs to the open-source Python scikit-learn library8 available at https://github.com/scikit-learn/scikit-learn, with documentation here: https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html. The feature importances are provided by the "feature_importances_" attribute of scikit-learn's RandomForestClassifier, which offers a way to assess the importance of each feature in making accurate predictions with the random forest model.
All the oversampling methods were performed using the open-source Python imbalanced-learn library3 hosted at https://github.com/scikit-learn-contrib/imbalanced-learn with documentation here: https://imbalanced-learn.org/stable/references/generated/imblearn.over_sampling.RandomOverSampler.html.
We wish to thank the survey participants and JMK Consulting Limited for assisting with data collection. We are also grateful to Good Business and Now Available Africa for their partnership and input into the survey instruments. We also thank Lois Aryee and Jeremy Barofsky of ideas42 for their contributions to the research activities of this project, and Jean Paullin of the Bill & Melinda Gates Foundation for her continued support and guidance on these efforts.
i Stone E, Peters M. Young low and middle-income country (LMIC) smokers—implications for global tobacco control. Transl Lung Cancer Res. 2017 Dec;6(Suppl 1):S44–6.
ii WHO | Regional Office for Africa [Internet]. [cited 2022 Oct 11]. Tobacco Control. Available from: https://www.afro.who.int/health-topics/tobacco-control
iii Agaku IT, Sulentic R, Dragicevic A, Njie G, Jones CK, Odani S, et al. Gender differences in use of cigarette and non-cigarette tobacco products among adolescents aged 13–15 years in 20 African countries. Tob Induc Dis. 2024 Jan 22;22:10.18332/tid/169753.
iv Logo DD, Kyei-Faried S, Oppong FB, Ae-Ngibise KA, Ansong J, Amenyaglo S, et al. Waterpipe use among the youth in Ghana: Lessons from the Global Youth Tobacco Survey (GYTS) 2017. Tob Induc Dis. 2020 May 29;18:47.
v Global Youth Tobacco Survey Collaborative Group. Global Youth Tobacco Survey (GYTS): Core Questionnaire with Optional Questions, Version 1.2. Atlanta, GA: Centers for Disease Control and Prevention, 2014.
vi Tyas SL, Pederson LL. Psychosocial factors related to adolescent smoking: a critical review of the literature. Tob Control. 1998 Dec 1;7(4):409–20.
Gates Open Research version history: Version 1, 03 Jan 24; Version 2 (revision), 20 Nov 24.