<?xml version="1.0" encoding="UTF-8"?><!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Publishing DTD v1.2 20190208//EN" "http://jats.nlm.nih.gov/publishing/1.2/JATS-journalpublishing1.dtd"><article xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink" article-type="other" dtd-version="1.2" xml:lang="en">
    <front>
        <journal-meta>
            <journal-id journal-id-type="pmc">Gates Open Res</journal-id>
            <journal-title-group>
                <journal-title>Gates Open Research</journal-title>
            </journal-title-group>
            <issn pub-type="epub">2572-4754</issn>
            <publisher>
                <publisher-name>F1000 Research Limited</publisher-name>
                <publisher-loc>London, UK</publisher-loc>
            </publisher>
        </journal-meta>
        <article-meta>
            <article-id pub-id-type="doi">10.12688/gatesopenres.14991.2</article-id>
            <article-categories>
                <subj-group subj-group-type="heading">
                    <subject>Research Note</subject>
                </subj-group>
                <subj-group>
                    <subject>Articles</subject>
                </subj-group>
            </article-categories>
            <title-group>
                <article-title>Application of machine learning techniques to profile smoking behavior of adolescent girls in Ghana</article-title>
                <fn-group content-type="pub-status">
                    <fn>
                        <p>[version 2; peer review: 1 approved, 2 approved with reservations]</p>
                    </fn>
                </fn-group>
            </title-group>
            <contrib-group>
                <contrib contrib-type="author" corresp="yes">
                    <name>
                        <surname>Flanagan</surname>
                        <given-names>Sara V.</given-names>
                    </name>
                    <role content-type="http://credit.niso.org/">Data Curation</role>
                    <role content-type="http://credit.niso.org/">Investigation</role>
                    <role content-type="http://credit.niso.org/">Project Administration</role>
                    <role content-type="http://credit.niso.org/">Writing &#x2013; Original Draft Preparation</role>
                    <uri content-type="orcid">https://orcid.org/0000-0003-1707-608X</uri>
                    <xref ref-type="corresp" rid="c1">a</xref>
                    <xref ref-type="aff" rid="a1">1</xref>
                </contrib>
                <contrib contrib-type="author" corresp="no">
                    <name>
                        <surname>Vargas</surname>
                        <given-names>Ariadna</given-names>
                    </name>
                    <role content-type="http://credit.niso.org/">Formal Analysis</role>
                    <role content-type="http://credit.niso.org/">Methodology</role>
                    <role content-type="http://credit.niso.org/">Software</role>
                    <role content-type="http://credit.niso.org/">Writing &#x2013; Original Draft Preparation</role>
                    <xref ref-type="aff" rid="a1">1</xref>
                </contrib>
                <contrib contrib-type="author" corresp="no">
                    <name>
                        <surname>Smith</surname>
                        <given-names>Jana</given-names>
                    </name>
                    <role content-type="http://credit.niso.org/">Conceptualization</role>
                    <role content-type="http://credit.niso.org/">Funding Acquisition</role>
                    <role content-type="http://credit.niso.org/">Supervision</role>
                    <role content-type="http://credit.niso.org/">Writing &#x2013; Review &amp; Editing</role>
                    <uri content-type="orcid">https://orcid.org/0000-0001-6400-8050</uri>
                    <xref ref-type="aff" rid="a1">1</xref>
                </contrib>
                <aff id="a1">
                    <label>1</label>ideas42, New York, New York, 10004, USA</aff>
            </contrib-group>
            <author-notes>
                <corresp id="c1">
                    <label>a</label>
                    <email xlink:href="mailto:sara@ideas42.org">sara@ideas42.org</email>
                </corresp>
                <fn fn-type="conflict">
                    <p>No competing interests were disclosed.</p>
                </fn>
            </author-notes>
            <pub-date pub-type="epub">
                <day>20</day>
                <month>11</month>
                <year>2024</year>
            </pub-date>
            <pub-date pub-type="collection">
                <year>2024</year>
            </pub-date>
            <volume>8</volume>
            <elocation-id>2</elocation-id>
            <history>
                <date date-type="accepted">
                    <day>19</day>
                    <month>11</month>
                    <year>2024</year>
                </date>
            </history>
            <permissions>
                <copyright-statement>Copyright: &#x00a9; 2024 Flanagan SV et al.</copyright-statement>
                <copyright-year>2024</copyright-year>
                <license xlink:href="https://creativecommons.org/licenses/by/4.0/">
                    <license-p>This is an open access article distributed under the terms of the Creative Commons Attribution Licence, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.</license-p>
                </license>
            </permissions>
            <self-uri content-type="pdf" xlink:href="https://gatesopenresearch.org/articles/8-2/pdf"/>
            <abstract>
                <sec>
                    <title>Background</title>
                    <p>Tobacco use trends among adolescents in low- and middle-income countries, and in particular narrowing gender gaps, highlight the need for interventions to prevent and/or reduce tobacco use among adolescent girls. We evaluated a social marketing program in Ghana discouraging tobacco use among adolescent girls and additionally investigated the pathways influencing smoking behaviors to identify programmatic opportunities for impact. Leveraging the data collected through the stepped wedge cluster randomized trial and panel survey of 9000 girls aged 13&#x2013;19, we sought to apply machine learning (ML) techniques to identify the most important variables for predicting initiation of smoking.</p>
                </sec>
                <sec>
                    <title>Methods</title>
                    <p>To identify predictors of smoking initiation, we sought to develop a model that could accurately differentiate smokers from non-smokers, and we evaluated various ML approaches for training classifier algorithms to achieve this. We selected the Synthetic Minority Over-sampling Technique (SMOTE) because it yielded the best recall and precision. We then used feature importance to gain insight into how the model arrived at its decisions and to rank the most important variables for predicting smokers. To explore different dimensions of smoking behavior, including initiation and continuation, we trained our model using several combinations of target outcomes and input variables from the panel survey.</p>
                </sec>
                <sec>
                    <title>Results</title>
                    <p>The resulting features of smokers highlight the importance of girls&#x2019; independence and connectivity, social environment, and peer influence for the likelihood of smoking, and in particular for subsequent initiation. These results were largely consistent with our formative research findings based on qualitative interviews informed by behavioral science.</p>
                </sec>
                <sec>
                    <title>Conclusions</title>
                    <p>This novel application of ML techniques demonstrates how data science approaches can generate new programmatic insights from rigorous evaluation data, especially when data collection is informed by behavioral theory. Such insights about the relative importance of different features can be valuable input for program planning and outreach.</p>
                </sec>
            </abstract>
            <kwd-group kwd-group-type="author">
                <kwd>machine learning</kwd>
                <kwd>synthetic data</kwd>
                <kwd>smoking</kwd>
                <kwd>tobacco</kwd>
                <kwd>adolescent girl</kwd>
                <kwd>algorithm</kwd>
                <kwd>behavioral science</kwd>
            </kwd-group>
            <funding-group>
                <award-group id="fund-1" xlink:href="http://dx.doi.org/10.13039/100000865">
                    <funding-source>Gates Foundation</funding-source>
                    <award-id>INV-005809</award-id>
                </award-group>
                <funding-statement>This work was supported by the Gates Foundation [INV-005809].</funding-statement>
            </funding-group>
        </article-meta>
        <notes>
            <sec sec-type="version-changes">
                <label>Revised</label>
                <title>Amendments from Version 1</title>
                <p>This version is a response to reviewer comments. Additional details and references have been added to the Introduction section. Additionally, we have clarified the secondary analyses performed, added the notable results, and expanded on a few points in the Discussion section.</p>
            </sec>
        </notes>
    </front>
    <body>
        <sec sec-type="intro">
            <title>Introduction</title>
            <p>Machine learning (ML) is a discipline at the intersection of data science and artificial intelligence with a focus on building algorithms to make predictions without requiring explicit programming to do so
                <sup>
                    <xref ref-type="bibr" rid="ref-1">1</xref>
                </sup>. The approach involves building a model using sample training data. These methods are increasingly being applied to support evidence-based decision making across a wide range of fields, including global health. Here we report our experience applying ML to generate programmatic insight from data collected for a rigorous impact evaluation.</p>
            <p>Rising tobacco use among adolescents in low- and middle-income countries (LMICs)
                <sup>
                    <xref ref-type="other" rid="FNi">i</xref>
                </sup> and increasing tobacco use among girls relative to boys
                <sup>
                    <xref ref-type="other" rid="FNii">ii</xref>
                </sup> highlight the need for interventions to prevent and reduce tobacco use among adolescent girls. Non-cigarette tobacco products are contributing to the narrowing gender gap,
                <sup>
                    <xref ref-type="other" rid="FNiii">iii</xref>
                </sup> and surveys suggest shisha use is now more prevalent among girls than boys in Ghana, although with significant regional variation.
                <sup>
                    <xref ref-type="other" rid="FNiv">iv</xref>
                </sup> As external evaluators of a social marketing program in Ghana focused on discouraging tobacco use among adolescent girls, we conducted a stepped wedge cluster randomized trial and panel survey of 9000 girls aged 13&#x2013;19 in select neighborhoods of Accra and Kumasi over twelve months of 2021&#x2013;2022. A secondary objective was to provide additional behavioral insights on the pathways influencing smoking behavior among teenage girls in Ghana and to identify programmatic opportunities for impact. Although a formative research phase with qualitative interviews informed by the behavioral science literature was conducted to support this objective
                <sup>
                    <xref ref-type="bibr" rid="ref-2">2</xref>
                </sup>, we were interested in exploring how recent advances in ML techniques might be applied to our panel survey data to generate additional insights into the predictors of smoking initiation, and in comparing the findings from these two approaches.</p>
            <p>Although our study had a relatively large sample, smoking behavior was quite rare. Only 1.6% of girls reported having ever tried smoking (defined as cigarette or shisha use) at baseline, and 0.1% reported having smoked in the past 30 days. We wanted to think more creatively about ways to profile potential smokers in this population beyond basic demographic slices of the sample. Our objective was first to build an algorithm that would predict which girls are more likely to be smokers and then to identify which factors are most important for making those predictions. In this research note we describe how we explored several ML methods to build a classifier model and then applied an ML explainability method to profile the most important predictors of smokers using different smoking definitions, as well as different data subsamples.</p>
        </sec>
        <sec sec-type="methods">
            <title>Methods</title>
            <sec>
                <title>Data source</title>
                <p>Data collection for the stepped wedge evaluation included a survey of all 9000 girls over four rounds: Baseline (before any implementation), Midline 1 (after the first period of implementation), Midline 2 (after the second), and Endline (after all areas had been activated). Participants aged 13&#x2013;19 were recruited through multi-stage sampling in which neighborhood clusters in Accra and Kumasi were first randomly selected for the study, and then households within each cluster were systematically approached through a community mapping process until the target number of adolescent girls willing to participate in the study was identified. To be considered eligible for the study, girls had to have access to a phone and had to intend to remain at their residence over the subsequent year, to enable enumerators to reach them at future rounds of the panel survey. The questionnaire was informed by the formative research and a smoking-oriented theory of change and comprised about 100 questions covering: background and demographic information; social context, confidence, and self-efficacy; sources of influence on girls&#x2019; decisions and actions; tobacco perceptions and norms; tobacco use, opportunity, and refusal; and program exposure and perceptions. Tobacco use questions were adapted from the Global Youth Tobacco Survey.
                    <sup>
                        <xref ref-type="other" rid="FNv">v</xref>
                    </sup> The questionnaire was reviewed for face validity with the program team and portions were piloted with 396 adolescent girls during the formative research phase
                    <sup>
                        <xref ref-type="bibr" rid="ref-2">2</xref>
                    </sup>. Most responses were either binary or on Likert scales of 1 to 4; the Likert responses were recoded to binary (agree vs. disagree, likely vs. not likely) for this exploratory exercise. Several indices composed of multiple survey items, confirmed to have high internal consistency, served as indicators of intermediate outcomes in the impact evaluation; for this application of ML techniques, however, survey items were retained as individual variables. Ethical review and approval for the evaluation study were provided by Innovations for Poverty Action IRB (protocol #15798) and the Ghana Health Service Ethics Review Committee (GHSERC: 003/11/20). Written informed consent to participate in the study was obtained from participants aged 18 years or older. For younger participants, written parental consent was obtained first, followed by written assent of the adolescent.</p>
            </sec>
            <sec>
                <title>Building the classification model</title>
                <p>Initially, we trained a basic Random Forest classifier on the baseline data to identify girls who had ever smoked; this classifier performed well in terms of accuracy (
                    <xref ref-type="table" rid="T1">Table 1</xref>). Accuracy is defined as the proportion of all girls who are correctly classified as either smokers or non-smokers (
                    <xref ref-type="fig" rid="f1">Figure 1</xref>).</p>
                <table-wrap id="T1" orientation="portrait" position="anchor">
                    <label>Table 1. </label>
                    <caption>
                        <title>Performance of classification algorithm using machine learning techniques.</title>
                    </caption>
                    <table content-type="article-table" frame="hsides">
                        <thead>
                            <tr>
                                <th align="left" colspan="1" rowspan="1" valign="top">Model</th>
                                <th align="left" colspan="1" rowspan="1" valign="top">Accuracy</th>
                                <th align="left" colspan="1" rowspan="1" valign="top">Precision</th>
                                <th align="left" colspan="1" rowspan="1" valign="top">Recall</th>
                            </tr>
                        </thead>
                        <tbody>
                            <tr>
                                <td align="left" colspan="1" rowspan="1" valign="top">Without oversampling</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">0.9594</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">0.2500</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">0.0448</td>
                            </tr>
                            <tr>
                                <td align="left" colspan="1" rowspan="1" valign="top">RandomOverSampler</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">0.9586</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">0.6494</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">0.3117</td>
                            </tr>
                            <tr>
                                <td align="left" colspan="1" rowspan="1" valign="top">ADASYN</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">0.9669</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">0.5445</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">0.2675</td>
                            </tr>
                            <tr>
                                <td align="left" colspan="1" rowspan="1" valign="top">SMOTE</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">0.9901</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">0.7900</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">0.7299</td>
                            </tr>
                            <tr>
                                <td align="left" colspan="1" rowspan="1" valign="top">Class weighting</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">0.9174</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">0.3576</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">0.2595</td>
                            </tr>
                            <tr>
                                <td align="left" colspan="1" rowspan="1" valign="top">Hellinger Distance</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">0.9131</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">0.7413</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">0.6603</td>
                            </tr>
                        </tbody>
                    </table>
                </table-wrap>
                <fig fig-type="figure" id="f1" orientation="portrait" position="float">
                    <label>Figure 1. </label>
                    <caption>
                        <title>Actual vs. predicted smoking status.</title>
                    </caption>
                    <graphic orientation="portrait" position="float" xlink:href="https://gatesopenresearch-files.f1000.com/manuscripts/17698/727127f5-7199-49d0-a7b1-b2af59b1c543_figure1.gif"/>
                </fig>
                <p>To optimize our baseline model, we chose to focus on increasing the recall metric, rather than just accuracy. Recall is a measure of the model&#x2019;s ability to correctly identify actual smokers; a high recall value indicates that the model is able to identify most girls who smoke, while a low recall value suggests that the model is missing a significant number of smokers. Because smoking is rare in our data, optimizing for accuracy alone can bias the model towards classifying most individuals as non-smokers. We were therefore more concerned with correctly identifying those who did smoke, to ensure effective programmatic targeting towards likely smokers, even if it meant that some non-smokers were mistakenly identified as smokers.</p>
                <p>Additionally, we focused on improving the model&#x2019;s precision. If the precision of the model is low, it means that the model is classifying many individuals as smokers even though they are not, which could result in potentially unnecessary or less well-targeted interventions. Therefore, improving precision helps to reduce targeting of non-smokers and ensures that the resources allocated to smoking interventions are still used effectively and efficiently.</p>
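                <p>The relationship between the three metrics can be illustrated with a minimal sketch. The confusion-matrix counts below are hypothetical and are not taken from the study data, and the helper name <monospace>classification_metrics</monospace> is introduced here only for illustration. The sketch shows how a classifier that labels every girl a non-smoker can still score high on accuracy when smokers are rare, while its recall is zero.</p>
                <preformat preformat-type="code"><![CDATA[
```python
def classification_metrics(tp, fp, fn, tn):
    """Return (accuracy, precision, recall) from confusion-matrix counts.

    tp: smokers classified as smokers      fp: non-smokers classified as smokers
    fn: smokers classified as non-smokers  tn: non-smokers classified as non-smokers
    """
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    # Guard against division by zero when nobody is predicted (or is) a smoker.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return accuracy, precision, recall


# Hypothetical sample (assumption): 1000 girls, 16 of whom smoke.
# Classifier A labels everyone a non-smoker.
all_negative = classification_metrics(tp=0, fp=0, fn=16, tn=984)

# Classifier B finds 12 of the 16 smokers and raises 4 false alarms.
targeted = classification_metrics(tp=12, fp=4, fn=4, tn=980)

for name, (acc, prec, rec) in [("all negative", all_negative), ("targeted", targeted)]:
    print(f"{name}: accuracy={acc:.3f} precision={prec:.3f} recall={rec:.3f}")
```
]]></preformat>
                <p>In this hypothetical case both classifiers have accuracy above 0.98, so accuracy alone does not distinguish a classifier that finds no smokers from one that finds most of them.</p>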
                <p>With the objective of optimizing recall and precision, we evaluated the following ML approaches using the &#x201c;imbalanced-learn&#x201d; Python package
                    <sup>
                        <xref ref-type="bibr" rid="ref-3">3</xref>
                    </sup> (which is built upon the &#x201c;scikit-learn&#x201d; ML library) to determine which method produced the optimal outcome:</p>
                <list list-type="bullet">
                    <list-item>
                        <p>
                            <italic toggle="yes">RandomOverSampler</italic>: This technique randomly oversamples the minority class (smokers) in the dataset, i.e., it duplicates randomly chosen minority-class samples (sampling with replacement) to balance the class distribution.</p>
                    </list-item>
                    <list-item>
                        <p>
                            <italic toggle="yes">ADASYN (Adaptive Synthetic Sampling)</italic>
                            <sup>
                                <xref ref-type="bibr" rid="ref-4">4</xref>
                            </sup>: ADASYN generates synthetic samples of the minority class adaptively, creating more of them around minority samples that have many majority-class neighbors in the feature space and are therefore harder to learn. The goal is to increase the number of minority-class samples in the training data so that the classes are more evenly represented.</p>
                    </list-item>
                    <list-item>
                        <p>
                            <italic toggle="yes">SMOTE (Synthetic Minority Over-sampling Technique)</italic>
                            <sup>
                                <xref ref-type="bibr" rid="ref-5">5</xref>
                            </sup>: Like ADASYN, SMOTE generates synthetic samples of the minority class, but it treats all minority samples equally: each synthetic sample is an interpolation between a minority sample and one of its k nearest minority-class neighbors. In this study, k = 5 neighbors were used.</p>
                    </list-item>
                    <list-item>
                        <p>
                            <italic toggle="yes">Class weighting</italic>: This method assigns a higher weight to the minority class, so that misclassifying a minority-class sample is penalized more heavily during training.</p>
                    </list-item>
                    <list-item>
                        <p>
                            <italic toggle="yes">Hellinger Distance as a Tree Split Criterion</italic>
                            <sup>
                                <xref ref-type="bibr" rid="ref-6">6</xref>
                            </sup>: Hellinger distance is a measure of the difference between two probability distributions. In decision tree algorithms, Hellinger distance can be used as a criterion for splitting nodes in the tree, to determine the most informative split that maximizes the difference between the classes in the feature space. Using Hellinger distance as a split criterion can improve the performance of decision tree algorithms in imbalanced classification tasks by giving more attention to the minority class.</p>
                    </list-item>
                </list>
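                <p>The interpolation step at the core of SMOTE can be sketched in a few lines of plain Python. This is an illustrative simplification under stated assumptions, not the imbalanced-learn implementation used in the study: the function name <monospace>smote_sketch</monospace>, the toy two-feature points, and the fixed random seed are all hypothetical, and features are treated as continuous.</p>
                <preformat preformat-type="code"><![CDATA[
```python
import math
import random


def smote_sketch(minority, n_new, k=5, seed=0):
    """Create n_new synthetic points by interpolating between minority points.

    minority: list of equal-length numeric tuples (minority-class samples only).
    k:        number of nearest minority-class neighbours considered per point.
    """
    if len(minority) < 2:
        raise ValueError("need at least two minority samples to interpolate")
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        # Pick a minority sample at random.
        i = rng.randrange(len(minority))
        base = minority[i]
        # Find its k nearest minority-class neighbours (excluding itself).
        others = [p for j, p in enumerate(minority) if j != i]
        neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
        neighbour = rng.choice(neighbours)
        # Place the new point at a random position on the segment base -> neighbour.
        gap = rng.random()
        synthetic.append(tuple(b + gap * (n - b) for b, n in zip(base, neighbour)))
    return synthetic


# Toy minority class (assumption): six smokers described by two features in [0, 1].
smokers = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0), (0.5, 0.5), (0.2, 0.8)]
synthetic = smote_sketch(smokers, n_new=4, k=5)
print(len(synthetic), "synthetic samples generated")
```
]]></preformat>
                <p>Because every synthetic point lies on a line segment between two real minority samples, the new points stay within the region already occupied by the minority class. In imbalanced-learn, the corresponding resampling is typically invoked as <monospace>SMOTE(k_neighbors=5).fit_resample(X, y)</monospace>; whether the authors used exactly this call is an assumption.</p>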
                <p>We ultimately selected the SMOTE technique of generating synthetic data for training the classification algorithm because it yielded the highest recall and precision among the approaches evaluated (
                    <xref ref-type="table" rid="T1">Table 1</xref>).</p>
            </sec>
            <sec>
                <title>Generating smoker profiles</title>
                <p>After building a high-performing classification model, we sought to build a profile of girls who smoke. However, ML models are often referred to as &#x201c;black boxes&#x201d; because they can be difficult to interpret: a model might make accurate predictions without it being clear how it arrived at them or why a particular prediction was made.</p>
                <p>To gain more transparency, we used the technique of feature importance. This method explains the model&#x2019;s predictions by identifying which features, or input variables, had the greatest impact on them. Essentially, it helps us understand which characteristics have the most influence in predicting whether a girl is a smoker. By calculating feature importance we aimed to generate greater insight into how the model arrived at its decisions and to rank the most important variables. We determined whether an important feature is a risk factor for or protective against smoking based on the direction of that variable&#x2019;s association with the smoking outcome in our data.</p>
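                <p>Feature importance scores rank variables but carry no sign, which is why the direction of association is determined separately. One plausible way to assign that direction for a binary variable is to compare the smoking rate among girls with and without the characteristic, as in the sketch below. The exact procedure used in the study is not specified, so this comparison is an assumption, and the function name <monospace>association_direction</monospace> and the toy responses are hypothetical.</p>
                <preformat preformat-type="code"><![CDATA[
```python
def association_direction(feature, outcome):
    """Return '+' if the outcome is more common when feature == 1,
    '-' if it is less common, and '0' if the rates are equal or undefined.

    feature, outcome: equal-length lists of 0/1 values, one entry per respondent.
    """
    with_feature = [o for f, o in zip(feature, outcome) if f == 1]
    without_feature = [o for f, o in zip(feature, outcome) if f == 0]
    if not with_feature or not without_feature:
        return "0"  # direction cannot be determined from one group alone
    rate_with = sum(with_feature) / len(with_feature)
    rate_without = sum(without_feature) / len(without_feature)
    if rate_with > rate_without:
        return "+"
    if rate_with < rate_without:
        return "-"
    return "0"


# Toy responses from eight hypothetical respondents (not study data).
ever_smoked = [1, 0, 1, 0, 0, 0, 1, 0]
went_to_party = [1, 1, 1, 0, 0, 0, 0, 0]
would_refuse_shisha = [0, 0, 1, 1, 1, 1, 0, 1]

party_sign = association_direction(went_to_party, ever_smoked)
refuse_sign = association_direction(would_refuse_shisha, ever_smoked)
print("Went to a party:", party_sign)
print("Would refuse shisha:", refuse_sign)
```
]]></preformat>
                <p>A sign obtained this way describes an unadjusted association in the data and should not be read as a causal effect.</p>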
                <p>To determine the profiles of girls most likely to smoke, we trained our model by using the following combinations of target outcomes and input variables from our survey data:</p>
                <list list-type="bullet">
                    <list-item>
                        <label/>
                        <p>Model #1: Predicted outcome: reported ever tried smoking at baseline.</p>
                        <list list-type="bullet">
                            <list-item>
                                <label/>
                                <p>Training inputs: baseline survey responses</p>
                            </list-item>
                        </list>
                    </list-item>
                    <list-item>
                        <label/>
                        <p>Model #2: Predicted outcome: reported having smoked during the previous month at endline.</p>
                        <list list-type="bullet">
                            <list-item>
                                <label/>
                                <p>Training inputs: endline survey responses</p>
                            </list-item>
                        </list>
                    </list-item>
                    <list-item>
                        <label/>
                        <p>Model #3: Predicted outcome: among non-smokers at baseline, reported ever tried smoking at endline.</p>
                        <list list-type="bullet">
                            <list-item>
                                <label/>
                                <p>Training inputs: baseline survey responses</p>
                            </list-item>
                        </list>
                    </list-item>
                </list>
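                <p>The three combinations above differ only in which survey round supplies the inputs, which outcome is predicted, and which respondents are included. A minimal sketch of how such training sets might be assembled from panel records follows. The record layout and all field names (<monospace>baseline</monospace>, <monospace>endline</monospace>, <monospace>ever_smoked_baseline</monospace>, and so on) are hypothetical, and treating &#x201c;non-smokers at baseline&#x201d; as girls who had never tried smoking at baseline is an assumption.</p>
                <preformat preformat-type="code"><![CDATA[
```python
def build_datasets(records):
    """Return (inputs, target) pairs for Models #1, #2 and #3.

    Each record is a dict with hypothetical keys:
      'baseline' / 'endline'   : dicts of survey responses for that round
      'ever_smoked_baseline'   : 1 if ever tried smoking at baseline, else 0
      'smoked_30d_endline'     : 1 if smoked in the previous month at endline, else 0
      'ever_smoked_endline'    : 1 if ever tried smoking at endline, else 0
    """
    # Model #1: baseline responses -> ever tried smoking at baseline.
    model1 = [(r["baseline"], r["ever_smoked_baseline"]) for r in records]
    # Model #2: endline responses -> smoked during the previous month at endline.
    model2 = [(r["endline"], r["smoked_30d_endline"]) for r in records]
    # Model #3: baseline responses -> ever tried smoking at endline,
    # restricted to girls who had not tried smoking at baseline.
    model3 = [
        (r["baseline"], r["ever_smoked_endline"])
        for r in records
        if r["ever_smoked_baseline"] == 0
    ]
    return model1, model2, model3


# Three hypothetical respondents (not study data).
records = [
    {"baseline": {"party": 1}, "endline": {"party": 1},
     "ever_smoked_baseline": 1, "smoked_30d_endline": 1, "ever_smoked_endline": 1},
    {"baseline": {"party": 0}, "endline": {"party": 1},
     "ever_smoked_baseline": 0, "smoked_30d_endline": 0, "ever_smoked_endline": 1},
    {"baseline": {"party": 0}, "endline": {"party": 0},
     "ever_smoked_baseline": 0, "smoked_30d_endline": 0, "ever_smoked_endline": 0},
]
model1, model2, model3 = build_datasets(records)
print(len(model1), len(model2), len(model3))
```
]]></preformat>
                <p>Restricting Model #3 to baseline non-smokers means its positive cases are girls who initiated smoking during the study period, which is what distinguishes it from Model #1.</p>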
                <p>As a secondary analysis, we looked at these same combinations within sub-groups of respondents, including younger (13&#x2013;15) vs. older (16&#x2013;19) girls and Accra vs. Kumasi, to support interpretation of findings and identify heterogeneity in predictors. We also analyzed additional combinations of data inputs on these predicted outcomes at different time points for robustness.</p>
            </sec>
        </sec>
        <sec sec-type="results">
            <title>Results</title>
            <p>Model #1 (
                <xref ref-type="table" rid="T2">Table 2</xref>) differentiates between girls based on smoking history prior to the study. The results highlight that girls who had ever tried smoking currently report different social settings and experiences than non-smokers. They are more likely to have regular phone access, to have recently attended a party, and/or to have recently been offered alcohol, which suggests more independence and social activity. They are also more likely to think their close friends have tried shisha, to think that their friends would approve of them smoking shisha, and to say that they would not likely refuse an offer of shisha from friends. Several indicators included in the evaluated program&#x2019;s theory of change, particularly those related to girls&#x2019; confidence in expressing preferences to friends, were not important for classifying likely smokers.</p>
            <table-wrap id="T2" orientation="portrait" position="anchor">
                <label>Table 2. </label>
                <caption>
                    <title>Model 1, reported ever smoking at baseline.</title>
                </caption>
                <table content-type="article-table" frame="hsides">
                    <thead>
                        <tr>
                            <th align="left" colspan="1" rowspan="1" valign="top">Variable Input</th>
                            <th align="left" colspan="1" rowspan="1" valign="top">Importance</th>
                        </tr>
                    </thead>
                    <tbody>
                        <tr>
                            <td align="left" colspan="1" rowspan="1" valign="top">Went to a party last 30 days (+)</td>
                            <td align="center" colspan="1" rowspan="1">0.128</td>
                        </tr>
                        <tr>
                            <td align="left" colspan="1" rowspan="1" valign="top">Thinks most friends smoke shisha (+)</td>
                            <td align="center" colspan="1" rowspan="1">0.084</td>
                        </tr>
                        <tr>
                            <td align="left" colspan="1" rowspan="1" valign="top">Has regular access to phone (+)</td>
                            <td align="center" colspan="1" rowspan="1">0.069</td>
                        </tr>
                        <tr>
                            <td align="left" colspan="1" rowspan="1" valign="top">Would refuse shisha from friends (-)</td>
                            <td align="center" colspan="1" rowspan="1">0.053</td>
                        </tr>
                        <tr>
                            <td align="left" colspan="1" rowspan="1" valign="top">Was offered alcohol last 30 days (+)</td>
                            <td align="center" colspan="1" rowspan="1">0.043</td>
                        </tr>
                        <tr>
                            <td align="left" colspan="1" rowspan="1" valign="top">Friends would approve smoking shisha (+)</td>
                            <td align="center" colspan="1" rowspan="1">0.040</td>
                        </tr>
                        <tr>
                            <td align="left" colspan="1" rowspan="1" valign="top">Thinks most other girls smoke shisha (+)</td>
                            <td align="center" colspan="1" rowspan="1">0.040</td>
                        </tr>
                        <tr>
                            <td align="left" colspan="1" rowspan="1" valign="top">Lives in Accra (+)</td>
                            <td align="center" colspan="1" rowspan="1">0.034</td>
                        </tr>
                        <tr>
                            <td align="left" colspan="1" rowspan="1" valign="top">Age (+)</td>
                            <td align="center" colspan="1" rowspan="1">0.033</td>
                        </tr>
                        <tr>
                            <td align="left" colspan="1" rowspan="1" valign="top">Friends would shun if smoked shisha (-)</td>
                            <td align="center" colspan="1" rowspan="1">0.032</td>
                        </tr>
                        <tr>
                            <td align="left" colspan="1" rowspan="1" valign="top">Friends would shun if smoked cigarette (-)</td>
                            <td align="center" colspan="1" rowspan="1">0.032</td>
                        </tr>
                        <tr>
                            <td align="left" colspan="1" rowspan="1" valign="top">Went to a bar last 30 days (+)</td>
                            <td align="center" colspan="1" rowspan="1">0.028</td>
                        </tr>
                        <tr>
                            <td align="left" colspan="1" rowspan="1" valign="top">Would refuse shisha from boyfriend (-)</td>
                            <td align="center" colspan="1" rowspan="1">0.024</td>
                        </tr>
                        <tr>
                            <td align="left" colspan="1" rowspan="1" valign="top">Lives with both parents (-)</td>
                            <td align="center" colspan="1" rowspan="1">0.022</td>
                        </tr>
                        <tr>
                            <td align="left" colspan="1" rowspan="1" valign="top">Believes smoking shisha is harmful (-)</td>
                            <td align="center" colspan="1" rowspan="1">0.022</td>
                        </tr>
                        <tr>
                            <td align="left" colspan="1" rowspan="1" valign="top">Friends would approve smoking (+)</td>
                            <td align="center" colspan="1" rowspan="1">0.021</td>
                        </tr>
                        <tr>
                            <td align="left" colspan="1" rowspan="1" valign="top">In school (-)</td>
                            <td align="center" colspan="1" rowspan="1">0.019</td>
                        </tr>
                        <tr>
                            <td align="left" colspan="1" rowspan="1" valign="top">Parents influence their decisions (-)</td>
                            <td align="center" colspan="1" rowspan="1">0.017</td>
                        </tr>
                        <tr>
                            <td align="left" colspan="1" rowspan="1" valign="top">Has close circle of friends (-)</td>
                            <td align="center" colspan="1" rowspan="1">0.017</td>
                        </tr>
                        <tr>
                            <td align="left" colspan="1" rowspan="1" valign="top">Believes not smoking shisha is important (-)</td>
                            <td align="center" colspan="1" rowspan="1">0.016</td>
                        </tr>
                        <tr>
                            <td align="left" colspan="1" rowspan="1" valign="top">Thinks people who say no are admired (+)</td>
                            <td align="center" colspan="1" rowspan="1">0.014</td>
                        </tr>
                        <tr>
                            <td align="left" colspan="1" rowspan="1" valign="top">Friends would shun if said no (+)</td>
                            <td align="center" colspan="1" rowspan="1">0.013</td>
                        </tr>
                        <tr>
                            <td align="left" colspan="1" rowspan="1" valign="top">Boyfriend would shun if said no (+)</td>
                            <td align="center" colspan="1" rowspan="1">0.013</td>
                        </tr>
                        <tr>
                            <td align="left" colspan="1" rowspan="1" valign="top">Friends influence their decisions (+)</td>
                            <td align="center" colspan="1" rowspan="1">0.012</td>
                        </tr>
                        <tr>
                            <td align="left" colspan="1" rowspan="1" valign="top">Others think smoking cigarettes is guy-guy (+)</td>
                            <td align="center" colspan="1" rowspan="1">0.012</td>
                        </tr>
                        <tr>
                            <td align="left" colspan="1" rowspan="1" valign="top">Most activities involve smoking shisha (+)</td>
                            <td align="center" colspan="1" rowspan="1">0.012</td>
                        </tr>
                        <tr>
                            <td align="left" colspan="1" rowspan="1" valign="top">Others think smoking shisha is guy-guy (+)</td>
                            <td align="center" colspan="1" rowspan="1">0.011</td>
                        </tr>
                        <tr>
                            <td align="left" colspan="1" rowspan="1" valign="top">Girls like other confident girls (-)</td>
                            <td align="center" colspan="1" rowspan="1">0.011</td>
                        </tr>
                        <tr>
                            <td align="left" colspan="1" rowspan="1" valign="top">Boys like confident girls (-)</td>
                            <td align="center" colspan="1" rowspan="1">0.010</td>
                        </tr>
                        <tr>
                            <td align="left" colspan="1" rowspan="1" valign="top">Most activities involve smoking cigarettes (+)</td>
                            <td align="center" colspan="1" rowspan="1">0.010</td>
                        </tr>
                        <tr>
                            <td align="left" colspan="1" rowspan="1" valign="top">Believes smoking cigarettes is harmful (-)</td>
                            <td align="center" colspan="1" rowspan="1">0.009</td>
                        </tr>
                        <tr>
                            <td align="left" colspan="1" rowspan="1" valign="top">Boys like girls who smoke shisha (+)</td>
                            <td align="center" colspan="1" rowspan="1">0.009</td>
                        </tr>
                        <tr>
                            <td align="left" colspan="1" rowspan="1" valign="top">Would refuse cigarette from friends (-)</td>
                            <td align="center" colspan="1" rowspan="1">0.007</td>
                        </tr>
                        <tr>
                            <td align="left" colspan="1" rowspan="1" valign="top">Feels comfortable expressing likes/dislikes (-)</td>
                            <td align="center" colspan="1" rowspan="1">0.007</td>
                        </tr>
                        <tr>
                            <td align="left" colspan="1" rowspan="1" valign="top">Stands firm with friends (-)</td>
                            <td align="center" colspan="1" rowspan="1">0.006</td>
                        </tr>
                        <tr>
                            <td align="left" colspan="1" rowspan="1" valign="top">Would refuse cigarette from boyfriend (-)</td>
                            <td align="center" colspan="1" rowspan="1">0.005</td>
                        </tr>
                        <tr>
                            <td align="left" colspan="1" rowspan="1" valign="top">Thinks most other girls smoke cigarettes (+)</td>
                            <td align="center" colspan="1" rowspan="1">0.005</td>
                        </tr>
                        <tr>
                            <td align="left" colspan="1" rowspan="1" valign="top">Boys like girls who smoke cigarettes (+)</td>
                            <td align="center" colspan="1" rowspan="1">0.005</td>
                        </tr>
                        <tr>
                            <td align="left" colspan="1" rowspan="1" valign="top">Can tell friends if uncomfortable (+)</td>
                            <td align="center" colspan="1" rowspan="1">0.005</td>
                        </tr>
                        <tr>
                            <td align="left" colspan="1" rowspan="1" valign="top">Believes not smoking cigarettes is important (-)</td>
                            <td align="center" colspan="1" rowspan="1">0.003</td>
                        </tr>
                        <tr>
                            <td align="left" colspan="1" rowspan="1" valign="top">Thinks most friends smoke cigarettes (+)</td>
                            <td align="center" colspan="1" rowspan="1">0.003</td>
                        </tr>
                        <tr>
                            <td align="left" colspan="1" rowspan="1" valign="top">Other girls influence their decisions (+)</td>
                            <td align="center" colspan="1" rowspan="1">0.001</td>
                        </tr>
                    </tbody>
                </table>
                <table-wrap-foot>
                    <fn>
                        <p id="TFN1">Note: +/- indicate whether that variable is associated with increased or decreased risk of smoking</p>
                    </fn>
                </table-wrap-foot>
            </table-wrap>
            <p>Given that smoking in this population is rare and infrequent, Model #2 (
                <xref ref-type="table" rid="T3">Table 3</xref>) identifies which factors are most important for differentiating recent smokers from recent non-smokers. Endline responses were used because there were very few recent smokers at baseline. Recent smokers are more likely to be older, to live in the capital, Accra, and to report lower parental influence than non-smokers, another sign of greater independence. They are more likely to report that smoking is common among their peers and that they would accept an offer to smoke from friends, and they are less likely to perceive shisha as harmful. The belief that people who say no are admired by others is one of the top factors predictive of being a recent smoker; this may suggest positive views of smoking as demonstrating independence or rebelliousness against norms.</p>
            <table-wrap id="T3" orientation="portrait" position="anchor">
                <label>Table 3. </label>
                <caption>
                    <title>Model 2, recent smoker at endline (abbreviated).</title>
                </caption>
                <table content-type="article-table" frame="hsides">
                    <thead>
                        <tr>
                            <th align="left" colspan="1" rowspan="1" valign="top">Variable Input</th>
                            <th align="left" colspan="1" rowspan="1" valign="top">Importance</th>
                        </tr>
                    </thead>
                    <tbody>
                        <tr>
                            <td align="left" colspan="1" rowspan="1" valign="top">Age (+)</td>
                            <td align="center" colspan="1" rowspan="1">0.120</td>
                        </tr>
                        <tr>
                            <td align="left" colspan="1" rowspan="1" valign="top">Thinks most other girls smoke cigarettes (+)</td>
                            <td align="center" colspan="1" rowspan="1">0.108</td>
                        </tr>
                        <tr>
                            <td align="left" colspan="1" rowspan="1" valign="top">Lives in Accra (+)</td>
                            <td align="center" colspan="1" rowspan="1">0.106</td>
                        </tr>
                        <tr>
                            <td align="left" colspan="1" rowspan="1" valign="top">Would refuse cigarette from friends (-)</td>
                            <td align="center" colspan="1" rowspan="1">0.090</td>
                        </tr>
                        <tr>
                            <td align="left" colspan="1" rowspan="1" valign="top">People who say no are admired (+)</td>
                            <td align="center" colspan="1" rowspan="1">0.079</td>
                        </tr>
                        <tr>
                            <td align="left" colspan="1" rowspan="1" valign="top">Parents influence decisions (-)</td>
                            <td align="center" colspan="1" rowspan="1">0.060</td>
                        </tr>
                        <tr>
                            <td align="left" colspan="1" rowspan="1" valign="top">Thinks smoking shisha is harmful (-)</td>
                            <td align="center" colspan="1" rowspan="1">0.045</td>
                        </tr>
                        <tr>
                            <td align="left" colspan="1" rowspan="1" valign="top">Went to a party last 30 days (+)</td>
                            <td align="center" colspan="1" rowspan="1">0.035</td>
                        </tr>
                        <tr>
                            <td align="left" colspan="1" rowspan="1" valign="top">Friends would shun if smoked shisha (-)</td>
                            <td align="center" colspan="1" rowspan="1">0.031</td>
                        </tr>
                        <tr>
                            <td align="left" colspan="1" rowspan="1" valign="top">Has close circle of friends (-)</td>
                            <td align="center" colspan="1" rowspan="1">0.030</td>
                        </tr>
                        <tr>
                            <td align="left" colspan="1" rowspan="1" valign="top">Thinks most friends smoke shisha (+)</td>
                            <td align="center" colspan="1" rowspan="1">0.029</td>
                        </tr>
                        <tr>
                            <td align="left" colspan="1" rowspan="1" valign="top">Lives with both parents (-)</td>
                            <td align="center" colspan="1" rowspan="1">0.027</td>
                        </tr>
                        <tr>
                            <td align="left" colspan="1" rowspan="1" valign="top">Most activities involve smoking shisha (+)</td>
                            <td align="center" colspan="1" rowspan="1">0.022</td>
                        </tr>
                        <tr>
                            <td align="left" colspan="1" rowspan="1" valign="top">Most activities involve smoking cigarettes (+)</td>
                            <td align="center" colspan="1" rowspan="1">0.018</td>
                        </tr>
                    </tbody>
                </table>
                <table-wrap-foot>
                    <fn>
                        <p id="TFN2">Note: +/- indicate whether that variable is associated with increased or decreased risk of smoking</p>
                    </fn>
                </table-wrap-foot>
            </table-wrap>
            <p>From Model #3 (
                <xref ref-type="table" rid="T4">Table 4</xref>) we profiled the small number of girls who were non-smokers at baseline but reported having tried smoking by endline, in order to identify which factors most set them apart from girls who remained non-smokers during the study period. Age and phone access were most important for subsequent smoking initiation, which suggests that these girls may already have had more independence and connectedness than other non-smoking girls. The girls who started smoking within the study period were also more likely to perceive that most social activities for girls their age involve shisha and that most girls their age have tried shisha, which suggests that they were already experiencing or perceiving a different social environment at baseline than other non-smokers. These girls are also more likely to report that their friends influence their decisions, and less likely to report parental influence, than other non-smokers, which suggests they might be more susceptible to social opportunity and persuasion.</p>
            <table-wrap id="T4" orientation="portrait" position="anchor">
                <label>Table 4. </label>
                <caption>
                    <title>Model 3, non-smoker at baseline, smoked by endline (abbreviated).</title>
                </caption>
                <table content-type="article-table" frame="hsides">
                    <thead>
                        <tr>
                            <th align="left" colspan="1" rowspan="1" valign="top">Variable Input</th>
                            <th align="left" colspan="1" rowspan="1" valign="top">Importance</th>
                        </tr>
                    </thead>
                    <tbody>
                        <tr>
                            <td align="left" colspan="1" rowspan="1" valign="top">Age (+)</td>
                            <td align="center" colspan="1" rowspan="1">0.129</td>
                        </tr>
                        <tr>
                            <td align="left" colspan="1" rowspan="1" valign="top">Has regular access to phone (+)</td>
                            <td align="center" colspan="1" rowspan="1">0.079</td>
                        </tr>
                        <tr>
                            <td align="left" colspan="1" rowspan="1" valign="top">Friends influence decisions (+)</td>
                            <td align="center" colspan="1" rowspan="1">0.053</td>
                        </tr>
                        <tr>
                            <td align="left" colspan="1" rowspan="1" valign="top">In school (-)</td>
                            <td align="center" colspan="1" rowspan="1">0.050</td>
                        </tr>
                        <tr>
                            <td align="left" colspan="1" rowspan="1" valign="top">Lives with both parents (-)</td>
                            <td align="center" colspan="1" rowspan="1">0.041</td>
                        </tr>
                        <tr>
                            <td align="left" colspan="1" rowspan="1" valign="top">Most activities involve smoking shisha (+)</td>
                            <td align="center" colspan="1" rowspan="1">0.037</td>
                        </tr>
                        <tr>
                            <td align="left" colspan="1" rowspan="1" valign="top">Parents influence decisions (-)</td>
                            <td align="center" colspan="1" rowspan="1">0.035</td>
                        </tr>
                        <tr>
                            <td align="left" colspan="1" rowspan="1" valign="top">Thinks other girls smoke shisha (+)</td>
                            <td align="center" colspan="1" rowspan="1">0.033</td>
                        </tr>
                        <tr>
                            <td align="left" colspan="1" rowspan="1" valign="top">People who say no are admired (+)</td>
                            <td align="center" colspan="1" rowspan="1">0.031</td>
                        </tr>
                        <tr>
                            <td align="left" colspan="1" rowspan="1" valign="top">Has close circle of friends (-)</td>
                            <td align="center" colspan="1" rowspan="1">0.030</td>
                        </tr>
                        <tr>
                            <td align="left" colspan="1" rowspan="1" valign="top">Lives in Accra (+)</td>
                            <td align="center" colspan="1" rowspan="1">0.028</td>
                        </tr>
                        <tr>
                            <td align="left" colspan="1" rowspan="1" valign="top">Most activities involve smoking cigarettes (+)</td>
                            <td align="center" colspan="1" rowspan="1">0.028</td>
                        </tr>
                        <tr>
                            <td align="left" colspan="1" rowspan="1" valign="top">Friends would shun if said no (+)</td>
                            <td align="center" colspan="1" rowspan="1">0.027</td>
                        </tr>
                    </tbody>
                </table>
                <table-wrap-foot>
                    <fn>
                        <p id="TFN3">Note: +/- indicate whether that variable is associated with increased or decreased risk of smoking</p>
                    </fn>
                </table-wrap-foot>
            </table-wrap>
            <p>In secondary analyses of Model 3, living in Accra and thinking other girls smoke shisha were the most important factors for smoking initiation among younger girls (13&#x2013;15) who were non-smokers at baseline. Having friends who influence decisions was less important, and school status was one of the least important factors, likely because very few younger girls are out of school. Among older girls, the influence of friends and of parents on decisions were the most important factors. This suggests that the spaces and behaviors girls are exposed to may be more relevant for smoking initiation among younger girls, whereas interpersonal influence matters more among older girls.</p>
        </sec>
        <sec sec-type="conclusions | discussion">
            <title>Conclusion/discussion</title>
            <p>We report here a novel application of ML techniques to generate insights around a behavior that is still relatively rare among urban teen girls in Ghana. Generating synthetic data with the SMOTE technique produced a more balanced training data set for the classifier, and by opening up the black box of the algorithm we learned which features are most important for predicting which girls are likely to be smokers.</p>
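            <p>As a rough illustration of the oversampling idea, the following minimal Python sketch generates synthetic minority-class records by interpolating between a record and one of its nearest minority-class neighbours. It is a simplified stand-in and not the imbalanced-learn implementation used in the study; the function name smote_like_oversample, the parameters k and seed, and the toy feature values are hypothetical.</p>
            <code language="python" xml:space="preserve">
```python
import math
import random


def smote_like_oversample(minority, n_new, k=5, seed=0):
    """Return n_new synthetic rows interpolated between minority-class rows
    and their nearest minority-class neighbours (simplified SMOTE-style sketch)."""
    if 2 > len(minority):
        raise ValueError("need at least two minority-class rows")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        base = minority[i]
        # Rank the other minority rows by Euclidean distance to the base row.
        others = [j for j in range(len(minority)) if j != i]
        others.sort(key=lambda j: math.dist(base, minority[j]))
        neighbour = minority[rng.choice(others[:k])]
        # Place the synthetic row at a random point on the segment base -> neighbour.
        gap = rng.random()
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic


# Hypothetical minority-class rows: [age, has_phone_access].
minority_rows = [[16.0, 1.0], [17.0, 1.0], [18.0, 0.0], [17.5, 1.0]]
new_rows = smote_like_oversample(minority_rows, n_new=6, k=2)
print(len(new_rows))
```
</code>
            <p>Because interpolation yields fractional values, binary or categorical survey items would need rounding or a variant designed for categorical features; how such items were handled in the study is not specified here.</p>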
            <p>While very few girls in Ghana have tried smoking, even fewer report having smoked in the previous 30 days. Models 1 and 2 suggest both groups are more likely to be older and have the freedom to attend social events where they may be exposed to smoking opportunities or behaviors that shape their perceptions of norms among peers. Living in Accra, the more cosmopolitan and diverse of the two study cities, was also common among both groups. However, a perceived higher prevalence of cigarette smoking and willingness to accept cigarettes from friends are distinctively strong predictors of recent smokers. This may reflect that shisha is more commonly experimented with among girls this age and is more likely to be smoked on an infrequent basis (less than monthly). For that reason, most strong predictors of ever having smoked reflect beliefs about shisha, whereas recent smokers who smoke cigarettes, while still the minority, may have particularly strong and distinctive views on cigarette smoking. Other distinctively strong predictors of recent smoking include reporting low influence of parents and expressing admiration for people who are willing to go against trends. Parents in Ghana are generally highly protective of their teenage girls and there is a strong social norm against smoking
                <sup>
                    <xref ref-type="bibr" rid="ref-2">2</xref>
                </sup>; these findings may reflect the absence of such parental figures in smokers&#x2019; lives, or their willful rebellion against them.</p>
            <p>Model 3 supports the view that changes in independence and in the social environments shaping girls&#x2019; opportunities to smoke do in fact precede initiation and experimentation. Cross-sectional surveys can support associations between smoking behavior and social environment; the Model 3 classifier leverages survey data from two timepoints to highlight the pre-existing differences among non-smoking girls that are most predictive of subsequent smoking initiation. Among younger girls, this may reflect different home environments, where some girls have more mobility and less parental oversight and are therefore exposed to more opportunities to smoke. These differences may also explain why older age is such a strong predictor of smoking, as adolescent girls are often given more freedom to attend social events and activities as they age. Furthermore, many girls in urban Ghana enroll in a boarding school for their Senior High School (SHS) education, where they may gain exposure to girls from less sheltered backgrounds who may influence subsequent opportunities for smoking
                <sup>
                    <xref ref-type="bibr" rid="ref-2">2</xref>
                </sup>.</p>
            <p>Such insights about the relative importance of different features to a target behavior can be valuable input for program planning and outreach that is responsive to a specific population and context. This is especially true in the case of programs aimed at smoking prevention, given that smoking in teens is addictive and the literature suggests that people who avoid smoking in adolescence are unlikely to ever start
                <sup>
                    <xref ref-type="other" rid="FNvi">vi</xref>
                </sup>. Better understanding of the risk factors for recent and future smoking behavior, and in particular the role of the social environment, can suggest programmatic opportunities for targeting adolescents most at risk or exploring more promising programmatic directions, such as reshaping the environments they are exposed to.</p>
            <p>Although the smoker profiles generated through this approach are limited by the inputs we provide the machine, i.e., our survey data, they were largely consistent with our formative qualitative research findings around influences on girls&#x2019; behavior in urban Ghana
                <sup>
                    <xref ref-type="bibr" rid="ref-2">2</xref>
                </sup>. The large sample size of this study obtained through cluster randomized sampling is one of its strengths, although there are potential limitations related to collecting data on sensitive topics from adolescent girls, as discussed in that study
                <sup>
                    <xref ref-type="bibr" rid="ref-2">2</xref>
                </sup>. Data collection tools in both cases were informed by insights from the tobacco literature and behavioral science; we believe grounding survey data collection in strong behavioral science theory strengthened the utility of the ML outputs and resulted in better alignment of findings with more in-depth qualitative research techniques. This novel application of ML techniques demonstrates the potential synergism between data science and behavioral science to generate insights about predictors of behavior and highlights the importance of basing quantitative data collection in behavioral theory, especially if opportunities for rich qualitative investigation are limited in other settings.</p>
        </sec>
    </body>
    <back>
        <sec sec-type="data-availability">
            <title>Data availability</title>
            <sec>
                <title>Source data</title>
                <p>The data that support the findings of this study are not publicly available due to privacy and ethical restrictions but are available from the corresponding author on reasonable request. A reasonable request would be from a legitimate party with a specific research objective, would not violate the privacy or ethical protections of participants, and would not require additional reformatting or repackaging of the data by the authors.</p>
            </sec>
            <sec>
                <title>Extended data</title>
                <p>The baseline survey questions are available here: 
                    <ext-link ext-link-type="uri" xlink:href="https://doi.org/10.6084/m9.figshare.24581616">https://doi.org/10.6084/m9.figshare.24581616</ext-link>
                    <sup>
                        <xref ref-type="bibr" rid="ref-7">7</xref>
                    </sup>
                </p>
                <p>This project contains the following extended data:</p>
                <list list-type="bullet">
                    <list-item>
                        <label>- </label>
                        <p>Baseline Survey_Final.pdf</p>
                    </list-item>
                </list>
                <p>Data are available under the terms of the 
                    <ext-link ext-link-type="uri" xlink:href="https://creativecommons.org/licenses/by/4.0/">Creative Commons Attribution 4.0 International license</ext-link> (CC-BY 4.0).</p>
            </sec>
        </sec>
        <sec>
            <title>Software availability</title>
            <p>The RandomForestClassifier that was used for classification tasks belongs to the open-source Python scikit-learn library
                <sup>
                    <xref ref-type="bibr" rid="ref-8">8</xref>
                </sup> available at 
                <ext-link ext-link-type="uri" xlink:href="https://github.com/scikit-learn/scikit-learn">https://github.com/scikit-learn/scikit-learn</ext-link>, with documentation here: 
                <ext-link ext-link-type="uri" xlink:href="https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html">https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html</ext-link>. The feature importances are provided by the "feature_importances_" attribute of scikit-learn's RandomForestClassifier, which reports each feature's impurity-based importance, i.e., its relative contribution to the predictions of the fitted random forest model.</p>
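            <p>For readers unfamiliar with the library, the following hedged sketch shows how feature importances can be read from a fitted scikit-learn RandomForestClassifier and ranked. The toy data set from make_classification, the feature names, and the hyperparameters are assumptions for illustration only and are not the study's data or settings.</p>
            <code language="python" xml:space="preserve">
```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Toy, imbalanced data standing in for survey responses (hypothetical).
X, y = make_classification(
    n_samples=500,
    n_features=6,
    n_informative=3,
    n_redundant=0,
    weights=[0.9, 0.1],
    random_state=0,
)
feature_names = [f"feature_{i}" for i in range(X.shape[1])]  # hypothetical names

# Hyperparameters are illustrative, not those used in the study.
clf = RandomForestClassifier(n_estimators=200, random_state=0)
clf.fit(X, y)

# feature_importances_ holds impurity-based importances, normalised to sum to 1.
ranked = sorted(
    zip(feature_names, clf.feature_importances_),
    key=lambda pair: pair[1],
    reverse=True,
)
for name, importance in ranked:
    print(f"{name}: {importance:.3f}")
```
</code>
            <p>These importances indicate how much each feature contributes to the classifier's splits, not the direction of the association, so a (+) or (-) sign would have to come from a separate comparison.</p>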
            <p>All the oversampling methods were performed using the open-source Python imbalanced-learn library
                <sup>
                    <xref ref-type="bibr" rid="ref-3">3</xref>
                </sup> hosted at 
                <ext-link ext-link-type="uri" xlink:href="https://github.com/scikit-learn-contrib/imbalanced-learn">https://github.com/scikit-learn-contrib/imbalanced-learn</ext-link> with documentation here: 
                <ext-link ext-link-type="uri" xlink:href="https://imbalanced-learn.org/stable/references/generated/imblearn.over_sampling.RandomOverSampler.html">https://imbalanced-learn.org/stable/references/generated/imblearn.over_sampling.RandomOverSampler.html</ext-link>.</p>
        </sec>
        <ack>
            <title>Acknowledgements</title>
            <p>We wish to thank the survey participants and JMK Consulting Limited for assisting with data collection. We are also grateful to Good Business and Now Available Africa for their partnership and input into the survey instruments. We also thank Lois Aryee and Jeremy Barofsky of ideas42 for their contributions to the research activities of this project, and Jean Paullin of the Bill &amp; Melinda Gates Foundation for her continued support and guidance on these efforts.</p>
        </ack>
        <fn-group>
            <fn>
                <p id="FNi">

                    <sup>i</sup> Stone E, Peters M. Young low and middle-income country (LMIC) smokers&#x2014;implications for global tobacco control. Transl Lung Cancer Res. 2017 Dec;6(Suppl 1):S44&#x2013;6.</p>
                <p id="FNii">

                    <sup>ii</sup> WHO | Regional Office for Africa [Internet]. [cited 2022 Oct 11]. Tobacco Control. Available from: 
                    <ext-link ext-link-type="uri" xlink:href="https://www.afro.who.int/health-topics/tobacco-control">https://www.afro.who.int/health-topics/tobacco-control</ext-link>
                </p>
                <p id="FNiii">

                    <sup>iii</sup> Agaku IT, Sulentic R, Dragicevic A, Njie G, Jones CK, Odani S, 
                    <italic toggle="yes">et al.</italic> Gender differences in use of cigarette and non-cigarette tobacco products among adolescents aged 13&#x2013;15 years in 20 African countries. Tob Induc Dis. 2024 Jan 22;22:10.18332/tid/169753.</p>
                <p id="FNiv">

                    <sup>iv</sup> Logo DD, Kyei-Faried S, Oppong FB, Ae-Ngibise KA, Ansong J, Amenyaglo S, 
                    <italic toggle="yes">et al.</italic> Waterpipe use among the youth in Ghana: Lessons from the Global Youth Tobacco Survey (GYTS) 2017. Tob Induc Dis. 2020 May 29;18:47.</p>
                <p id="FNv">

                    <sup>v</sup> Global Youth Tobacco Survey Collaborative Group. Global Youth Tobacco Survey (GYTS): Core Questionnaire with Optional Questions, Version 1.2. Atlanta, GA: Centers for Disease Control and Prevention, 2014.</p>
                <p id="FNvi">

                    <sup>vi</sup> Tyas SL, Pederson LL. Psychosocial factors related to adolescent smoking: a critical review of the literature. Tob Control. 1998 Dec 1;7(4):409&#x2013;20.</p>
            </fn>
        </fn-group>
        <ref-list>
            <ref id="ref-1">
                <label>1</label>
                <mixed-citation publication-type="journal">
                    <person-group person-group-type="author">

                        <name name-style="western">
                            <surname>Jordan</surname>
                            <given-names>MI</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Mitchell</surname>
                            <given-names>TM</given-names>
                        </name>
</person-group>:
                    <article-title>Machine learning: trends, perspectives, and prospects.</article-title>
                    <source>

                        <italic toggle="yes">Science.</italic>
</source>
                    <year>2015</year>;<volume>349</volume>(<issue>6245</issue>):<fpage>255</fpage>&#x2013;<lpage>260</lpage>.
                    <pub-id pub-id-type="pmid">26185243</pub-id>
                    <pub-id pub-id-type="doi">10.1126/science.aaa8415</pub-id>
                </mixed-citation>
            </ref>
            <ref id="ref-2">
                <label>2</label>
                <mixed-citation publication-type="journal">
                    <person-group person-group-type="author">

                        <name name-style="western">
                            <surname>Aryee</surname>
                            <given-names>LNA</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Flanagan</surname>
                            <given-names>SV</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Trupe</surname>
                            <given-names>L</given-names>
                        </name>

                        <etal/>
</person-group>:
                    <article-title>Social norms and social opportunities: a qualitative study of influences on tobacco use among urban adolescent girls in Ghana.</article-title>
                    <source>

                        <italic toggle="yes">BMC Public Health.</italic>
</source>
                    <year>2024</year>;<volume>24</volume>(<issue>1</issue>): 2978.
                    <pub-id pub-id-type="pmid">39468503</pub-id>
                    <pub-id pub-id-type="doi">10.1186/s12889-024-20413-z</pub-id>
                    <pub-id pub-id-type="pmcid">11514744</pub-id>
                </mixed-citation>
            </ref>
            <ref id="ref-3">
                <label>3</label>
                <mixed-citation publication-type="journal">
                    <person-group person-group-type="author">

                        <name name-style="western">
                            <surname>Lemaitre</surname>
                            <given-names>G</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Nogueira</surname>
                            <given-names>F</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Aridas</surname>
                            <given-names>CK</given-names>
                        </name>
</person-group>:
                    <article-title>Imbalanced-learn: a Python toolbox to tackle the curse of imbalanced datasets in machine learning.</article-title>
                    <source>

                        <italic toggle="yes">J Mach Learn Res.</italic>
</source>
                    <year>2017</year>;<volume>18</volume>(<issue>1</issue>):<fpage>559</fpage>&#x2013;<lpage>563</lpage>.
                    <ext-link ext-link-type="uri" xlink:href="https://dl.acm.org/doi/abs/10.5555/3122009.3122026">Reference Source</ext-link>
                </mixed-citation>
            </ref>
            <ref id="ref-4">
                <label>4</label>
                <mixed-citation publication-type="confproc">
                    <person-group person-group-type="author">

                        <name name-style="western">
                            <surname>He</surname>
                            <given-names>H</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Bai</surname>
                            <given-names>Y</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Garcia</surname>
                            <given-names>EA</given-names>
                        </name>

                        <etal/>
</person-group>:
                    <article-title>ADASYN: adaptive synthetic sampling approach for imbalanced learning.</article-title>In:
                    <italic toggle="yes">2008 IEEE International Joint Conference on Neural Networks (IEEE World Congress on Computational Intelligence).</italic>IEEE;<year>2008</year>;<fpage>1322</fpage>&#x2013;<lpage>1328</lpage>.
                    <pub-id pub-id-type="doi">10.1109/IJCNN.2008.4633969</pub-id>
                </mixed-citation>
            </ref>
            <ref id="ref-5">
                <label>5</label>
                <mixed-citation publication-type="journal">
                    <person-group person-group-type="author">

                        <name name-style="western">
                            <surname>Chawla</surname>
                            <given-names>NV</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Bowyer</surname>
                            <given-names>KW</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Hall</surname>
                            <given-names>LO</given-names>
                        </name>

                        <etal/>
</person-group>:
                    <article-title>SMOTE: Synthetic Minority Over-sampling Technique.</article-title>
                    <source>

                        <italic toggle="yes">J Artif Intell Res.</italic>
</source>
                    <year>2002</year>;<volume>16</volume>(<issue>1</issue>):<fpage>321</fpage>&#x2013;<lpage>357</lpage>.
                    <pub-id pub-id-type="doi">10.1613/jair.953</pub-id>
                </mixed-citation>
            </ref>
            <ref id="ref-6">
                <label>6</label>
                <mixed-citation publication-type="journal">
                    <person-group person-group-type="author">

                        <name name-style="western">
                            <surname>Cieslak</surname>
                            <given-names>DA</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Hoens</surname>
                            <given-names>TR</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Chawla</surname>
                            <given-names>NV</given-names>
                        </name>

                        <etal/>
</person-group>:
                    <article-title>Hellinger distance decision trees are robust and skew-insensitive.</article-title>
                    <source>

                        <italic toggle="yes">Data Min Knowl Disc.</italic>
</source>
                    <year>2012</year>;<volume>24</volume>(<issue>1</issue>):<fpage>136</fpage>&#x2013;<lpage>158</lpage>.
                    <pub-id pub-id-type="doi">10.1007/s10618-011-0222-1</pub-id>
                </mixed-citation>
            </ref>
            <ref id="ref-7">
                <label>7</label>
                <mixed-citation publication-type="data">
                    <person-group person-group-type="author">

                        <name name-style="western">
                            <surname>Flanagan</surname>
                            <given-names>S</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Vargas</surname>
                            <given-names>A</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Smith</surname>
                            <given-names>J</given-names>
                        </name>
</person-group>:
                    <data-title>Application of machine learning techniques to profile smoking behavior of adolescent girls in Ghana: baseline questionnaire.</data-title>
                    <source>

                        <italic toggle="yes">figshare.</italic>
</source>[Data],<year>2023</year>.
                    <ext-link ext-link-type="uri" xlink:href="http://www.doi.org/10.6084/m9.figshare.24581616.v1">http://www.doi.org/10.6084/m9.figshare.24581616.v1</ext-link>
                </mixed-citation>
            </ref>
            <ref id="ref-8">
                <label>8</label>
                <mixed-citation publication-type="journal">
                    <person-group person-group-type="author">

                        <name name-style="western">
                            <surname>Pedregosa</surname>
                            <given-names>F</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Varoquaux</surname>
                            <given-names>G</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Gramfort</surname>
                            <given-names>A</given-names>
                        </name>

                        <etal/>
</person-group>:
                    <article-title>Scikit-learn: 
                        <italic toggle="yes">machine learning in Python</italic>.</article-title>
                    <source>

                        <italic toggle="yes">J Mach Learn Res.</italic>
</source>
                    <year>2011</year>;<volume>12</volume>:<fpage>2825</fpage>&#x2013;<lpage>2830</lpage>.
                    <ext-link ext-link-type="uri" xlink:href="https://scikit-learn.org/stable/index.html">Reference Source</ext-link>
                </mixed-citation>
            </ref>
        </ref-list>
    </back>
    <sub-article article-type="reviewer-report" id="report39728">
        <front-stub>
            <article-id pub-id-type="doi">10.21956/gatesopenres.17698.r39728</article-id>
            <title-group>
                <article-title>Reviewer response for version 2</article-title>
            </title-group>
            <contrib-group>
                <contrib contrib-type="author">
                    <name>
                        <surname>Wang</surname>
                        <given-names>Runqiu</given-names>
                    </name>
                    <xref ref-type="aff" rid="r39728a1">1</xref>
                    <role>Referee</role>
                </contrib>
                <aff id="r39728a1">
                    <label>1</label>University of Nebraska Medical Center, Omaha, Nebraska, USA</aff>
            </contrib-group>
            <author-notes>
                <fn fn-type="conflict">
                    <p>
                        <bold>Competing interests: </bold>No competing interests were disclosed.</p>
                </fn>
            </author-notes>
            <pub-date pub-type="epub">
                <day>26</day>
                <month>8</month>
                <year>2025</year>
            </pub-date>
            <permissions>
                <copyright-statement>Copyright: &#x00a9; 2025 Wang R</copyright-statement>
                <copyright-year>2025</copyright-year>
                <license xlink:href="https://creativecommons.org/licenses/by/4.0/">
                    <license-p>This is an open access peer review report distributed under the terms of the Creative Commons Attribution Licence, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.</license-p>
                </license>
            </permissions>
            <related-article ext-link-type="doi" id="relatedArticleReport39728" related-article-type="peer-reviewed-article" xlink:href="10.12688/gatesopenres.14991.2"/>
            <custom-meta-group>
                <custom-meta>
                    <meta-name>recommendation</meta-name>
                    <meta-value>approve-with-reservations</meta-value>
                </custom-meta>
            </custom-meta-group>
        </front-stub>
        <body>
            <p>The paper "Application of machine learning techniques to profile smoking behavior of adolescent girls in Ghana" applies machine learning (ML) methods, mainly a random forest classifier, to identify predictors of smoking initiation among adolescent girls using data from a stepped wedge cluster randomized trial and panel survey involving 9,000 girls aged 13-19 years. The study uses the Synthetic Minority Over-sampling Technique (SMOTE) to optimize model precision and recall and employs feature importance techniques to enhance interpretability.</p>
            <p> </p>
            <p> Below are my comments:</p>
            <p> 1. Introduction: Strengthen the introduction by briefly discussing current tobacco use prevalence among adolescents in Ghana and the trend in the gender gap. Cite recent literature to contextualize the relevance and urgency of this issue, for example:</p>
            <p> a. (Ref 1)</p>
            <p> b. (Ref 2)</p>
            <p> </p>
            <p> 2. Methods:</p>
            <p> (1) I suggest including results from a conventional logistic regression model as a baseline comparison. This would provide a useful benchmark and highlight the specific advantages or limitations of the Random Forest classifier in this context.</p>
            <p> (2) While the paper presents itself as applying machine learning broadly, only the Random Forest classifier is used. The authors may consider exploring additional ML models, such as gradient boosting machines (e.g., XGBoost) or support vector machines, to assess model robustness and validate the consistency of key predictors. This would strengthen the claim that machine learning approaches (not just Random Forests) offer valuable insights in this context.</p>
            <p> (3) The manuscript does not describe the validation strategy used for model training and evaluation. It is unclear whether cross-validation, train/test split, or another method was used to compute the reported metrics (accuracy, precision, recall). Clarifying the validation framework and specifying how these metrics were calculated would strengthen the credibility and reproducibility of the results.</p>
            <p> (4) SMOTE should only be applied to the training set to avoid data leakage. The authors should clarify whether SMOTE was applied correctly after splitting the data and not on the entire dataset. Improper use could compromise the validity of performance metrics and model generalizability.</p>
            <p> </p>
            <p> 3. Results:</p>
            <p> The reported feature directions (positive or negative associations) are not derived from the Random Forest model itself, as standard feature importance metrics do not provide directionality. The authors should clarify how these directions were determined, e.g., through separate bivariate analyses, and discuss the limitations of inferring directionality outside the trained model context. Alternatively, the authors may consider using SHAP (SHapley Additive exPlanations). SHAP provides both the direction and magnitude of each feature&#x2019;s contribution to a prediction, offering a unified and theoretically grounded method for interpreting machine learning models.</p>
            <p> 4.&#x00a0;Conclusion/Discussion</p>
            <p> I have no comments here; the authors did a good job on this part.</p>
            <p>Is the work clearly and accurately presented and does it cite the current literature?</p>
            <p>Partly</p>
            <p>If applicable, is the statistical analysis and its interpretation appropriate?</p>
            <p>Yes</p>
            <p>Are all the source data underlying the results available to ensure full reproducibility?</p>
            <p>Partly</p>
            <p>Is the study design appropriate and is the work technically sound?</p>
            <p>Yes</p>
            <p>Are the conclusions drawn adequately supported by the results?</p>
            <p>Yes</p>
            <p>Are sufficient details of methods and analysis provided to allow replication by others?</p>
            <p>Partly</p>
            <p>Reviewer Expertise:</p>
            <p>Biostatistics, Machine Learning and Deep Learning, multiple testing in high-dimensional data, infectious disease</p>
            <p>I confirm that I have read this submission and believe that I have an appropriate level of expertise to confirm that it is of an acceptable scientific standard, however I have significant reservations, as outlined above.</p>
        </body>
        <back>
            <ref-list>
                <title>References</title>
                <ref id="rep-ref-39728-1">
                    <label>1</label>
                    <mixed-citation publication-type="journal">
                        <person-group person-group-type="author"/>:
                        <article-title>Social norms and social opportunities: a qualitative study of influences on tobacco use among urban adolescent girls in Ghana</article-title>.
                        <source>
                            <italic>BMC Public Health</italic>
                        </source>.<year>2024</year>;<volume>24</volume>(<issue>1</issue>) :
                        <elocation-id>10.1186/s12889-024-20413-z</elocation-id>
                        <pub-id pub-id-type="doi">10.1186/s12889-024-20413-z</pub-id>
                    </mixed-citation>
                </ref>
                <ref id="rep-ref-39728-2">
                    <label>2</label>
                    <mixed-citation>
                        <person-group person-group-type="author"/>:
                        <article-title>Profile and predictors of adolescent tobacco use in Ghana: evidence from the 2017 Global Youth Tobacco Survey (GYTS)</article-title>.
                        <source>
                            <italic>J Prev Med Hyg</italic>
                        </source>.
                        <elocation-id>10.15167/2421-4248/jpmh2021.62.3.2035</elocation-id>
                        <pub-id pub-id-type="doi">10.15167/2421-4248/jpmh2021.62.3.2035</pub-id>
                    </mixed-citation>
                </ref>
            </ref-list>
        </back>
    </sub-article>
    <sub-article article-type="reviewer-report" id="report38488">
        <front-stub>
            <article-id pub-id-type="doi">10.21956/gatesopenres.17698.r38488</article-id>
            <title-group>
                <article-title>Reviewer response for version 2</article-title>
            </title-group>
            <contrib-group>
                <contrib contrib-type="author">
                    <name>
                        <surname>Sun</surname>
                        <given-names>Ruoyan</given-names>
                    </name>
                    <xref ref-type="aff" rid="r38488a1">1</xref>
                    <role>Referee</role>
                    <uri content-type="orcid">https://orcid.org/0000-0001-8412-7727</uri>
                </contrib>
                <aff id="r38488a1">
                    <label>1</label>The University of Alabama at Birmingham, Birmingham, Alabama, USA</aff>
            </contrib-group>
            <author-notes>
                <fn fn-type="conflict">
                    <p>
                        <bold>Competing interests: </bold>No competing interests were disclosed.</p>
                </fn>
            </author-notes>
            <pub-date pub-type="epub">
                <day>11</day>
                <month>12</month>
                <year>2024</year>
            </pub-date>
            <permissions>
                <copyright-statement>Copyright: &#x00a9; 2024 Sun R</copyright-statement>
                <copyright-year>2024</copyright-year>
                <license xlink:href="https://creativecommons.org/licenses/by/4.0/">
                    <license-p>This is an open access peer review report distributed under the terms of the Creative Commons Attribution Licence, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.</license-p>
                </license>
            </permissions>
            <related-article ext-link-type="doi" id="relatedArticleReport38488" related-article-type="peer-reviewed-article" xlink:href="10.12688/gatesopenres.14991.2"/>
            <custom-meta-group>
                <custom-meta>
                    <meta-name>recommendation</meta-name>
                    <meta-value>approve</meta-value>
                </custom-meta>
            </custom-meta-group>
        </front-stub>
        <body>
            <p>The authors have addressed most of my previous comments. I have two minor questions.&#x00a0;</p>
            <p> </p>
            <p> 1. In the Introduction, new references are added as footnotes (Footnotes i to vi) but not cited. Is this because of the limit on the number of references allowed (up to 8)? If there is space for more than 8 references, then the authors should cite these new references properly.&#x00a0;</p>
            <p> </p>
            <p> 2. Tyas &amp; Pederson (Footnote vi) was published in 1998, and smoking behaviors have changed substantially since then in many countries. Are there any more recent references that could be used?</p>
            <p>Is the work clearly and accurately presented and does it cite the current literature?</p>
            <p>Partly</p>
            <p>If applicable, is the statistical analysis and its interpretation appropriate?</p>
            <p>Yes</p>
            <p>Are all the source data underlying the results available to ensure full reproducibility?</p>
            <p>Yes</p>
            <p>Is the study design appropriate and is the work technically sound?</p>
            <p>Yes</p>
            <p>Are the conclusions drawn adequately supported by the results?</p>
            <p>Yes</p>
            <p>Are sufficient details of methods and analysis provided to allow replication by others?</p>
            <p>Yes</p>
            <p>Reviewer Expertise:</p>
            <p>Tobacco control; policy evaluation; health economics; modeling and simulation.</p>
            <p>I confirm that I have read this submission and believe that I have an appropriate level of expertise to confirm that it is of an acceptable scientific standard.</p>
        </body>
    </sub-article>
    <sub-article article-type="reviewer-report" id="report36854">
        <front-stub>
            <article-id pub-id-type="doi">10.21956/gatesopenres.16324.r36854</article-id>
            <title-group>
                <article-title>Reviewer response for version 1</article-title>
            </title-group>
            <contrib-group>
                <contrib contrib-type="author">
                    <name>
                        <surname>Murray</surname>
                        <given-names>Jennifer M.</given-names>
                    </name>
                    <xref ref-type="aff" rid="r36854a1">1</xref>
                    <role>Referee</role>
                    <uri content-type="orcid">https://orcid.org/0000-0003-0622-8631</uri>
                </contrib>
                <aff id="r36854a1">
                    <label>1</label>Queen's University Belfast, Belfast, Northern Ireland, UK</aff>
            </contrib-group>
            <author-notes>
                <fn fn-type="conflict">
                    <p>
                        <bold>Competing interests: </bold>No competing interests were disclosed.</p>
                </fn>
            </author-notes>
            <pub-date pub-type="epub">
                <day>17</day>
                <month>7</month>
                <year>2024</year>
            </pub-date>
            <permissions>
                <copyright-statement>Copyright: &#x00a9; 2024 Murray JM</copyright-statement>
                <copyright-year>2024</copyright-year>
                <license xlink:href="https://creativecommons.org/licenses/by/4.0/">
                    <license-p>This is an open access peer review report distributed under the terms of the Creative Commons Attribution Licence, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.</license-p>
                </license>
            </permissions>
            <related-article ext-link-type="doi" id="relatedArticleReport36854" related-article-type="peer-reviewed-article" xlink:href="10.12688/gatesopenres.14991.1"/>
            <custom-meta-group>
                <custom-meta>
                    <meta-name>recommendation</meta-name>
                    <meta-value>approve-with-reservations</meta-value>
                </custom-meta>
            </custom-meta-group>
        </front-stub>
        <body>
            <p>This Research Note applies machine learning (ML) techniques to identify the most important variables for predicting initiation of smoking amongst 9000 adolescent girls (aged 13-19 years) participating in a social marketing program in Ghana. The data was collected through a stepped wedge cluster randomized trial with panel survey over four time-points during 12 months of 2021-2022. The results show the importance of adolescent girls' independence and connectivity, social environment, and peer influence for predicting previous smoking behavior, and subsequent smoking initiation. The study design and methods are appropriate for answering the research questions, and the work is technically sound. The statistical analysis and its interpretation are appropriate, and sufficient details of the methods and analysis have been provided to allow replication. The conclusions drawn are adequately supported by the results. The source data underlying the results are not publicly available due to privacy and ethical restrictions. However, the authors have said that they will provide the data on reasonable request. The authors have also provided a link to the survey questions in the "Extended data" section. Overall, the work is clearly and accurately presented. However, there is little reference to the current literature on adolescent smoking initiation or the behavioral theory underlying the study and outcome measures. I believe that the work is OK to be indexed&#x00a0;in its current form, but I have chosen the "approved with reservations" approval status because I have several minor comments that I think could improve the published product.</p>
            <p> </p>
            <p> Key strengths: Large sample size, methods are appropriate for answering the research questions, comprehensive dataset with outcome variables and predictors informed by behavioral theory and qualitative research findings.</p>
            <p> </p>
            <p> Key weaknesses: Little discussion of the current literature on adolescent smoking or the relevant research on behavioral theory.</p>
            <p> </p>
            <p> Comments:</p>
            <p> &#x00a0; 
                <list list-type="order">
                    <list-item>
                        <p>There is little discussion on the current literature on adolescent smoking in the introduction section of the main text. For example, although the background section of the abstract mentions tobacco trends in low- and middle-income countries, and narrowing gender gaps, this is not expanded upon in the main text and there is no mention of the adolescent smoking prevalence (for girls in Ghana). I appreciate that the article is a Research Note and therefore has a limited word count, but I would still expect to see some reference to the issue of adolescent smoking rates in the introduction section.</p>
                    </list-item>
                    <list-item>
                        <p>Similarly, throughout the paper the authors highlight that their survey items (and the ML inputs) were informed by the formative/qualitative research findings, and the behavior change theory underlying the evaluated program. However, the qualitative findings or the behavioral theory have not been summarized anywhere. I think it would be interesting to see a brief overview of the social marketing program, and how it relates to the study's outcome measures, in the methods section.</p>
                    </list-item>
                    <list-item>
                        <p>At the end of the methods section, you state that you conducted some secondary analyses using the same data inputs and predicted outcomes within subgroups of the respondents (including younger versus older girls, and different regions), and also used different combinations of data inputs and predicted outcomes. However, you were not very specific about these analyses, and the results are not reported. I suggest you should add specific details about these analyses (e.g., what age groups and regions were compared, what alternative combinations of data inputs and predicted outcomes were used) and summarize the findings in the results section. You could also upload the results tables for the secondary analyses to a repository and provide the link in the "Extended data" section. I do not think this should be left as it is because it is vague, and the results have not been discussed anywhere in the article. It would be interesting to see the results if you had used the "intentions for smoking cigarettes or shisha over the next 30 days" outcome at the end of the survey.</p>
                    </list-item>
                    <list-item>
                        <p>Similarly, in the methods section you state that you recoded some of your data's Likert scales to binary. This would cause some loss of information. Have you considered presenting alternative (sensitivity) analyses to determine the impact of using the original uncategorized outcome variables?</p>
                    </list-item>
                    <list-item>
                        <p>Although you have provided a link to the full survey in the "Extended data" section, I think you could provide some more information on the outcome measures in the methods section (e.g., insert the relevant references if they have been adapted from previous research studies, and describe whether they have been validated for use with your research population).</p>
                    </list-item>
                    <list-item>
                        <p>The study's strengths and limitations have not been discussed in the discussion section.</p>
                    </list-item>
                    <list-item>
                        <p>On page 4, where you provide an overview of the ML techniques, you state that the "RandomOverSampler" technique works by "randomly oversampling the minority class (non-smokers)". Are smokers not the minority class?</p>
                    </list-item>
                    <list-item>
                        <p>Where you describe the SMOTE technique (the optimal technique whose results you ultimately report on), you say that it works by generating samples based on the k-nearest neighbors of the minority samples. What was "k" in your models?</p>
                    </list-item>
                    <list-item>
                        <p>In your models, "lives in Accra" is frequently an important predictor. Can you comment on this result (e.g., the differences between Accra and the other communities in your sample, and why adolescents living in Accra should have increased risk of smoking)?</p>
                    </list-item>
                </list>
            </p>
            <p>Is the work clearly and accurately presented and does it cite the current literature?</p>
            <p>Partly</p>
            <p>If applicable, is the statistical analysis and its interpretation appropriate?</p>
            <p>Yes</p>
            <p>Are all the source data underlying the results available to ensure full reproducibility?</p>
            <p>Partly</p>
            <p>Is the study design appropriate and is the work technically sound?</p>
            <p>Yes</p>
            <p>Are the conclusions drawn adequately supported by the results?</p>
            <p>Yes</p>
            <p>Are sufficient details of methods and analysis provided to allow replication by others?</p>
            <p>Yes</p>
            <p>Reviewer Expertise:</p>
            <p>Public health, health behavior change, physical activity, adolescent smoking, peer influence, mediation analyses, SIENA modelling</p>
            <p>I confirm that I have read this submission and believe that I have an appropriate level of expertise to confirm that it is of an acceptable scientific standard, however I have significant reservations, as outlined above.</p>
        </body>
        <sub-article article-type="response" id="comment3749-36854">
            <front-stub>
                <contrib-group>
                    <contrib contrib-type="author">
                        <name>
                            <surname>Flanagan</surname>
                            <given-names>Sara</given-names>
                        </name>
                        <aff>ideas42, USA</aff>
                    </contrib>
                </contrib-group>
                <author-notes>
                    <fn fn-type="conflict">
                        <p>
                            <bold>Competing interests: </bold>No competing interests were disclosed.</p>
                    </fn>
                </author-notes>
                <pub-date pub-type="epub">
                    <day>19</day>
                    <month>11</month>
                    <year>2024</year>
                </pub-date>
            </front-stub>
            <body>
                <p>
                    <list list-type="order">
                        <list-item>
                            <p>There is little discussion on the current literature on adolescent smoking in the introduction section of the main text. For example, although the background section of the abstract mentions tobacco trends in low- and middle-income countries, and narrowing gender gaps, this is not expanded upon in the main text and there is no mention of the adolescent smoking prevalence (for girls in Ghana). I appreciate that the article is a Research Note and therefore has a limited word count, but I would still expect to see some reference to the issue of adolescent smoking rates in the introduction section. 
                                <list list-type="bullet">
                                    <list-item>
                                        <p>
                                            <bold>Response</bold>: This background info has been added to the Introduction section of the main text along with citations.</p>
                                    </list-item>
                                </list> </p>
                        </list-item>
                        <list-item>
                            <p>Similarly, throughout the paper the authors highlight that their survey items (and the ML inputs) were informed by the formative/qualitative research findings, and the behavior change theory underlying the evaluated program. However, the qualitative findings or the behavioral theory have not been summarized anywhere. I think it would be interesting to see a brief overview of the social marketing program, and how it relates to the study's outcome measures, in the methods section. 
                                <list list-type="bullet">
                                    <list-item>
                                        <p>
                                            <bold>Response</bold>: The qualitative study was recently published and the citation has been updated. That paper explains the context for the study in more detail and discusses the findings extensively relative to the literature.</p>
                                    </list-item>
                                </list> </p>
                        </list-item>
                        <list-item>
                            <p>At the end of the methods section, you state that you conducted some secondary analyses using the same data inputs and predicted outcomes within subgroups of the respondents (including younger versus older girls, and different regions), and also used different combinations of data inputs and predicted outcomes. However, you were not very specific about these analyses, and the results are not reported. I suggest you should add specific details about these analyses (e.g., what age groups and regions were compared, what alternative combinations of data inputs and predicted outcomes were used) and summarize the findings in the results section. You could also upload the results tables for the secondary analyses to a repository and provide the link in the "Extended data" section. I do not think this should be left as it is because it is vague, and the results have not been discussed anywhere in the article. It would be interesting to see the results if you had used the "intentions for smoking cigarettes or shisha over the next 30 days" outcome at the end of the survey. 
                                <list list-type="bullet">
                                    <list-item>
                                        <p>
                                            <bold>Response: </bold>We have updated the methods to specify the age groups (13-15 vs. 16-19) and regions (Accra vs. Kumasi), and summarized the age subgroup results for model 3 at the end of the results section. We also edited the Methods to make clear that we did not explore additional outcomes (we did not explore intention to smoke); rather, we looked at the same outcomes (ever smoking, recent smoking) at different time points.</p>
                                    </list-item>
                                </list> </p>
                        </list-item>
                        <list-item>
                            <p>Similarly, in the methods section you state that you recoded some of your data's Likert scales to binary. This would cause some loss of information. Have you considered presenting alternative (sensitivity) analyses to determine the impact of using the original uncategorized outcome variables? 
                                <list list-type="bullet">
                                    <list-item>
                                        <p>
                                            <bold>Response: </bold>SMOTE uses Euclidean distance to determine how close minority samples are to each other, but it can struggle with high-dimensional or ordinal data, such as raw Likert scales, because of a problem called &#x201c;distance distortion.&#x201d; In high-dimensional spaces, data points are more spread out, which makes their pairwise distances more similar and makes it hard to find true nearest neighbors. This distortion could make a sensitivity analysis confusing, since it would be difficult to know whether any differences reflect real predictive value or just high-dimensional effects. Moreover, we also transformed the Likert scale variables to make the feature importance results easier to interpret, so returning to the original scales may not add much value, given the potential distortion, and could reintroduce complexity to the interpretation.</p>
                                    </list-item>
                                </list> </p>
                        </list-item>
                        <list-item>
                            <p>Although you have provided a link to the full survey in the "Extended data" section, I think you could provide some more information on the outcome measures in the methods section (e.g., insert the relevant references if they have been adapted from previous research studies, and describe whether they have been validated for use with your research population). 
                                <list list-type="bullet">
                                    <list-item>
                                        <p>
                                            <bold>Response: </bold>We added a comment that the tobacco use questions were adapted from the Global Youth Tobacco Survey, and cited the formative research, which is the context in which the questionnaire was tested in this population.</p>
                                    </list-item>
                                </list> </p>
                        </list-item>
                        <list-item>
                            <p>The study's strengths and limitations have not been discussed in the discussion section. 
                                <list list-type="bullet">
                                    <list-item>
                                        <p>
                                            <bold>Response: </bold>The final paragraph of the discussion mentions the limitations of data inputs and the strengths of grounding data collection in behavioral science. We have expanded there to mention the large sample size as a strength and referenced a longer discussion of limitations related to collecting data from this study population in our formative research paper, to save on space here.</p>
                                    </list-item>
                                </list> </p>
                        </list-item>
                        <list-item>
                            <p>On page 4, where you provide an overview of the ML techniques, you state that the "RandomOverSampler" technique works by "randomly oversampling the minority class (non-smokers)". Are smokers not the minority class? 
                                <list list-type="bullet">
                                    <list-item>
                                        <p>
                                            <bold>Response: </bold>This was a typo and has been corrected; we thank the reviewer for catching it.</p>
                                    </list-item>
                                </list> </p>
                        </list-item>
                        <list-item>
                            <p>Where you describe the SMOTE technique (the optimal technique whose results you ultimately report on), you say that it works by generating samples based on the k-nearest neighbors of the minority samples. What was "k" in your models? 
                                <list list-type="bullet">
                                    <list-item>
                                        <p>
                                            <bold>Response: </bold>We used k = 5 nearest neighbors and added this detail to the description of the technique.</p>
                                    </list-item>
                                </list> </p>
                        </list-item>
                        <list-item>
                            <p>In your models, "lives in Accra" is frequently an important predictor. Can you comment on this result (e.g., the differences between Accra and the other communities in your sample, and why adolescents living in Accra should have increased risk of smoking)? 
                                <list list-type="bullet">
                                    <list-item>
                                        <p>
                                            <bold>Response: </bold>We have added a comment to the results discussion that Accra is the more cosmopolitan and diverse of the two study cities.</p>
                                    </list-item>
                                </list> </p>
                        </list-item>
                    </list>
                </p>
            </body>
        </sub-article>
    </sub-article>
    <sub-article article-type="reviewer-report" id="report36278">
        <front-stub>
            <article-id pub-id-type="doi">10.21956/gatesopenres.16324.r36278</article-id>
            <title-group>
                <article-title>Reviewer response for version 1</article-title>
            </title-group>
            <contrib-group>
                <contrib contrib-type="author">
                    <name>
                        <surname>Sun</surname>
                        <given-names>Ruoyan</given-names>
                    </name>
                    <xref ref-type="aff" rid="r36278a1">1</xref>
                    <role>Referee</role>
                    <uri content-type="orcid">https://orcid.org/0000-0001-8412-7727</uri>
                </contrib>
                <aff id="r36278a1">
                    <label>1</label>The University of Alabama at Birmingham, Birmingham, Alabama, USA</aff>
            </contrib-group>
            <author-notes>
                <fn fn-type="conflict">
                    <p>
                        <bold>Competing interests: </bold>No competing interests were disclosed.</p>
                </fn>
            </author-notes>
            <pub-date pub-type="epub">
                <day>16</day>
                <month>5</month>
                <year>2024</year>
            </pub-date>
            <permissions>
                <copyright-statement>Copyright: &#x00a9; 2024 Sun R</copyright-statement>
                <copyright-year>2024</copyright-year>
                <license xlink:href="https://creativecommons.org/licenses/by/4.0/">
                    <license-p>This is an open access peer review report distributed under the terms of the Creative Commons Attribution Licence, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.</license-p>
                </license>
            </permissions>
            <related-article ext-link-type="doi" id="relatedArticleReport36278" related-article-type="peer-reviewed-article" xlink:href="10.12688/gatesopenres.14991.1"/>
            <custom-meta-group>
                <custom-meta>
                    <meta-name>recommendation</meta-name>
                    <meta-value>approve-with-reservations</meta-value>
                </custom-meta>
            </custom-meta-group>
        </front-stub>
        <body>
            <p>Applying novel machine learning techniques to a panel survey of 9000 girls aged 13-19 in Ghana, this study identified important variables that predict smoking initiation among adolescent girls. Strengths of the study include clearly written descriptions of various machine learning approaches and model selection based on performance measures (accuracy vs precision vs recall). However, there are a few problems.</p>
            <p> 1. There is a lack of information or justification of why smoking among adolescent girls is an important issue in Ghana. 
                <list list-type="bullet">
                    <list-item>
                        <p>In the background part of the abstract, the authors mentioned &#x201c;&#x2026;, and in particular narrowing gender gaps, highlight the need for interventions to prevent and/or reduce tobacco use among adolescent girls&#x201d;. It is not clear what &#x201c;narrowing gender gaps&#x201d; means here. Are we saying boys have higher smoking prevalence and girls are catching up? If this is the case, then in the Introduction, the authors need to expand on this part with proper citations.</p>
                    </list-item>
                    <list-item>
                        <p>In the last paragraph of the Introduction, the authors mentioned &#x201c;only 1.6% of girls reported ever tried smoking at baseline and 0.1% having smoked in the past 30 days.&#x201d; These rates are extremely low, especially compared to those in developing countries. For example, past 30-day cigarette smoking among middle school students in the US was 1.1% in 2023.
                            <sup>1</sup> Why is smoking among adolescent girls in Ghana an important topic to study? Has the smoking prevalence increased in recent years? The authors need to add more background information and justification.</p>
                    </list-item>
                </list> 2. While the machine learning techniques are cool and novel, it is not clear how the findings advance our existing knowledge. Many of the factors identified, such as peer influence (is smoking common among friends) and exposure to smoking (most activities involve smoking cigarettes), are well-known factors that are associated with smoking initiation. The authors need to compare their results with existing literature and highlight their contribution. This can be done by adding a paragraph or two in the Discussion.</p>
            <p> 3. To demonstrate the advantages or benefits of machine learning techniques, the authors could conduct the same analysis using conventional regression approaches and compare the results. For example, logistic regressions can also identify risk factors that are significantly associated with ever smoking or past 30-day smoking.</p>
            <p>Is the work clearly and accurately presented and does it cite the current literature?</p>
            <p>Partly</p>
            <p>If applicable, is the statistical analysis and its interpretation appropriate?</p>
            <p>Yes</p>
            <p>Are all the source data underlying the results available to ensure full reproducibility?</p>
            <p>Yes</p>
            <p>Is the study design appropriate and is the work technically sound?</p>
            <p>Yes</p>
            <p>Are the conclusions drawn adequately supported by the results?</p>
            <p>Yes</p>
            <p>Are sufficient details of methods and analysis provided to allow replication by others?</p>
            <p>Yes</p>
            <p>Reviewer Expertise:</p>
            <p>Tobacco control; policy evaluation; health economics; modeling and simulation.</p>
            <p>I confirm that I have read this submission and believe that I have an appropriate level of expertise to confirm that it is of an acceptable scientific standard, however I have significant reservations, as outlined above.</p>
        </body>
        <back>
            <ref-list>
                <title>References</title>
                <ref id="rep-ref-36278-1">
                    <label>1</label>
                    <mixed-citation publication-type="journal">
                        <person-group person-group-type="author"/>
                        <article-title>Tobacco Product Use Among U.S. Middle and High School Students - National Youth Tobacco Survey, 2023.</article-title>
                        <source>
                            <italic>MMWR Morb Mortal Wkly Rep</italic>
                        </source>. <year>2023</year>;<volume>72</volume>(<issue>44</issue>):
                        <fpage>1173</fpage>-<lpage>1182</lpage>
                        <pub-id pub-id-type="pmid">37917558</pub-id>
                        <pub-id pub-id-type="doi">10.15585/mmwr.mm7244a1</pub-id>
                    </mixed-citation>
                </ref>
            </ref-list>
        </back>
        <sub-article article-type="response" id="comment3748-36278">
            <front-stub>
                <contrib-group>
                    <contrib contrib-type="author">
                        <name>
                            <surname>Flanagan</surname>
                            <given-names>Sara</given-names>
                        </name>
                        <aff>ideas42, USA</aff>
                    </contrib>
                </contrib-group>
                <author-notes>
                    <fn fn-type="conflict">
                        <p>
                            <bold>Competing interests: </bold>No competing interests were disclosed.</p>
                    </fn>
                </author-notes>
                <pub-date pub-type="epub">
                    <day>19</day>
                    <month>11</month>
                    <year>2024</year>
                </pub-date>
            </front-stub>
            <body>
                <p>1. There is a lack of information or justification of why smoking among adolescent girls is an important issue in Ghana. 
                    <list list-type="bullet">
                        <list-item>
                            <p>In the background part of the abstract, the authors mentioned &#x201c;&#x2026;, and in particular narrowing gender gaps, highlight the need for interventions to prevent and/or reduce tobacco use among adolescent girls&#x201d;. It is not clear what &#x201c;narrowing gender gaps&#x201d; means here. Are we saying boys have higher smoking prevalence and girls are catching up? If this is the case, then in the Introduction, the authors need to expand on this part with proper citations. 
                                <list list-type="bullet">
                                    <list-item>
                                        <p>
                                            <bold>Response</bold>: We have expanded on this background information in the Introduction section and added citations.</p>
                                    </list-item>
                                </list> </p>
                        </list-item>
                        <list-item>
                            <p>In the last paragraph of the Introduction, the authors mentioned &#x201c;only 1.6% of girls reported ever tried smoking at baseline and 0.1% having smoked in the past 30 days.&#x201d; These rates are extremely low, especially compared to those in developing countries. For example, past 30-day cigarette smoking among middle school students in the US was 1.1% in 2023.
                                <sup>1</sup>&#x00a0;Why is smoking among adolescent girls in Ghana an important topic to study? Has the smoking prevalence increased in recent years? The authors need to add more background information and justification. 
                                <list list-type="bullet">
                                    <list-item>
                                        <p>
                                            <bold>Response</bold>: Other studies had suggested increasing tobacco use, and in particular shisha, among adolescent girls in Ghana. We have expanded on this background information in the Introduction section and added citations.</p>
                                    </list-item>
                                </list> </p>
                        </list-item>
                    </list> 2. While the machine learning techniques are cool and novel, it is not clear how the findings advance our existing knowledge. Many of the factors identified, such as peer influence (is smoking common among friends) and exposure to smoking (most activities involve smoking cigarettes), are well-known factors that are associated with smoking initiation. The authors need to compare their results with existing literature and highlight their contribution. This can be done by adding a paragraph or two in the Discussion. 
                    <list list-type="bullet">
                        <list-item>
                            <p>
                                <bold>Response</bold>: We recognize that the word limit for research notes constrains how far we can expand the Discussion, but we have edited the text to specify that these insights have programmatic value for planning interventions that are responsive to this specific population and context. We also encourage readers to refer to our recently published qualitative research study (cited) for a more extensive discussion of these findings relative to the existing tobacco literature.</p>
                        </list-item>
                    </list> 3. To demonstrate the advantages or benefits of machine learning techniques, the authors could conduct the same analysis using conventional regression approaches and compare the results. For example, logistic regressions can also identify risk factors that are significantly associated with ever smoking or past 30-day smoking. 
                    <list list-type="bullet">
                        <list-item>
                            <p>
                                <bold>Response</bold>: A logistic regression model may not be the best comparison for our model, since Random Forests can better leverage the additional samples created through SMOTE: they build multiple trees on different subsets of the data, which may help them generalize better. Random Forests also offer other benefits over logistic regression. They provide insights into the importance of different predictors, making it possible to identify which features contribute most to the prediction, whereas logistic regression provides coefficients that can be harder to interpret, especially with interactions. Random Forests are also less prone to overfitting, especially in complex datasets, because they aggregate predictions from multiple trees, which helps to smooth out noise.</p>
                        </list-item>
                    </list>
                </p>
            </body>
        </sub-article>
    </sub-article>
</article>
