Introduction

Gates Open Res

Gates Open Research

2572-4754

F1000 Research Limited

London, UK

10.12688/gatesopenres.14416.1

Software Tool Article

Articles

Clinical Trial Risk Tool: software application using natural language processing to identify the risk of trial uninformativeness

[version 1; peer review: 1 approved with reservations]

Thomas A Wood ¹ ᵃ (Software, Validation, Visualization, Writing – Original Draft Preparation, Writing – Review & Editing; https://orcid.org/0000-0001-8962-8571)

Douglas McNair ² (Conceptualization, Funding Acquisition, Methodology, Project Administration, Resources, Supervision, Validation; https://orcid.org/0000-0003-0965-883X)

¹ Fast Data Science Ltd, London, England, N5 2UP, UK

² Integrated Development, Bill & Melinda Gates Foundation, Seattle, Washington, 98109, USA

a thomas@fastdatascience.com

No competing interests were disclosed.

18 4 2023

2023

3 4 2023

2023

This is an open access article distributed under the terms of the Creative Commons Attribution Licence, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Background: A large proportion of clinical trials end without delivering results that are useful for clinical, policy, or research decisions. This problem is called “uninformativeness”. Some high-risk indicators of uninformativeness can be identified at the stage of drafting the protocol; however, the necessary information can be hard to find in unstructured text documents.

Methods: We have developed a browser-based tool which uses natural language processing to identify and quantify the risk of uninformativeness. The tool reads and parses the text of trial protocols and identifies key features of the trial design, which are fed into a risk model. The application runs in a browser and features a graphical user interface that allows a user to drag and drop the PDF of the trial protocol and visualize the risk indicators and their locations in the text. The user can correct inaccuracies in the tool’s parsing of the text. The tool outputs a PDF report listing the key features extracted. The tool is focused on HIV and tuberculosis trials but could be extended to more pathologies in future.

Results: On a manually tagged dataset of 300 protocols, the tool was able to identify the condition of a trial with 100% area under curve (AUC), presence or absence of statistical analysis plan with 87% AUC, presence or absence of effect estimate with 95% AUC, number of subjects with 69% accuracy, and simulation with 98% AUC. On a dataset of 11,925 protocols downloaded from ClinicalTrials.gov, the tool was able to identify trial phase with 75% accuracy, number of arms with 58% accuracy, and the countries of investigation with 87% AUC.

Conclusion: We have developed and validated a natural language processing tool for identifying and quantifying risks of uninformativeness in clinical trial protocols. The software is open-source and can be accessed at the following link: https://app.clinicaltrialrisk.org/

Clinical trial protocol risk natural language processing uninformativeness

Gates Foundation

INV-050345

This work was supported, in whole or in part, by the Gates Foundation [INV-050345].

The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

Introduction

Uninformative trials

The goal of conducting a clinical trial is to produce evidence that can inform clinical and policy decisions. Each year, more than half a million clinical trials are run ¹, each one seeking to gather information on an intervention such as a drug, device, or behavioral intervention.

However, the majority of clinical trials do not prove or disprove a hypothesis ². A 2022 study of 125 clinical trials in heart disease, diabetes, and lung cancer by Hutchinson et al. ³ found that just over a quarter ended informatively, and other estimates are even lower ⁴.

The Declaration of Helsinki, a key set of ethical principles on clinical trials, states that “Medical research involving human subjects may only be conducted if the importance of the objective outweighs the risks and burdens to the research subjects” ⁵. In other words, research is considered unethical if there is no benefit to science.

In 2019, Deborah Zarin and colleagues addressed the problem of uninformative clinical trials in the Journal of the American Medical Association ⁶. Zarin et al. stated that for a trial to be informative, it must fulfill five conditions:

The study hypothesis must address an important and unresolved question

The study must be designed to provide evidence related to this question

The study must be feasible

The study must be conducted and analyzed in a scientifically valid manner

The study must report results accurately and promptly.

When a trial does not fulfill all of the above conditions, it is likely to be uninformative. That means that the time and money spent on the trial, not to mention the subjects’ good intentions on enrolling, were wasted.

The most common reason for trials ending uninformatively is underpowering, or inadequate sample size. Other common reasons are safety and commercial factors ⁷, ⁸.

At the stage of drafting the trial protocol, it is possible to identify a number of indicators of a high risk of uninformativeness. These include a smaller than typical sample size, a lack of statistical analysis plan, use of non-standard endpoints, or the use of cluster randomisation. Low-risk trials are often run by well-known institutions with external funding and an international or intercontinental array of sites. These indicators can be referred to as features or parameters.

Definitions of uninformativeness

In contrast to easily measured metrics, such as cost, informativeness is a somewhat subjective and even philosophical concept. One definition is that of “moral efficiency”: does the trial improve clinical practice?

There are a number of ways of quantifying informativeness. For example, Hutchinson et al. measured informativeness in a retrospective longitudinal study, by following up trials longitudinally and identifying events that happened after trial completion, such as whether the trial influenced clinical practice guidelines, or was cited as part of a high-quality systematic review ³. These indicators can be subject to bias: a trial with industry funding is more likely to be cited, all other factors being equal, but may not necessarily be more informative. Likewise, a trial which shows that a drug is unsafe may not change guidelines or be cited, but prevents further uninformative trials being run, and is therefore informative. Hutchinson et al. acknowledged that the longitudinal technique is limited to a retrospective ‘thermometer’ rather than for prospective use.

In this paper, we use the concept of the “risk of ending uninformatively”, rather than the uninformativeness directly. An expert reviewer reads a protocol and classifies it as high, medium, or low risk of ending uninformatively, based solely on the content of the document. This approach is less data-driven and also subject to human biases, but does not require a longitudinal study to implement.

Quantifying trial risk of uninformativeness

A number of initiatives have been introduced to assist investigators in assessing the risk of a clinical trial. For example, the British National Institute for Health and Care Research has published a risk assessment tool as part of their Clinical Trials Toolkit ⁹. In 2017, a team funded by the Wellcome Trust developed a single-page clinical trials risk assessment tool ¹⁰, and in 2019 the company Mediana Inc released an R package for clinical trial simulations with an aim of reducing trial risk ¹¹.

New methods for simulating trials at the planning stage are becoming more popular, such as calculating the “assurance” of a trial, introduced by O’Hagan et al. in 2005, which is the unconditional probability that the trial will yield a positive outcome ¹². The assurance method can be used to choose a sample size, and has been implemented as an R package ¹³. In 2013, Wang et al. proposed a similar Bayesian method for calculating what they called the ‘probability of study success’ (PrSS) ¹⁴.

There have also been proposals to use probability distributions for making "go/no-go" decisions at clinical milestones such as the end of Phase I, IIa, or IIb ¹⁵. In 2009, Rosen et al. proposed the use of process maps throughout the trial to monitor if the trial is being conducted according to the Standard Operating Procedures (SOP) ¹⁶.

In 2018, Wong et al. published an analysis of 406,038 entries of clinical trial data, and calculated trial success rates by trial phase, pathology, year, industry, and other factors. They found a number of interesting patterns, for example, that trials that used biomarkers as part of their selection criteria have higher overall success probabilities than trials without biomarkers ¹⁷.

In 2022, a team at the Tufts Center for the Study of Drug Development analyzed 187 protocols and subsequent trials, and found that oncology and rare disease protocols have significantly longer clinical trial cycle times and face challenges in recruitment and retention ¹⁸. Phase III oncology trials are particularly troublesome, as they often deal with very small differences between arms. 62% of Phase III oncology trials fail to deliver results with statistical significance ¹⁹.

As an analogous metric to trial risk, there is also trial complexity. There are several numerical tools for estimating the complexity (and, by extension, cost) of clinical trials using simple scoring mechanisms in the style of the Apgar score, which is a well-known formula for evaluating the health of newborn babies ²⁰–²².

Natural language processing

In recent years, natural language processing (NLP) tools have disrupted a number of industries where large unstructured text documents are commonplace ²³, ²⁴. The advent of models such as convolutional neural networks and transformer neural networks has enabled the development of AI systems which can understand complex natural language documents, such as contracts or insurance claims ²⁵–²⁸. Clinical trial protocols may be several hundred pages long and require a large time investment by highly qualified people to interpret fully.

In the legal industry, NLP software is commonplace for organizing, tracking, and performing advanced predictive analytics, clustering, and discovery on legal cases which are formed of bundles of documents. Examples of advanced legal NLP software include Luminance ²⁹ and Everlaw ³⁰, which facilitate the manual review of legal contracts, leases, messages, depositions, interview transcripts, and other documents. We are not aware of a counterpart to these toolkits in the pharmaceutical industry, although there has been research and development on applications of NLP in the field ³¹–³³.

In 2020, Richard et al. published a comparison of NLP techniques for use on clinical trial protocol deviations, focusing on term-frequency inverse-document-frequency (Tf*Idf), support vector machines (SVM), and word vector embeddings ³⁴. In the same year, Chen et al. performed a topic modeling analysis of the literature on NLP techniques in clinical trial texts, identifying key trends in NLP-enhanced clinical trial processing and research ³⁵.

Reducing the risk of a trial when the protocol is drafted

The Bill & Melinda Gates Foundation created a group called DAC (Design, Analyze, Communicate) which is intended to optimize the informativeness of research and includes resources on best practices to reduce the risk of trials ending uninformatively. An investigator choosing to work with DAC can submit their protocol draft, and receive a list of recommendations on 16 best practices for informativeness ⁴.

In 2018, David Fogel described a number of common causes of trial failure, such as underpowering, safety issues, and lack of funding. Fogel proposed several opportunities for applying artificial intelligence, in particular NLP, to identify these factors. He suggested using NLP to mine available literature and previous trials in order to determine if a trial is using appropriate endpoints, eligibility criteria, and sample sizes, and to check for internal inconsistencies in a protocol ³⁶. Fogel also suggested using other areas of AI and machine learning to profile patients to reduce the probability of attrition, modeling patient drop-out rates with neural networks.

In 2023, Chang et al. used a contrast mining framework to identify the key indicators of successful and unsuccessful cancer trials, and used NLP to extract eligibility criteria from protocol documents, among other techniques ³⁷.

We are unaware of any existing automated tool using only NLP to extract risk factors from trial protocols.

Although protocols are written in technical English, they are not constrained by any particular standard. Protocols from within a given organization generally follow a rough pattern, but there are many ways that a particular data point can be communicated: the sample size could be referred to as the “number of participants,” or “N = 90,” or the researchers could write simply “we plan to enroll up to 100 subjects per site,” and leave it to the reader to infer the sample size.

The result is that a person reading a protocol who simply wants to find the sample size, effect size, prevalence, or other figure must search the entire text for a number of possible keywords, refer to the contents page, or, even worse, read the entire document from start to finish. This is a time-consuming and error-prone process, and far from the best use of a professional’s time.

In this paper, we present a software tool called the Clinical Trial Risk Tool ³⁸, which parses the PDF of a trial protocol and identifies key features within the text, such as the number of participants (the sample size), or the presence or absence of the Statistical Analysis Plan. The Clinical Trial Risk Tool then passes these features into a simple linear risk model and calculates a risk level, which is presented to the user as a traffic-light, indicating a high, medium, or low risk of ending uninformatively. The features of the risk model can be adjusted manually and saved by the user. The tool generates a PDF or Excel report which can be shared within the organization. The tool has been designed and trained with a focus on HIV and Tuberculosis (TB) trials but could be adapted to more pathologies.

The tool has been open-sourced under MIT License and deployed to the internet at the following link: https://app.clinicaltrialrisk.org.

Methods

Implementation

The Clinical Trial Risk Tool is a web application written in Python ³⁹, using the graphical interface library Plotly Dash ⁴⁰, and the machine learning libraries NLTK ⁴¹, spaCy ⁴², and Scikit-Learn ⁴³. The tool was developed as a Docker container ⁴⁴ and can run on any browser.

The tool is architected as a Python server, which contains most of the logic and connects to a Java-based server running Tika for PDF parsing (Figure 1).

Figure 1. The user’s browser connects to the Python server, running Dash and other logic (NLP, machine learning).

This server also connects to a separate server running Java and Tika for PDF parsing.

Users have the option to log in and to create and save profiles.

The tool allows a user to upload a trial protocol in PDF format. The tool processes the PDF into plain text using the software Apache Tika ⁴⁵, and identifies features in the document content which indicate high or low risk of uninformativeness.
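The PDF-to-text step described above can be sketched as follows. This is a minimal illustration rather than the tool's own code: it assumes a Tika server reachable at the hypothetical address `http://localhost:9998` and relies on Tika Server's `PUT /tika` endpoint, which returns plain text when the request carries an `Accept: text/plain` header. The function names are illustrative.

```python
import urllib.request

# Hypothetical address of the Java/Tika server (9998 is Tika Server's default port).
TIKA_URL = "http://localhost:9998/tika"


def build_tika_request(pdf_bytes: bytes, tika_url: str = TIKA_URL) -> urllib.request.Request:
    """Build a PUT request asking Tika to return the plain text of a PDF."""
    return urllib.request.Request(
        tika_url,
        data=pdf_bytes,
        method="PUT",
        headers={"Content-Type": "application/pdf", "Accept": "text/plain"},
    )


def pdf_to_text(pdf_bytes: bytes, tika_url: str = TIKA_URL) -> str:
    """Send the PDF to the Tika server and return the extracted text."""
    with urllib.request.urlopen(build_tika_request(pdf_bytes, tika_url)) as response:
        return response.read().decode("utf-8")
```

Keeping the request construction separate from the network call makes the Python side easy to check without a running Tika server.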

The tool extracts eight features. The features are:

Pathology (limited to HIV and TB at this stage)

Trial phase

Is a Statistical Analysis Plan (SAP) present?

Is the effect estimate disclosed?

Number of subjects (sample size)

Number of arms

Countries of investigation

Does the trial use simulation for sample size determination?

The features are then passed into a scoring formula which scores the protocol from 0 to 100, and then the protocol is flagged as HIGH, MEDIUM or LOW risk.

The tool allows for risk profiles and thresholds to be saved and loaded if the user is logged in (Figure 2).

Figure 2. Screenshot of the tool in operation, classifying a clinical trial protocol.

Feature selection exercise

Before beginning development of the AI models, it was necessary to first identify the features which it would be advantageous for a tool to extract. It would be futile to spend a lot of time developing a model to extract a particular feature, only to discover that it has no or very little influence on the overall risk rating of a trial.

We conducted a survey of two subject matter experts within the Bill & Melinda Gates Foundation (BMGF) to gather information on which features would be helpful to focus on, independently of the technical difficulty of extracting that feature from the protocol text using NLP. Participants were sent a link to a questionnaire using the cloud-based software SurveyMonkey ⁴⁶, which asked them to rank features of a protocol text in order of their level of influence on the protocol’s success or failure in terms of informativeness (original questionnaire available here). A screenshot of the survey interface is given in Figure 8. The consensus was that the SAP is by far the most informative feature, so it was decided to focus initially on developing an NLP model to identify the presence or absence of an SAP. This survey, although qualitative, may be useful in future if the tool is developed further.

We took inspiration from the table of indicators of risk of uninformativeness given in Zarin et al. ⁶.

The results of our survey, ranked in descending order of importance, are given in Table 1.

Table 1. Results of a qualitative survey of feature importance for determining risk.

Weighting informativeness features	Mean score
Has a Statistical Analysis Plan	100%
Effect estimate not disclosed or unreliable	84%
Tertile of sample size by domain by phase	75%
Tertile of number of sites by domain by phase	72%
Composite product of tertile of primary duration times tertile of sample size	72%
Tertile of number of (co-)primary endpoints by domain by phase	72%
Number of endpoints	66%
Multiple countries (Y/N)	56%
Uses simulation for sample size	56%
Prevalence estimate not disclosed or unreliable	53%
Is a master protocol or a subset or derivative of a master protocol	53%
Is part of a platform trial	53%
Number of visits	50%
Duration of trial	50%
Multiple sites in a single country trial (Y/N)	47%
Number of countries with at least one site	47%
Uses model-informed drug development	47%
Tertile of primary duration	44%
Number of arms	44%
Patient consortium or trial consortium prominently involved	44%
Is an adaptive design	41%
Takes place in a hospital	38%
Phase-in-domain	38%
Recency of protocol vs today's date	38%
Recent dates in prevalence/burden citations	38%
Indicates intention or willingness to make changes at interim	38%
Number of trial sites in entire trial	31%
Number of procedures	31%
Includes analysis of real world data	28%
More than one drug in the intervention	28%
Number of mentions of the word policy	25%
Case report form pages-all trial	6%
Case report form pages per variable	0%
Duration of follow up (in months)	0%
Two-level binomial lowest tertile of sample size by domain by phase	0%

Datasets used for training and validation

Two datasets were used to train and validate the tool.

Manual dataset. A set of 300 protocols, some supplied by the BMGF and some downloaded from ClinicalTrials.gov, were read through individually and annotated with key features: the sample size, pathology, number of arms, phase, intervention type, countries of investigation, presence of SAP, effect estimate, use of simulation. The number of protocols manually annotated per parameter varied between 100 and 300.

ClinicalTrials.gov dataset. The ClinicalTrials.gov dataset was a much larger dataset of 11,925 protocols which was downloaded from the ClinicalTrials.gov AACT data dump on 4 Nov 2022 ⁴⁷. The data dump came in the form of a PostgreSQL database ⁴⁸ and included the protocol PDFs and metadata on the National Clinical Trial (NCT) ID, phase, pathology, presence or absence of an SAP, number of arms and number of subjects. However, these values were voluntarily provided by the researchers and in many cases are out of date or inaccurate.

By combining the two datasets, it was possible to combine some of the advantages of a large dataset with some of the advantages of a smaller, more accurate dataset.

Breakdown of the individual machine learning models used

Each parameter is identified in the document by a stand-alone component using machine learning and, optionally, some manually coded rules. The machine learning techniques used were naive Bayes classifiers, random forest classifiers, and convolutional neural networks. Examples of manual rules are “a number followed by a unit such as ‘mmHg’ cannot be a sample size”. In particular, country names were identified using a dictionary lookup approach with some exceptions, such as “a mention of ‘Georgia’ is most likely to be the US state unless other words occur in the vicinity which indicate Georgia the country”.
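The “Georgia” exception can be illustrated with a short sketch. The context words and the window size below are hypothetical, since the text does not list the words the tool actually uses; the sketch only shows the general shape of a dictionary lookup with a context-based override.

```python
import re

# Hypothetical context words suggesting Georgia the country; the tool's
# real word list is not given in the text.
COUNTRY_HINTS = {"tbilisi", "caucasus", "armenia", "azerbaijan"}


def georgia_is_country(text, window=10):
    """For each mention of 'Georgia', decide whether it looks like the country.

    The default is the US state; a mention is treated as the country only if
    a hint word occurs within `window` tokens on either side.
    """
    tokens = re.findall(r"[a-z]+", text.lower())
    decisions = []
    for i, token in enumerate(tokens):
        if token != "georgia":
            continue
        nearby = set(tokens[max(0, i - window): i + window + 1])
        decisions.append(bool(nearby & COUNTRY_HINTS))
    return decisions
```

For instance, “Sites are located in Atlanta, Georgia.” yields `[False]`, whereas “The trial recruits in Tbilisi, Georgia and in Armenia.” yields `[True]`.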

The output values from these components are then fed into the linear risk model, which calculates the overall risk score of the protocol.

An additional Naïve Bayes classifier was used to obtain a baseline performance on each parameter before more advanced models were trained.

Pathology (condition). The pathology of a trial is identified using a three-way Naïve Bayes classifier operating on the text of the whole document on token level, which classifies documents into HIV, TB, or Other. It treats HIV and TB as mutually exclusive, although in future work more pathologies could be covered and the tool could assign a document to multiple pathologies.

To develop this, protocols were manually tagged as HIV, TB or other and the tool learnt which words are indicative of which pathology.

The classifier was trained on the manual dataset as a three-class classifier, but could easily be extended in future to cover more pathologies.

The tool also identifies key words and phrases throughout the document which are related to pathology and presents these to the user.
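A three-way Naïve Bayes classifier of this kind can be sketched in a few lines of pure Python. This is a stand-in written for illustration, not the tool's implementation: the class, the add-one smoothing, and the six toy training sentences are all assumptions, and the real classifier was trained on full protocol texts.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    return re.findall(r"[a-z]+", text.lower())


class NaiveBayes:
    """Multinomial Naïve Bayes over word tokens with add-one smoothing."""

    def fit(self, texts, labels):
        self.class_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        for text, label in zip(texts, labels):
            self.word_counts[label].update(tokenize(text))
        self.vocab = {w for counts in self.word_counts.values() for w in counts}
        return self

    def predict(self, text):
        total_docs = sum(self.class_counts.values())
        scores = {}
        for label, n_docs in self.class_counts.items():
            counts = self.word_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(n_docs / total_docs)  # class prior
            for word in tokenize(text):
                if word in self.vocab:  # ignore words never seen in training
                    score += math.log((counts[word] + 1) / denom)
            scores[label] = score
        return max(scores, key=scores.get)


# Toy training data (invented for this example).
train_texts = [
    "antiretroviral therapy for HIV infected adults with viral load monitoring",
    "HIV pre-exposure prophylaxis and antiretroviral adherence",
    "rifampicin regimen for pulmonary tuberculosis with sputum culture conversion",
    "tuberculosis treatment shortening with isoniazid and rifapentine",
    "insulin dosing in adults with type two diabetes",
    "statin therapy after myocardial infarction",
]
train_labels = ["HIV", "HIV", "TB", "TB", "Other", "Other"]
model = NaiveBayes().fit(train_texts, train_labels)
```

The per-class word counts held in `word_counts` are also a natural source for the pathology-related key words that are shown to the user.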

Trial phase. Trial phase is represented in the model by a floating-point number (whole number or whole number plus 0.5) between 0 and 4, where 1.5 means Phase I/II. The model for extracting the phase was implemented as an ensemble between a convolutional neural network text classifier, implemented using the NLP library spaCy, and a rule-based pattern matching algorithm combined with a rule-based feature extraction stage and a random forest binary classifier, implemented using Scikit-Learn (RRID:SCR_002577). Both models in the ensemble output an array of probabilities, which were averaged to produce a final array. The phase candidate returned by the ensemble model was the maximum likelihood value.
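The averaging step of the ensemble can be sketched as below. The list of phase labels (including 3.5) and the two probability arrays are assumptions made for illustration; in the tool the arrays would come from the spaCy text classifier and the random forest.

```python
# Phase labels following the description above: whole or half numbers from 0 to 4.
PHASES = [0, 0.5, 1, 1.5, 2, 2.5, 3, 3.5, 4]


def ensemble_phase(cnn_probs, forest_probs):
    """Average two probability arrays and return the most likely phase."""
    if not (len(cnn_probs) == len(forest_probs) == len(PHASES)):
        raise ValueError("each model must output one probability per phase")
    averaged = [(a + b) / 2 for a, b in zip(cnn_probs, forest_probs)]
    best_index = max(range(len(PHASES)), key=averaged.__getitem__)
    return PHASES[best_index], averaged


# Made-up outputs: the first model leans towards Phase I/II (1.5),
# the second towards Phase II (2).
cnn_probs = [0.0, 0.0, 0.1, 0.5, 0.4, 0.0, 0.0, 0.0, 0.0]
forest_probs = [0.0, 0.0, 0.0, 0.2, 0.7, 0.1, 0.0, 0.0, 0.0]
phase, averaged = ensemble_phase(cnn_probs, forest_probs)
```

Here the averaged probability for Phase II (0.55) exceeds that for Phase I/II (0.35), so the ensemble returns 2 even though the two models disagree.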

Presence or absence of statistical analysis plan (SAP). The presence or absence of an SAP is identified via a Naïve Bayes classifier operating on the text of the whole document on word level. In addition, candidate pages which are likely to be part of the SAP are highlighted to the user using a Naïve Bayes classifier operating on the text of each page individually.

Effect estimate. A rule-based component written in spaCy identifies candidate values for the effect estimate from the numeric substrings present in the document. These can be presented as percentages, fractions, or take other surface forms. A weighted Naïve Bayes classifier is then applied to a window of 20 tokens around each candidate number found in the document, and the highest-ranking effect estimate candidates are returned. The values are displayed to the user, but only the binary value of the presence or absence of an effect estimate enters into the risk calculation.
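The candidate-and-window step can be sketched as follows. This simplified version uses whitespace tokenization and a regular expression in place of the spaCy rules, and it assumes the 20-token window is split evenly before and after the candidate, which the text does not specify. Scoring each window with the classifier is not shown.

```python
import re

# Integers, decimals, percentages, and simple fractions.
NUMBER = re.compile(r"^\d+(?:\.\d+)?%?$|^\d+/\d+$")


def candidate_windows(text, window=20):
    """Yield (candidate, context tokens) for each numeric token in the text.

    `window` is the total number of context tokens, half taken before and
    half after the candidate (an assumption made for this sketch).
    """
    tokens = text.split()
    half = window // 2
    for i, token in enumerate(tokens):
        cleaned = token.strip(".,;:()")
        if NUMBER.match(cleaned):
            context = tokens[max(0, i - half): i] + tokens[i + 1: i + 1 + half]
            yield cleaned, context


example = "We expect a 30% reduction in incidence with 90 participants per arm."
windows = list(candidate_windows(example, window=4))
```

With a window of four tokens, the candidate “30%” is paired with the context “expect a reduction in”, which is the kind of evidence a classifier could use to rank it above “90” as an effect estimate.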

Number of subjects (sample size). A rule-based component written in spaCy identifies candidate values for the sample size from the numeric substrings present in the document. These values are then passed to a random forest classifier, which ranks them by likelihood of being the true sample size, and identifies any substrings such as “per arm” or “per cohort”, which can then be used to multiply by the number of arms if applicable.
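A simplified sketch of the rule-based part of this component is given below. It uses regular expressions instead of spaCy, the unit list beyond “mmHg” is an illustrative assumption, and the random forest ranking of candidates is not shown.

```python
import re

# Units that rule a number out as a sample size. "mmHg" comes from the text;
# the other entries are illustrative additions.
UNITS = {"mmhg", "mg", "ml", "kg", "%"}
PER_GROUP = re.compile(r"\bper\s+(?:arm|cohort)\b", re.IGNORECASE)


def sample_size_candidates(text):
    """Return integers in the text that are not immediately followed by a unit."""
    candidates = []
    for match in re.finditer(r"\b(\d+)\b\s*(%|[A-Za-z]+)?", text):
        unit = (match.group(2) or "").lower()
        if unit not in UNITS:
            candidates.append(int(match.group(1)))
    return candidates


def total_sample_size(candidate, context, num_arms):
    """Scale a candidate up to a trial-wide total if it is stated per arm or cohort."""
    if PER_GROUP.search(context):
        return candidate * num_arms
    return candidate
```

For the sentence “Systolic pressure above 140 mmHg; we will enroll 90 participants per arm.”, only 90 survives as a candidate, and with three arms the total becomes 270.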

Number of arms. The number of arms is identified using an ensemble of machine learning and rule-based components, built with the NLP library spaCy and a Scikit-Learn random forest.

Countries of investigation. The countries of investigation are identified using an ensemble of machine learning and rule-based components using regular expressions and Keras convolutional neural networks, which are combined using a Scikit-Learn random forest model.

Simulation used for sample size determination. This is a Naïve Bayes classifier operating on the text of each page individually. If a page contains information about simulation being used for sample size, the classifier classifies that page as 1, otherwise as 0. If any page in the whole document is classified as class 1, then the protocol is considered to have used simulation for sample size determination.

Although trials may use simulation at various points, the data tagged for simulation includes only trials using simulation specifically for sample size planning. Trials using simulation for later stages of statistical analysis are excluded.
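The page-to-document aggregation for the simulation feature reduces to a logical OR over pages, as sketched here. The keyword-based `keyword_classifier` is only a placeholder for the trained per-page Naïve Bayes classifier.

```python
def uses_simulation(pages, page_classifier):
    """Return True if any page is classified as class 1 (simulation for sample size)."""
    return any(page_classifier(page) == 1 for page in pages)


def keyword_classifier(page_text):
    # Placeholder for the trained per-page classifier: returns 1 or 0.
    text = page_text.lower()
    return int("simulat" in text and "sample size" in text)


pages = [
    "Background and rationale.",
    "The sample size was determined by simulation of 10,000 trials.",
]
document_uses_simulation = uses_simulation(pages, keyword_classifier)
```

One consequence of this design is that a single false positive page is enough to flag the whole protocol, so the per-page classifier's precision matters more than its recall.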

Sample size tertiles. The sample size is not fed directly into the risk model, but is converted into a value of 0, 1, or 2, representing the tertile of that sample size within a dataset of comparable trials (same phase and pathology).

The default sample size tertile threshold values are given in Table 2, but the user can change these values.

Table 2. Default sample size tertiles for HIV and tuberculosis (TB).

Phase	HIV lower tertile	HIV upper tertile	TB lower tertile	TB upper tertile
0	10	15	10	15
0.5	40	130	30	60
1	40	130	30	60
1.5	80	280	40	80
2	100	300	50	100
2.5	1000	2000	500	1500
3	1000	2000	500	1500
4	3000	4000	3000	4000

The default sample size tertiles were derived from a sample of 21 trials in LMICs, but have been rounded and manually adjusted based on statistics from ClinicalTrials.gov data.

The tertiles were first calculated using the training dataset, but for a number of phase and pathology combinations the data was too sparse, so tertile values from ClinicalTrials.gov had to be used instead. The ClinicalTrials.gov data dump from 28 Feb 2022 was used.
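The conversion from sample size to tertile can be sketched using the default thresholds in Table 2. How values exactly equal to a threshold are treated is not stated in the text; this sketch assumes they fall into the higher tertile.

```python
# Default thresholds from Table 2: phase -> (lower tertile, upper tertile).
TERTILES = {
    "HIV": {0: (10, 15), 0.5: (40, 130), 1: (40, 130), 1.5: (80, 280),
            2: (100, 300), 2.5: (1000, 2000), 3: (1000, 2000), 4: (3000, 4000)},
    "TB": {0: (10, 15), 0.5: (30, 60), 1: (30, 60), 1.5: (40, 80),
           2: (50, 100), 2.5: (500, 1500), 3: (500, 1500), 4: (3000, 4000)},
}


def sample_size_tertile(sample_size, pathology, phase, thresholds=TERTILES):
    """Map a sample size to 0 (small), 1 (medium), or 2 (large)."""
    lower, upper = thresholds[pathology][phase]
    if sample_size < lower:
        return 0
    if sample_size < upper:
        return 1
    return 2
```

For example, a Phase 2 HIV trial with 150 subjects falls between 100 and 300 and is assigned tertile 1, while a Phase 2 TB trial of the same size exceeds the upper threshold of 100 and is assigned tertile 2.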

Linear risk model. The features extracted by the NLP components are fed into a linear scoring formula, which was designed for this software.

Each parameter is converted into an integer or floating-point number and multiplied by an associated weight, and the weighted values are summed to give a score between 0 and 100. From this score, the protocol is flagged as HIGH, MEDIUM or LOW risk. For example, a protocol scores 26 points for having a completed SAP, and a protocol scoring 50 points or more in total across all features is considered low risk. The linear formula has a bias term of -7.

Protocols scoring 50 or above are considered by default to be low risk. Protocols which score 40 or above but below 50 are marked as medium risk, and scores below 40 are high risk.

Our formula can be summarized as follows:

$$s = 26\,x_{\text{SAP}} + 16\,x_{\text{effect estimate}} + 10\,x_{\text{sample size tertile}} + 10\,x_{\text{international}} + 10\,x_{\text{simulation}} + 5\,x_{\text{phase}} + 2\,x_{\text{arms}} - 7$$

where the $x_i$ are the features extracted from the text. All features are binary variables except for sample size tertile (0 = small trial, 1 = medium trial, 2 = large trial), phase, and number of arms (which is capped at 5 to avoid distortions caused by any trials with an unusually large number of arms).

Our formula can be seen as a form of linear regression, where the weights were arrived at via human reasoning rather than a loss function.

The risk values were arrived at as part of a qualitative process in consultation with subject matter experts, who identified the features that they would look for in assessing a protocol for risk manually. The consensus was that the SAP is by far the strongest predictor of risk (a trial lacking an SAP is extremely unlikely to succeed).

The default feature weights are given in Table 3.

Table 3. High and low risk thresholds for the total protocol score, and the default weights.

Feature	Value or weight
High risk threshold	40
Low risk threshold	50
Number of arms	2
Trial phase	5
SAP completed?	26
Effect Estimate disclosed?	16
Number of subjects low/ medium/high	10
Trial is international?	10
Trial uses simulation?	10
Constant (bias)	-7
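The scoring and thresholding can be reproduced in a few lines using the default weights and thresholds from Table 3. The feature names and the example protocol are illustrative; the sketch does not clip the score to the 0–100 range, since the text does not describe how out-of-range values are handled.

```python
# Default weights and thresholds from Table 3.
WEIGHTS = {
    "sap": 26, "effect_estimate": 16, "sample_size_tertile": 10,
    "international": 10, "simulation": 10, "phase": 5, "arms": 2,
}
BIAS = -7
HIGH_RISK_THRESHOLD = 40
LOW_RISK_THRESHOLD = 50
MAX_ARMS = 5  # the number of arms is capped at 5


def risk_score(features):
    """Weighted sum of the features plus the bias term."""
    capped = dict(features, arms=min(features["arms"], MAX_ARMS))
    return BIAS + sum(WEIGHTS[name] * capped[name] for name in WEIGHTS)


def risk_level(score):
    """Flag a score as LOW (50 or above), MEDIUM (40 to below 50), or HIGH (below 40)."""
    if score >= LOW_RISK_THRESHOLD:
        return "LOW"
    if score >= HIGH_RISK_THRESHOLD:
        return "MEDIUM"
    return "HIGH"


# Illustrative protocol: SAP present, effect estimate disclosed, medium sample
# size, international, no simulation, Phase 2, two arms.
example = {"sap": 1, "effect_estimate": 1, "sample_size_tertile": 1,
           "international": 1, "simulation": 0, "phase": 2, "arms": 2}
example_score = risk_score(example)
```

The example protocol scores 26 + 16 + 10 + 10 + 0 + 10 + 4 − 7 = 69 and is flagged LOW risk. Removing the SAP alone would lower the score to 43 (MEDIUM), which shows how heavily that single feature is weighted.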

Operation

The Clinical Trial Risk Tool can be accessed via any web browser at https://app.clinicaltrialrisk.org. All computations are conducted remotely on a Python server. The software has an embedded video tutorial to ease the learning process. The user interface contains mouseover tooltips with layperson-friendly explanations of the options in the tool.

The user can adjust the sample size tertile thresholds and weights associated with the features in the user interface and save this as a configuration file.

A user has the option to click on the Login button to create a user account and save their configuration on the server. Authentication is managed by the third party authentication provider Auth0.com.

If a user wishes to use the application anonymously, all functionality is still available without logging in, but the user is not able to save and retrieve profiles at a later date.

Workflow

A user uploads a PDF file of a clinical trial protocol, either by dragging and dropping it into the tool, or by using a file selector dialog. On the server side, the tool parses the raw PDF file into plain text, and then presents the user with the features that were identified in the text: pathology, phase, SAP, effect estimate, sample size, sample size tertile, number of arms, countries of investigation, and simulation.

The user can then correct these features by clicking on dropdowns and selecting or typing the correct value in the GUI.

In real-time, the features are fed into the risk model which presents the protocol’s risk level as a color-coded HIGH, MEDIUM or LOW risk.

The GUI includes a graph view of the key terms’ locations within the document by page number, allowing the user to quickly identify pages which are heavy in statistical content or other relevant terms. The tool’s analysis of the protocol of an HIV trial ⁴⁹ is shown in Figure 3.

Figure 3. The graphical user interface showing the graph view of key statistical analysis plan-related terms by page number in the document.

The user can export the risk assessment with all explanations and key figures to an Excel or PDF file.

Finally, if the user has changed the sample size tertile thresholds or feature weights, this configuration can be saved on the server (if the user is authenticated), or to the user’s local machine.

Results

User testing

The tool was tested by internal and external subject matter experts, who provided feedback throughout the project. In this way, inaccuracies and pain points could be identified and fixed in an iterative process.

Validation

For validation on the manual dataset, cross-validation was used. For the ClinicalTrials.gov dataset, trials whose numeric NCT ID has a third digit of 0–7 were used for training, those with a third digit of 8 were used for validation, and those with a third digit of 9 were held out as a future test set.
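The split rule for the ClinicalTrials.gov dataset can be written as a small deterministic function. This sketch assumes that the third digit is counted from the left of the eight-digit number following the “NCT” prefix, leading zeros included; the IDs in the usage note are made up.

```python
import re


def dataset_split(nct_id):
    """Assign a trial to train, validation, or test by the third digit of its NCT number."""
    match = re.fullmatch(r"NCT(\d{8})", nct_id.strip().upper())
    if match is None:
        raise ValueError(f"not a valid NCT ID: {nct_id!r}")
    third_digit = int(match.group(1)[2])
    if third_digit <= 7:
        return "train"
    return "validation" if third_digit == 8 else "test"
```

Under this assumption, `NCT01234567` goes to training, `NCT00812345` to validation, and `NCT04912345` to the held-out test set. Because the rule depends only on the ID, a trial always lands in the same partition when the dataset is refreshed.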

Validation scores for manual dataset

The validation scores on the small manually labeled dataset (about 100 protocols labeled, but 300 labeled for number of subjects) are given in Table 4.

Table 4. Validation scores on manual dataset.

Component	Accuracy – manual validation dataset	AUC – manual validation dataset	Technique
Condition (Naïve Bayes)	88%	100%	Naïve Bayes
Statistical analysis plan (Naïve Bayes)	85%	87%	Naïve Bayes
Effect Estimate	73%	95%	Ensemble: rule based + Naïve Bayes
Number of Subjects	69% (71% within 10% margin)	N/A	Ensemble: rule based + Random Forest
Simulation	94%	98%	Naïve Bayes

Each component was validated using accuracy and Area Under the Curve (AUC).

Validation scores for ClinicalTrials.gov dataset

Accuracy figures are reported in Table 5 together with performance of a comparable Naïve Bayes baseline model trained on the ClinicalTrials.gov training dataset, which can provide an estimate of a reasonable baseline performance.

Table 5. Validation scores on the ClinicalTrials.gov dataset.

Component	Accuracy – ClinicalTrials.gov validation dataset	Baseline Accuracy (Naïve Bayes) – ClinicalTrials.gov validation dataset	Technique
Phase	75%	45%	Ensemble: rule based + Random Forest
SAP	82%	82%	Naïve Bayes
Number of Subjects	13%	6%	Ensemble: rule based + Random Forest
Number of Arms	58%	52%	Ensemble: rule based + Random Forest
Countries of Investigation	AUC 87%	N/A	Ensemble: rule based + CNN + Random Forest + Naïve Bayes

Validation scores for Hutchinson et al. dataset

In addition to validating the performance of the NLP components of the tool, it was also necessary to validate the risk model.

We took the dataset of 125 trials analyzed by Hutchinson et al. in their 2022 analysis, where they attempted to establish the proportion of RCTs that inform clinical practice ³. Unfortunately, only six of the protocols in that study could be located from ClinicalTrials.gov.

We passed these six protocols through the tool and compared the risk output of the tool to whether or not Hutchinson et al. considered the trials informative (Table 6). On this small dataset, the tool predicted informativeness with 100% AUC (the two trials scoring 60 or below were not informative). This was a useful sanity check for the risk model, although the test set is far too small for this test to be scientifically rigorous.

Table 6. Validation scores on Hutchinson et al. dataset.

NCT	Disease	Informative (Hutchinson et al.)	Risk score from tool	Risk label from tool
NCT00946712	LUNG	0	72	LOW
NCT01032629	DIAB	1	60	LOW
NCT01107626	LUNG	0	62	LOW
NCT01144338	DIAB	1	48	MEDIUM
NCT01205776	CVS	0	69	LOW
NCT01206062	CVS	0	69	LOW
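The perfect separation in Table 6 can be checked by hand with a pairwise calculation, which is equivalent to the AUC when one group is ranked against the other. The sketch below uses the third column exactly as printed and makes no assumption about how that column is coded. It only measures how often a trial with value 1 received a lower score than a trial with value 0. The function name is hypothetical.

```python
# (NCT ID, value in the third column of Table 6, risk score from the tool)
TABLE_6 = [
    ("NCT00946712", 0, 72),
    ("NCT01032629", 1, 60),
    ("NCT01107626", 0, 62),
    ("NCT01144338", 1, 48),
    ("NCT01205776", 0, 69),
    ("NCT01206062", 0, 69),
]


def pairwise_separation(rows):
    """Fraction of (value 1, value 0) pairs in which the value-1 trial has
    the lower score; ties count as half. Equivalent to a pairwise AUC."""
    flagged = [score for _, flag, score in rows if flag == 1]
    others = [score for _, flag, score in rows if flag == 0]
    wins = 0.0
    for flagged_score in flagged:
        for other_score in others:
            if flagged_score < other_score:
                wins += 1.0
            elif flagged_score == other_score:
                wins += 0.5
    return wins / (len(flagged) * len(others))


if __name__ == "__main__":
    print(pairwise_separation(TABLE_6))
```

With two trials in one group and four in the other there are only eight pairs, which illustrates why a perfect score on this dataset is a sanity check rather than strong evidence.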

Discussion

We have validated the individual components of the tool separately. Accuracies vary among the datasets and components validated.

In particular, the sample size component’s accuracy on the ClinicalTrials.gov dataset was very low. The low performance for that value is due to the lack of a reliable gold standard, rather than low performance of the risk tool itself. The sample size identification was particularly challenging and required the manual labeling of 300 protocols in order to achieve a performance that was acceptable in user testing. This is because the sample size cannot be reduced to a simple three- or four-way classification problem like many of the other features, but is a problem of data extraction with many confounding factors such as false positives.

It is fortunate that some of the most important features, such as the presence or absence of the SAP, were relatively easy to identify with machine learning (since SAP can be reduced to a binary classification problem, which is one of the easiest kinds of problems to solve in machine learning).

We were able to look inside the parameters of the models that are used to extract the individual features, in order to search for any potential improvements. For example, the sample size extraction component identifies candidate sample sizes in the text using a set of manually created rules, and calculates features for each of them (distance in tokens to the term “sample size”, etc). The Random Forest model allows us to visualize the feature importances of the model, and we see at a glance that the strongest indicators that a number in the text is the true sample size are the distance to the terms “sample size” and “number of subjects”, followed by the num_occurrences (the number of times that number occurs in the text). The feature importances of the sample size classifier are shown in Figure 4.
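The candidate-and-features approach described above can be sketched as follows. This is a simplified illustration and not the tool's own code: the tokenizer, the rule that every number is a candidate, and the dictionary keys other than num_occurrences are assumptions. In the tool, features of this kind are passed to a Random Forest, a step that is omitted here.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into word and number tokens."""
    return re.findall(r"\d+(?:,\d{3})*|[a-z]+", text.lower())


def phrase_positions(tokens, phrase):
    """Start indices at which the multi-word phrase occurs in the tokens."""
    words = phrase.split()
    last_start = len(tokens) - len(words)
    return [i for i in range(last_start + 1) if tokens[i:i + len(words)] == words]


def candidate_features(text, phrases=("sample size", "number of subjects")):
    """Return one feature dict for each number found in the text."""
    tokens = tokenize(text)
    numbers = [
        (i, int(tok.replace(",", "")))
        for i, tok in enumerate(tokens)
        if tok[0].isdigit()
    ]
    occurrences = Counter(value for _, value in numbers)
    positions = {p: phrase_positions(tokens, p) for p in phrases}
    features = []
    for index, value in numbers:
        row = {"value": value, "num_occurrences": occurrences[value]}
        for phrase, starts in positions.items():
            key = "dist_" + phrase.replace(" ", "_")
            # Distance in tokens to the nearest mention, or None if absent.
            row[key] = min((abs(index - s) for s in starts), default=None)
        features.append(row)
    return features


if __name__ == "__main__":
    example = (
        "The sample size is 120 participants. Visits occur at week 4 "
        "and week 12. A total of 120 subjects will be enrolled."
    )
    for row in candidate_features(example):
        print(row)
```

In this example the number 120 appears twice and sits three tokens from "sample size", whereas the visit weeks 4 and 12 appear once each and are further away, which is the kind of signal a classifier can learn from.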

Figure 4. Feature importances for sample size extractor (random forest).

Likewise, the feature importances for the component that extracts mentions of “simulation” are shown in Figure 5.

Figure 5. Feature importances for simulation extractor (random forest).

We have also explored the performance of the models using more sophisticated metrics than AUC and accuracy. For example, Figure 6 shows the confusion matrix of the phase extractor. We can see at a glance that the commonest phases in the dataset are 2 and 3, and phase 2 is likely to be confused with phase 1.5 (I/II).

Figure 6. Confusion matrix for phase extractor (ensemble model).

The confusion matrix visualization also makes it clear how much harder the sample size identification is compared to the other features that the tool extracts from the protocol text. Figure 7 shows the confusion matrix for the sample size detection component.

Figure 7. Confusion matrix for sample size extractor (ensemble model).

Figure 8. The survey on informativeness features.

In our accuracy calculations, we have considered a sample size to be correct only when it is exactly equal to the true value, so a predicted value of 61 for a ground truth of 62 would be considered an error. For the purposes of the confusion matrix, we allowed a tolerance of 1 significant figure. We can see at a glance that low sample sizes (10–30) are the ones most likely to be confused by the model.
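The difference between these scoring rules can be made concrete with a short sketch. The function names are hypothetical, and two details are assumptions: the 10% margin is measured relative to the ground truth, and rounding to one significant figure uses Python's built-in round, which rounds exact halves to the nearest even digit.

```python
def exact_accuracy(predicted, truth):
    """Share of predictions that equal the ground truth exactly."""
    hits = sum(1 for p, t in zip(predicted, truth) if p == t)
    return hits / len(truth)


def within_margin_accuracy(predicted, truth, margin=0.10):
    """Share of predictions within a relative margin of the ground truth."""
    hits = sum(1 for p, t in zip(predicted, truth) if abs(p - t) <= margin * t)
    return hits / len(truth)


def round_to_sig_figs(n, sig=1):
    """Round a non-negative integer to the given number of significant figures."""
    if n == 0:
        return 0
    magnitude = len(str(n)) - 1          # 62 -> 1, 120 -> 2
    factor = 10 ** max(magnitude - sig + 1, 0)
    return round(n / factor) * factor


if __name__ == "__main__":
    predicted, truth = [61, 100], [62, 100]
    print(exact_accuracy(predicted, truth))
    print(within_margin_accuracy(predicted, truth))
    print(round_to_sig_figs(61), round_to_sig_figs(62))
```

Under exact matching, a prediction of 61 for a true value of 62 counts as an error, while it is accepted under the 10% margin and falls into the same confusion-matrix cell after rounding to one significant figure.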

We have provided Jupyter notebooks in the repository to run the validation and reproduce the results.

It was not possible to conduct a thorough analysis of the linear risk model due to data on “informativeness” of clinical trials being harder to obtain, and the intersection of that data with the available trial protocol documents being small. Further studies are needed to validate the risk modeling part of the tool.

Example scenarios and user journeys with the Clinical Trial Risk Tool

Scenario 1: triage

A funding organization receives large volumes of incoming protocols. They have a team of reviewers who are reading the documents and categorizing them as ‘go’ or ‘no-go’. The majority of protocols are not accepted for funding, because they do not meet some of the funder’s criteria. The organization would prefer to spend less time on the high-risk protocols.

Using the Clinical Trial Risk Tool, the reviewers would be able to quickly identify the incoming protocols which should not be considered for funding, such as those which are missing key statistical information. This frees up more of their time to process the high-quality protocols.

Scenario 2: standardization of review

When protocols are passed to reviewers, each reviewer typically comes from a different background and brings with them their own way of viewing a protocol. The reviewing team could use the tool to calibrate and standardize their review processes for greater consistency. For example, they could agree on a standard set of weights and parameters for the model and save it on an organizational or departmental level.

Scenario 3: pre-submission vetting

An investigator is preparing a trial protocol for submission as part of a funding application. Each funding organization has its own checklist of key ‘must-haves’ and ‘should-haves’ in a trial. The applicant uses the Clinical Trial Risk Tool to vet their protocol and identify any weak points. For example, the tool may flag the trial as high risk because the expected effect estimate is not clearly stated. This gives the investigator an opportunity to correct the issue before submission, increasing the chances of acceptance.

Scenario 4: training

The tool can be used for education and training of investigators or reviewers on what makes a robust protocol, facilitating the upskilling of junior reviewers.

Scenario 5: auto-populating risk questionnaire

Some funding organizations, such as the Bill & Melinda Gates Foundation, require a risk assessment questionnaire (the DAC risk assessment questionnaire) to be submitted together with the protocol. If the tool is exposed as an application programming interface (API), it can be used for auto-population of the risk assessment questionnaire. This streamlines the submission process, as the tool can retrieve important information from the PDF in seconds, freeing the applicant to do other tasks.

Scenario 6: adapting source code for a new domain

A pharmaceutical company may wish to use the tool to estimate the cost of an oncology trial. The tool source code is open source, so the pharmaceutical company can engage a developer to modify the tool to estimate a dollar value of the trial. New features have to be added, such as cancer stage and number of chemotherapy cycles, but fortunately the developer can ‘recycle’ the code that is currently identifying trial phase for these purposes. The company has a database of past trials and confidential and sensitive industry data on their cost over the last ten years, which are used to train a regression model to predict the cost. The tool’s performance can be validated on data from the most recent trials if those data have been withheld from the training data. The pharmaceutical company now has a customized in-house cost estimation tool. Since the Clinical Trial Risk Tool is under an MIT License, the pharma company is not obligated to share its in-house cost model, which contains industry-sensitive data, but they choose to put the oncology-specific NLP features that they have added to the tool in the public domain.

Conclusions

We have developed a software tool which we believe is unique in using natural language processing to provide a risk profile of a clinical trial protocol.

The tool can assist a human in assessing the risk of uninformativeness of a trial, and understanding which factors contribute to the risk of uninformativeness. With the use of this tool, reviewers may be able to assess trials more rapidly, and the tool could be used to inform stakeholders about the most impactful features for risk of uninformativeness. The tool can also assist reviewers in assessing trials more consistently, and investigators may use it to validate their draft protocols before submitting them to a funding organization.

The use of the tool is intuitive, and the software is open source and can be accessed via any web browser, allowing clinical trial investigators who do not have expertise in software or programming to use the tool.

Since the software is open source under an MIT License, an investigator can easily fork the project and extend it to another field such as oncology, or to predict trial cost or complexity, with relatively little effort.

Validation of the tool has been complex because each component of the tool has been designed independently, and the data on ClinicalTrials.gov is not entirely accurate because it depends on researchers updating their profiles manually. It was time consuming to manually annotate large numbers of protocols, but further manual labeling could pave the way for further improvements in accuracy. There is still much scope for improvement of several features, especially sample size.

The tool is trained to detect only two pathologies, HIV and TB. However, if a user uploads a protocol from a different pathology, they could still use the tool for risk assessment, but they would need to set appropriate values for the feature weights and sample size tertiles. For some high-risk pathologies, such as oncology or cardiovascular disease, we would not expect the tool to be as accurate at identifying risk, because of the importance of other features, such as biomarkers, enrolment criteria, toxicity of treatment, and chemotherapy cycles, which are not currently handled in the tool, but which are important for these pathologies ⁷.

Future work on this project could involve broadening the scope to more pathologies, or altering the tool to predict cost, complexity or other key metrics of a trial. If we were to extract further features from the text using NLP, candidate features would include the number of endpoints, the prevalence estimate not being disclosed, the trial being a platform trial, the protocol being a master protocol, and more.

User-requested features include support for multi-document protocols (e.g. protocol and SAP in separate PDFs), support for processing multiple documents at the same time, and exposing the tool as an API or library.

One potential future extension of the project would see the tool developed further into a case management system, which would ingest protocols, SAPs, questionnaires, and regulatory paperwork, and track the associated metadata on trial level, similar to the legal case management systems described in the Introduction.

Ethics and consent

No ethical approval was sought for this study due to the very low risk nature of the survey conducted, where no personal or identifiable information was collected.

Completion of this survey implied consent for data collection. Written informed consent for publication and use of their data was obtained from each participant before the publication of this manuscript.

Abbreviations

SAP: Statistical Analysis Plan

NLP: Natural Language Processing

HIV: Human Immunodeficiency Virus

TB: Tuberculosis

CNN: Convolutional Neural Network

NLTK: Natural Language Toolkit

Tf*Idf: term frequency*inverse document frequency

AI: Artificial Intelligence

GUI: Graphical User Interface

AUC: Area Under the [ROC] Curve

ROC: Receiver Operating Characteristic

PDF: Portable Document Format

API: Application Programming Interface

RCT: Randomized Controlled Trial

Data availability

Source data

A set of protocols in text format, and accompanying metadata, were used for the training and evaluation of the tool. The majority of protocols used in training were taken from ClinicalTrials.gov, and the source repository of this tool contains instructions on downloading the data. A small number of protocols are not available on the internet and are internal to the Bill & Melinda Gates Foundation.

The list of 125 protocols used for validation of the risk model was taken from Hutchinson et al. ³. Their dataset is available at https://doi.org/10.17605/OSF.IO/3EGKU.

Underlying data

Zenodo: Feature weights for protocol informativeness. https://doi.org/10.5281/zenodo.7769176 ⁵⁰.

This project contains the following underlying data:

v1 BMGF DAC Feature Weights Informativeness.xlsx (Responses to SurveyMonkey questionnaire)

Data are available under the terms of the Creative Commons Zero "No rights reserved" data waiver (CC0 1.0 Public domain dedication).

Software availability

Software available from: https://app.clinicaltrialrisk.org/

Source code: https://github.com/fastdatascience/clinical_trial_risk

Archived source code at time of publication: https://doi.org/10.5281/zenodo.7633872 ³⁸

License: MIT.

Acknowledgements

We would like to acknowledge the help and advice given by Shawn Dolley and Dr. Thea C. Norman.

Appendix

Future

If the NCT # is found in the protocol, sample size data can be retrieved from ClinicalTrials.gov API.

Number of sites

Primary duration

Number of primary endpoints

Prevalence estimate not disclosed

Is a master protocol or a subset or derivative of a master protocol

Is part of a platform trial

Number of visits

Duration of trial

Multiple sites in a single country trial

Number of countries with at least one site

Uses model-informed drug development

Tertile of primary duration

Patient consortium or trial consortium prominently involved

Is an adaptive design

Takes place in a hospital

phase-in-domain

Recency of protocol vs today's date

Recent dates in prevalence/burden citations

Indicates intention or willingness to make changes at interim

Number of trial sites in entire trial /

Number of procedures

Includes analysis of real-world data

More than 1 drug in the intervention cocktail

Number of mentions of the word policy

Case report form pages - all trial

Case report form pages per variable

Duration of follow up (in months)

External sponsorship

Non-standard endpoint

Trial uses cluster sampling

No trial database used

High number of follow-up appointments

Strict recruitment criteria (age, medical history)

Crossover design

Multiple consents, tests and forms for participants to fill out

Multiple randomisation steps

Extended investigational treatment or lengthy regimen until progression

Low disease prevalence

Trial takes place in hospital

Trial is a platform trial

Trial has sub-studies

Trial used model informed approach

Complex age criteria in recruitment

References

1. https://www.who.int/observatories/global-observatory-on-health-research-and-development/monitoring/number-of-trial-registrations-by-year-location-disease-and-phase-of-development

2. Yordanov, Dechartres, Porcher: Avoidable waste of research related to inadequate methods in clinical trials. BMJ. 2015;350:h809. PMID: 25804210; DOI: 10.1136/bmj.h809; PMCID: 4372296.

3. Hutchinson, Moyer, Zarin: The proportion of randomized controlled trials that inform clinical practice. eLife. 2022;11:e79491. PMID: 35975784; DOI: 10.7554/eLife.79491; PMCID: 9427100.

4. Bill & Melinda Gates Foundation: Uninformative research is the global health crisis you’ve never heard of. 2023; retrieved 12 Feb 2023.

5. World Medical Association: Declaration of Helsinki. (1964, rev. 2022).

6. Zarin, Goodman, Kimmelman: Harms from uninformative clinical trials. JAMA. 2019;322(9):813–814. PMID: 31343666; DOI: 10.1001/jama.2019.9892.

7. Grignolo, Pretorius: Phase III trial failures: Costly, but preventable. Appl Clin Trials. 2016;25(8):36–42.

8. Hwang, Carpenter, Lauffenburger: Failure of investigational drugs in late-stage clinical development and publication of trial results. JAMA Intern Med. 2016;176(12):1826–1833. PMID: 27723879; DOI: 10.1001/jamainternmed.2016.6008.

9. National Institute for Health and Care Research: Clinical Trials Toolkit: Risk Assessment. (retrieved 12 Feb 2023).

10. Fuller: Developing a study risk assessment tool. UKCRF Network study risk assessment tool group, 2017.

11. Dressler: Clinical Trial Optimization Using R. Alex Dmitrienko and Erik Pulkstenis. Boca Raton, FL: Chapman & Hall/CRC Press, 2019;73(2):210–211. DOI: 10.1080/00031305.2019.1603479.

12. O’Hagan, Stevens, Campbell: Assurance in clinical trial design. Pharm Stat. 2005;4(3):187–201. DOI: 10.1002/pst.175.

13. Alhussain, Oakley: Assurance for clinical trial design with normally distributed outcomes: Eliciting uncertainty about variances. Pharm Stat. 2020;19(6):827–839. PMID: 32537910; DOI: 10.1002/pst.2040.

14. Wang, Kulkarni: Evaluating and utilizing probability of study success in clinical development. Clin Trials. 2013;10(3):407–13. PMID: 23471634; DOI: 10.1177/1740774513478229.

15. Chuang-Stein, French, Kirby: A quantitative approach for making Go/No-Go decisions in drug development. Therapeutic Innovation & Regulatory Science. 2011;45:187–202. DOI: 10.1177/009286151104500213.

16. Rosen, Johnson, Kebaabetswe: Process maps in clinical trial quality assurance. Clin Trials. 2009;6(4):373–377. PMID: 19625329; DOI: 10.1177/1740774509338429.

17. Wong, Siah: Estimation of clinical trial success rates and related parameters. Biostatistics. 2019;20(2):273–286. PMID: 29394327; DOI: 10.1093/biostatistics/kxx069; PMCID: 6409418.

18. Getz, Smith, Kravet: Protocol design and performance benchmarks by phase and by oncology and rare disease subgroups. Ther Innov Regul Sci. 2023;57(1):49–56. PMID: 35960455; DOI: 10.1007/s43441-022-00438-5; PMCID: 9373886.

19. Amiri-Kordestani, Fojo: Why do phase III clinical trials in oncology fail so often? J Natl Cancer Inst. 2012;104(8):568–569. PMID: 22491346; DOI: 10.1093/jnci/djs180.

20. Apgar: A proposal for a new method of evaluation of the newborn infant. Curr Res Anesth Analg. 1953;32(4):260–267. PMID: 13083014.

21. Calvin-Lamas, Pita-Fernandez, Pertega-Diaz: A complexity scale for clinical trials from the perspective of a pharmacy service. Eur J Hosp Pharm. 2018;25(5):251–256. PMID: 31157035; DOI: 10.1136/ejhpharm-2017-001282; PMCID: 6452378.

22. Metrics Champion Consortium Protocol Operational Complexity Scoring Tool: Clinical Trial Risk & Performance Management vSummit. 2020; retrieved 3 March 2023.

23. Forbes: Distilling Constituent Symptoms and Patterns of Repetition in the Diagnostic Criteria of the DSM-5. OSF, Web, 2023.

24. Yadav, Kar, Kashiramka: Artificial Intelligence Adoption for FinTech Industries-An Exploratory Study About the Disruptions, Antecedents and Consequences. The Role of Digital Technologies in Shaping the Post-Pandemic World: 21st IFIP WG 6.11 Conference on e-Business, e-Services and e-Society, I3E 2022, Newcastle upon Tyne, UK, September 13–14, 2022, Proceedings. Cham: Springer International Publishing, 2022. DOI: 10.1007/978-3-031-15342-6_1.

25. Chalkidis, Fergadiotis, Malakasiotis: LEGAL-BERT: The muppets straight out of law school. arXiv preprint arXiv:2010.02559, 2020. DOI: 10.48550/arXiv.2010.02559.

26. Matsuda, Ohtomo, Tomizawa: Incorporating Unstructured Patient Narratives and Health Insurance Claims Data in Pharmacovigilance: Natural Language Processing Analysis of Patient-Generated Texts About Systemic Lupus Erythematosus. JMIR Public Health Surveill. 2021;7(6):e29238. PMID: 34255719; DOI: 10.2196/29238; PMCID: 8278300.

27. Fernando, Kumarage, Thiyaganathan: Automated vehicle insurance claims processing using computer vision, natural language processing. 2022 22nd International Conference on Advances in ICT for Emerging Regions (ICTer). IEEE, 2022. DOI: 10.1109/ICTer58063.2022.10024089.

28. Eliot, Dr: Generative pre-trained transformers (GPT-3) pertain to AI in the law. 2021. DOI: 10.2139/ssrn.3974887.

29. Luminance. software, retrieved 27 Feb 2023.

30. Everlaw. software, retrieved 27 Feb 2023.

31. Luo, Thompson, Herr: Natural Language Processing for EHR-Based Pharmacovigilance: A Structured Review. Drug Saf. 2017;40(11):1075–1089. PMID: 28643174; DOI: 10.1007/s40264-017-0558-6.

32. Dutton: Big Pharma Reads Big Data, Sees Big Picture: Linguamatics Brings Natural Language Processing to Non-Experts, Expediting Drug Development. Genet Eng Biotechnol News. 2018;38(1):8–9. DOI: 10.1089/gen.38.01.05.

33. Viswanath, Fennell, Balar: An industrial approach to using artificial intelligence and natural language processing for accelerated document preparation in drug development. J Pharm Innov. 2021;16:302–316. DOI: 10.1007/s12247-020-09449-x.

34. Richard, Reddy: Text classification for clinical trial operations: evaluation and comparison of natural language processing techniques. Ther Innov Regul Sci. 2021;55(2):447–453. PMID: 33125616; DOI: 10.1007/s43441-020-00236-x.

35. Chen, Xie, Cheng: Trends and features of the applications of natural language processing techniques for clinical trials text analysis. Appl Sci. 2020;10(6):2157. DOI: 10.3390/app10062157.

36. Fogel: Factors associated with clinical trials that fail and opportunities for improving the likelihood of success: A review. Contemp Clin Trials Commun. 2018;11:156–164. PMID: 30112460; DOI: 10.1016/j.conctc.2018.08.001; PMCID: 6092479.

37. Chang, Liu, Mitchem: Understanding Common Key Indicators of Successful and Unsuccessful Cancer Drug Trials Using A Contrast Mining Framework on ClinicalTrials.gov. J Biomed Inform. 2023;139:104321. PMID: 36806327; DOI: 10.1016/j.jbi.2023.104321.

38. Wood: Clinical Trial Risk Tool (0.1). Zenodo. [Code], 2023. http://www.doi.org/10.5281/zenodo.7633872

39. Van Rossum, Drake: Python 3 Reference Manual. CreateSpace, 2009.

40. Plotly Technologies Inc: Collaborative data science. 2015.

41. Bird, Klein, Loper: Natural language processing with Python: analyzing text with the natural language toolkit. O’Reilly Media, Inc. 2009.

42. Honnibal, Montani: spaCy 2: Natural language understanding with Bloom embeddings, convolutional neural networks and incremental parsing. 2017.

43. Pedregosa, Varoquaux, Gramfort: Scikit-learn: Machine Learning in Python. JMLR. 2011;12:2825–2830.

44. Merkel: Docker: lightweight Linux containers for consistent development and deployment. Linux Journal, 2014;2014(239):2.

45. Mattmann, Zitting: Tika in action. 2012.

46. SurveyMonkey. software, retrieved 25 March 2022.

47. Tasneem, Aberle, Ananth: The database for aggregate analysis of ClinicalTrials.gov (AACT) and subsequent regrouping by clinical specialty. PLoS One. Database dump taken from, 2012;7(3):e33677. PMID: 22438982; DOI: 10.1371/journal.pone.0033677; PMCID: 3306288.

48. PostgreSQL Global Development Group: PostgreSQL 12.13. 2022.

49. Sharp, Corp: A Single-Dose Clinical Trial to Study the Safety, Tolerability, Pharmacokinetics, and Anti-Retroviral Activity of MK-8591 Monotherapy in Anti-Retroviral Therapy (ART)-Naïve, HIV-1 Infected Patients. In: ClinicalTrials.gov. [cited 21 Dec 2016].

50. Wood, Douglas: Feature weights for protocol informativeness [Data set]. Zenodo. 2023. http://www.doi.org/10.5281/zenodo.7769176

10.21956/gatesopenres.15729.r34936

Reviewer response for version 1

Betina Idnay, Referee

Columbia University, New York, New York, USA

Competing interests: No competing interests were disclosed.

29 September 2023

This is an open access peer review report distributed under the terms of the Creative Commons Attribution Licence, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Recommendation: approve with reservations

In addressing the prevalent issue of trial uninformativeness, the manuscript introduces a browser-based, natural language processing tool designed to identify and quantify the risk of uninformativeness in clinical trials. The tool, initially focusing on human immunodeficiency virus (HIV) and tuberculosis (TB) trials, parses trial protocols, extracts key design features, and inputs them into a risk model, demonstrating high accuracy in identifying various trial conditions and features. Users can interactively upload, visualize, and correct the tool’s interpretations. The study validates the tool’s efficacy using manually tagged datasets and a large dataset from ClinicalTrials.gov, showcasing promising results, such as 100% Area Under Curve (AUC), in identifying the condition of a trial. The tool, open-source and accessible at https://app.clinicaltrialrisk.org, offers significant potential for future expansion to other pathologies and advancement in the field.

The manuscript presents a distinctive and essential contribution to addressing the challenge of trial uninformativeness, a pervasive issue impacting the quality of evidence generated from clinical trials. By focusing on the early identification of risks of uninformativeness in trial protocols, the authors are addressing a critical gap in ensuring the optimal allocation of resources and efforts toward generating high-quality evidence. Developing a browser-based tool using natural language processing is particularly original, as it enables the automated extraction and analysis of key features from unstructured text documents, which is a significant advancement in this field. Furthermore, the tool’s open-source nature and accessibility contribute to its significance by facilitating widespread adoption and adaptation, ultimately aiming to elevate the quality of clinical trials and the evidence they produce. This work is thus both timely and imperative, aligning with the pressing need for high-quality evidence in clinical, policy, and research decisions.

Major comments:

a. Introduction: Several aspects could be further clarified to enhance the reader’s understanding and the overall impact of the manuscript:

Specificity on Commercial Factors: The manuscript mentions that commercial factors can lead to trials ending uninformatively. However, it would benefit the reader if this point could be elaborated on more specifically. Providing concrete examples or detailing how commercial factors contribute to trial uninformativeness would add depth to the discussion and enhance the overall comprehension of the issue.

Clarification on Trial Complexity and Apgar Score: The paragraph discussing trial complexity and its relation to uninformativeness was somewhat perplexing. A more thorough explanation of how trial complexity contributes to uninformativeness would be valuable for the reader. Additionally, mentioning the Apgar score seemed tangential and did not directly contribute to the main discussion. A reconsideration of its inclusion or a more explicit connection to the main topic might be warranted to maintain focus and clarity.

Rationale for Focus on HIV and TB: The manuscript specifies that the tool initially focuses on HIV and tuberculosis trials. It would be enlightening to understand the justification for this specific focus. Clarifying whether this decision was due to the prevalence of these diseases, the availability of data, or other reasons would provide valuable context and strengthen the significance of the work.

b. Methods: several aspects could benefit from further clarification and detail to ensure thorough understanding and reproducibility:

Feature Selection and Subject Matter Expert Analysis: The feature selection exercise needs more explicit detailing. While the features are visualized in a table, mentioning the number and selection criteria in the text would enhance clarity. The analysis of subject matter expert ranking and the criteria for choosing these experts must be made more explicit. Were the experts specialized in HIV/TB or knowledgeable in clinical trial protocol development? Additionally, Table 1 presents identical data in separate categories (i.e., “tertile_of_sample_size by domain by phase” and “Tertile of number of sites by domain by phase”); an explanation for this separation is necessary. Including results in this section seems misplaced and would be more appropriately discussed in the Results section.

Datasets for Training and Validation: The selection process for the protocols used in training and validation is ambiguous. It is unclear whether the manual dataset included protocols from ClinicalTrials.gov and if there was a possibility of duplication. Clarifying whether the datasets were stratified for training and validation or if the manual dataset was for training and AACT for validation is essential for understanding the validation process. How did they test the models?

Annotation of ClinicalTrials.gov Dataset: It needs to be clarified whether the ClinicalTrials.gov dataset was annotated. Explicit mention of the annotation status of this dataset would remove ambiguity and aid in understanding the methodology.

Annotation Process: The statement, “The number of protocols manually annotated per parameter varied between 100 and 300,” is somewhat confusing. More clarity on the number of protocols annotated, the identity and number of annotators, and the annotation process is necessary for reproducibility. Providing the annotation guideline as supplementary material would be highly beneficial, if not essential, for reproducibility.

Machine Learning Models: A detailed breakdown of the machine learning models used, and the rationale behind the chosen score cutoff is needed. Exploring why other ML models were not considered would provide insight into the model selection process.

c. Results: there are a couple of areas where further information and restructuring would enhance clarity and coherence:

User Testing Method and Feedback: This section mentions user testing but needs to detail the method used in the Methods section, making it challenging to understand the context of the results. It would benefit the reader to gain insight into the user testing methods, the feedback received, the subsequent results, and any implications or changes made based on this feedback.

Placement of Validation Content: The subsections titled “Validation,” “Validation scores for the manual dataset,” “Validation scores for ClinicalTrials.gov dataset,” and “Validation scores for Hutchinson et al. dataset” seem to contain content more suited to the Methods section, as they describe the methodology used for validation. It would enhance the clarity and flow of the manuscript if this content were relocated to the Methods section, with the Results section revised to focus solely on presenting and interpreting the validation outcomes rather than detailing the method used. Presenting tables and highlighting the key findings would align with this section's purpose and contribute to a more balanced and informative manuscript.

d. Discussion and Conclusion: Several elements need addressing to ensure clarity, coherence, and comprehensive representation of the study’s findings and implications:

Claim on Tool’s Intuitiveness: The manuscript concludes that the use of the tool is intuitive; however, there is a noticeable absence of supporting results regarding the tool's intuitiveness and usefulness within the Results section. Providing concrete findings or user feedback to substantiate this claim would bolster the credibility of this assertion and give the reader a clearer understanding of the tool's practicality and user-friendliness.

Inclusion of Scenarios: The inclusion of scenarios illustrating the potential applications and usefulness of the tool is a valuable addition to the manuscript. It helps contextualize the tool's practicality and provides insight into its real-world implications, contributing to a more rounded and impactful discussion.

Confidentiality of Clinical Trial Protocols: Clinical trial protocols often contain confidential information, especially for Scenario 3. The manuscript must address how the web application ensures the privacy and security of the uploaded protocols. Clarifying this aspect is crucial for user trust and adherence to data protection regulations.

Placement of Results: Some results appear in the Discussion and Conclusion sections. For clarity and structure, it would be beneficial to relocate these findings to the Results section, ensuring a clear delineation between the presentation of results and their interpretation and implications.

Limitations Section: The manuscript would greatly benefit from a more explicit and organized discussion of the limitations of both the study and the web application. A dedicated subsection detailing the limitations would provide the reader with a balanced view of the research and help contextualize the findings, allowing for a more nuanced understanding of the study's implications and areas for future improvement.

e. Clarity and organisation: The manuscript, while addressing a topic of significant importance and utility, presents several areas where clarity and organization could be enhanced to convey the research’s value and findings better:

Defined Sections and Appropriate Placement: There is a noticeable overlap of content throughout the manuscript, with methods detailed in the results and discussion sections and results interspersed within the discussion. A more precise delineation and structuring of content according to the designated sections would aid in the reader’s comprehension and the logical flow of the manuscript.

Detailing and Specificity: Several instances across the introduction, methods, and results sections indicate a need for more specificity and detailing. Clarifications on commercial factors leading to trial uninformativeness, confidentiality measures, feature selection, user testing methods, and annotation processes would contribute to a more thorough and transparent representation of the research conducted.

User-Friendly Presentation: Given the technical nature of the tool developed, ensuring a user-friendly presentation of information, such as clear and well-ordered tables, is essential. Addressing inconsistencies in table presentation and providing narrative explanations for significance and implications would enhance the manuscript’s accessibility and impact.

Minor comments:

Abstract: The following sentence has a grammatical error (“focused HIV” should read “focused on HIV”): “The tool is focused HIV and tuberculosis trials but could be extended to more pathologies in future.”

Introduction: The acronyms "AI," "HIV," and "PDF" were not spelled out upon their first mention in the text. To ensure clarity and accessibility for all readers, consider providing the full forms of these acronyms at their initial appearance.

Methods: Including a flow diagram of the methods would significantly enhance the reader’s understanding of the research process and contribute to the overall clarity of the manuscript. The acronym "AACT" was not spelled out upon its first mention; please provide its complete form and double-check the other abbreviations.

Figures and Tables: For ease of reference and a smoother reading experience, it is suggested that tables and figures be presented immediately after they are referenced in the text. This adjustment will prevent readers from having to search through the manuscript and will contribute to a more organized and user-friendly presentation.

In conclusion, the manuscript under review presents a commendable effort in addressing the crucial issue of clinical trial uninformativeness through the development of a novel tool. The application of natural language processing in identifying and quantifying risks associated with trial protocols exhibits significant potential to enhance the quality of clinical research. However, a prominent area of concern is the lack of adequate user testing reported in the manuscript. While the authors have undertaken meticulous training and validation of the tool, the inclusion of thorough user testing is paramount to ensuring the tool’s efficacy, user-friendliness, and practical applicability in real-world scenarios. Further, it is imperative that the authors undertake comprehensive testing of the tool, beyond training and validation, to affirm its reliability and robustness. Addressing these aspects, along with the aforementioned comments on clarity, organization, and minor adjustments, will significantly contribute to the manuscript’s coherence, impact, and overall quality, ultimately aiding in the realization of its potential to improve the landscape of clinical trials.

Are the conclusions about the tool and its performance adequately supported by the findings presented in the article?

Is the rationale for developing the new software tool clearly explained?

Yes

Is the description of the software tool technically sound?

Yes

Are sufficient details of the code, methods and analysis (if applicable) provided to allow replication of the software development and its use by others?

Is sufficient information provided to allow interpretation of the expected output datasets and any results generated using the tool?

Reviewer Expertise:

Clinical research in neurological disorders, primarily Alzheimer's and related dementias, clinical research informatics, NLP systems adoption to improve clinical research, clinical research recruitment, protocol development

I confirm that I have read this submission and believe that I have an appropriate level of expertise to confirm that it is of an acceptable scientific standard, however I have significant reservations, as outlined above.