Research Article
Revised

Identification of thresholds for accuracy comparisons of heart rate and respiratory rate in neonates

[version 2; peer review: 2 approved, 1 approved with reservations, 1 not approved]
PUBLISHED 08 Oct 2021

Abstract

Background: Heart rate (HR) and respiratory rate (RR) can be challenging to measure accurately and reliably in neonates. The introduction of innovative, non-invasive measurement technologies suitable for resource-constrained settings is limited by the lack of appropriate clinical thresholds for accuracy comparison studies.
Methods: We collected measurements of photoplethysmography-recorded HR and capnography-recorded exhaled carbon dioxide across multiple 60-second epochs (observations) in enrolled neonates admitted to the neonatal care unit at Aga Khan University Hospital in Nairobi, Kenya. Trained study nurses manually recorded HR, and the study team manually counted individual breaths from capnograms. For comparison, HR and RR also were measured using an automated signal detection algorithm. Clinical measurements were analyzed for repeatability.
Results: A total of 297 epochs across 35 neonates were recorded. Manual HR showed a bias of -2.4 (-1.8%) and a spread between the 95% limits of agreement (LOA) of 40.3 (29.6%) compared to the algorithm-derived median HR. Manual RR showed a bias of -3.2 (-6.6%) and a spread between the 95% LOA of 17.9 (37.3%) compared to the algorithm-derived median RR, and a bias of -0.5 (-1.1%) and a spread between the 95% LOA of 4.4 (9.1%) compared to the algorithm-derived RR count. Manual HR and RR showed repeatability of 0.6 (interquartile range (IQR) 0.5-0.7), and 0.7 (IQR 0.5-0.8), respectively.
Conclusions: Appropriate clinical thresholds should be selected a priori when performing accuracy comparisons for HR and RR. Automated measurement technologies typically use a smoothing or averaging filter, which significantly impacts accuracy. A wider spread between the LOA, as much as 30%, should be considered to account for the observed physiological nuances and within- and between-neonate variability and different averaging methods. Wider adoption of thresholds by data standards organizations and technology developers and manufacturers will increase the robustness of clinical comparison studies.

Keywords

neonatal vital sign measurement, monitoring, heart rate, respiratory rate, accuracy, validation

Revised Amendments from Version 1

Based on helpful feedback from external reviewers, we have updated our manuscript to clarify the methods we used to synchronize the heart rate and respiratory rate data, along with the aims of the study and an updated Figure 2 to include 95% confidence intervals for the upper and lower limits of agreement.

See the authors' detailed response to the review by Gordon B. Drummond
See the authors' detailed response to the review by Kevin Baker
See the authors' detailed response to the review by AbdelKebir Sabil

Introduction

There is a high risk of mortality during the neonatal period, particularly in resource-constrained settings [1]. Continuous monitoring of neonatal vital signs enables early detection of physiological deterioration and potential opportunities for lifesaving interventions [2-4]. The development of new, innovative, non-invasive, multiparameter continuous physiological monitors specifically for neonates offers the promise of improving clinical outcomes in this vulnerable population. However, before use, these technologies should be tested in real-world situations to determine accuracy and clinical feasibility.

A neonate's marked physiological variability, small size, and often fragile condition can pose challenges when measuring and monitoring vital signs. A lack of neonatal clinical validation standards further undermines the development of continuous monitors clinically validated specifically for neonates. Determining the accuracy of new continuous monitors is an essential step in bringing these technologies to market [5,6].

The Evaluation of Technologies for Neonates in Africa (ETNA) platform aims to independently establish the accuracy and feasibility of novel continuous monitors suitable for use in neonates in resource-constrained settings [7]. To determine accuracy and agreement, new technologies are compared against existing reference methods or technologies [8]. Before the comparison process can proceed, a clinical reference verification step is necessary to determine appropriate accuracy thresholds [7]. These a priori thresholds determine the target level of agreement required and thus the success or failure of an investigational technology. This study describes the verification processes we conducted with a clinical reference technology in order to determine appropriate heart rate (HR) and respiratory rate (RR) accuracy thresholds to use in subsequent accuracy comparisons of new continuous monitors.

Methods

Study design

This was a cross-sectional study which aimed to identify the natural variation in neonatal HR and RR in order to identify appropriate accuracy thresholds for use in an accuracy comparison of continuous monitors.

Setting and participants

Study participants were neonates admitted for observation and care in the maternity ward, neonatal intensive care, and the neonatal high dependency units at Aga Khan University Hospital in Nairobi, Kenya (AKUHN). Between June and August 2019, caregivers were approached, recruited, and sequentially screened for enrolment by trained study staff during routine newborn intake procedures. To minimize potential selection bias, caregivers were approached sequentially, as far as possible, and introduced to the study using a standardized recruitment script. Final eligibility determination was dependent on medical history results, physical examination, an appropriate understanding of the study by the caregiver, and completion of the written informed consent process (Table 1).

Table 1. Study eligibility criteria and definitions.

Eligibility criteria

Inclusion criteria
   •   Male or female neonate, corrected age of <28 days
   •   Willingness and ability of neonate’s caregiver to provide informed consent and to be available for follow-up for the planned duration of the study

Exclusion criteria
   •   Receiving mechanical ventilation or continuous positive airway pressure
   •   Skin abnormalities in the nasopharynx and/or oropharynx
   •   Contraindication to the application of skin sensors
   •   Known arrhythmia
   •   Any medical or psychosocial condition or circumstance that, in the opinion of the investigators, would interfere with the conduct of the study or for which study participation might jeopardize the neonate’s health

Study definitions

Epoch: A 60-second period of time
Heartbeat: One pulsation of the heart, including one complete contraction and dilatation
Heart rate (HR): Number of heartbeats within an epoch
Breath: One cycle of inhalation and exhalation
Breath duration: Length of time from the start to the end of a single breath
Respiratory rate (RR): Number of breaths initiated within an epoch
Pulse oximetry signal quality index (PO-SQI): Automated indicator of signal quality from the plethysmographic recording
CO2-SQI: Algorithm-defined indicator of signal quality from the capnography channel
Accuracy: The closeness of a measured value to the true value
Repeatability: The closeness of the results of successive measurements of the same measure
Agreement (between measures): The consistency between two sets of measurements
Accuracy threshold: A pre-specified value used to determine if a set of measurements has achieved a sufficient accuracy when compared with a reference value
Precision: The closeness of measurements to each other

Study procedures

The Masimo Rad-97 Pulse CO-Oximeter® with NomoLine Capnography (Masimo Corporation, Irvine, CA, USA) was selected as the reference technology based on validated oxygen saturation (SpO2) accuracy measurement in neonates [9-11]. During study participation, trained and experienced study nurses attached the Rad-97 to neonates and conducted manual HR measurements (counting over 60-second epochs) every 10 minutes for the first hour and once per hour of participation thereafter, following World Health Organization (WHO) guidance for HR measurement in neonates [12]. Photoplethysmographic HR was also measured via the Masimo Rad-97 pulse oximetry skin sensor attached to the neonate’s foot. RR was measured by capnography using an infant/pediatric nasal cannula to collect the neonate’s exhaled carbon dioxide (CO2) levels. The duration of data collection was set at a minimum of one hour, with no upper limit. Neonates exited from the study upon discharge from the ward or by caregiver request.

Data collection and analysis

Using a custom Android (Google, Mountain View, CA, USA) application, raw data were collected from the Masimo Rad-97 in real-time through a universal serial bus (USB) asynchronous connection and parsed in C (Dennis Ritchie & Bell Labs, USA). Instantaneous HR was obtained from the timing of the pulse oximetry signal quality index (PO-SQI). The plethysmogram waveform was sampled at 62.5 Hz with the PO-SQI identified by the Masimo Rad-97 at the peak of each heartbeat. The CO2 waveform was sampled at approximately 20 Hz from the capnography channel. The parsed output included an accurate time stamp for each entry in the waveform data output to facilitate synchronization and analysis. Data were recorded and stored on a secure AKUHN-hosted REDCap server [13].
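As a rough illustration of how an instantaneous HR can be derived from beat timing, the sketch below converts the sample positions of detected beats into beat-to-beat rates and an epoch median. The function names and the input format (a list of sample indices at which a beat marker occurred) are assumptions made for illustration and are not taken from the study's C parser; only the 62.5 Hz sampling rate comes from the text.

```python
from statistics import median

PLETH_SAMPLE_RATE_HZ = 62.5  # plethysmogram sampling rate stated in the text


def instantaneous_hr(beat_sample_indices, sample_rate_hz=PLETH_SAMPLE_RATE_HZ):
    """Return beat-to-beat heart rates (bpm) from the sample indices of beats.

    beat_sample_indices is a hypothetical input: the positions in the
    waveform at which a beat marker was reported, in increasing order.
    """
    rates = []
    for previous, current in zip(beat_sample_indices, beat_sample_indices[1:]):
        interval_s = (current - previous) / sample_rate_hz
        if interval_s > 0:  # skip duplicated or out-of-order markers
            rates.append(60.0 / interval_s)
    return rates


def median_epoch_hr(beat_sample_indices, sample_rate_hz=PLETH_SAMPLE_RATE_HZ):
    """Median of the beat-to-beat rates in an epoch, or None if fewer than two beats."""
    rates = instantaneous_hr(beat_sample_indices, sample_rate_hz)
    return median(rates) if rates else None


# Beats 25 samples apart are 0.4 s apart at 62.5 Hz, i.e. about 150 bpm.
example_beats = [0, 25, 50, 75, 100]
print(median_epoch_hr(example_beats))
```

At 62.5 Hz each sample spans 16 ms, so a one-sample timing error at 150 bpm shifts a single beat-to-beat rate by roughly 6 bpm; taking the median over an epoch reduces the influence of such isolated errors.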

We analyzed the CO2 waveform data using a breath detection algorithm developed in MATLAB (MathWorks, USA) and based on adaptive pulse segmentation [14]. In addition to providing a RR, the algorithm analyzed the waveform’s shape and identified the breath duration (waveform trough to trough) for each breath. From the breath duration, we calculated a RR based on the median breath duration within the epoch. We developed a custom capnography quality score (CO2-SQI) based on capnography features to assist with data selection. HR and RR counts and medians, along with signal quality metrics from the MATLAB signal detection algorithm, were analyzed using R version 4.0.3 [15]. Capnogram waveforms were generated with two seconds added at the beginning and end of each epoch to facilitate manual breath counting within the epoch.
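The two algorithm-derived RR values described above (a rate based on the median breath duration and a count of breaths in the epoch) can diverge when breathing is irregular. The following is a minimal sketch of both calculations, assuming breath durations and breath start times in seconds are already available; the function names are illustrative and this is not the study's MATLAB algorithm.

```python
from statistics import median


def rr_from_median_duration(breath_durations_s):
    """Respiratory rate (breaths/min) implied by the median breath duration."""
    if not breath_durations_s:
        return None
    return 60.0 / median(breath_durations_s)


def rr_from_count(breath_start_times_s, epoch_start_s=0.0, epoch_length_s=60.0):
    """Number of breaths initiated within the epoch."""
    epoch_end_s = epoch_start_s + epoch_length_s
    return sum(1 for t in breath_start_times_s if epoch_start_s <= t < epoch_end_s)


# Hypothetical irregular epoch: 30 fast breaths (1 s each), then 10 slow breaths (3 s each).
durations = [1.0] * 30 + [3.0] * 10
starts = []
elapsed = 0.0
for duration in durations:
    starts.append(elapsed)
    elapsed += duration

print(rr_from_median_duration(durations))  # rate implied by the typical breath
print(rr_from_count(starts))               # breaths actually initiated in 60 s
```

In this constructed example the median-based rate is 60 breaths per minute while only 40 breaths were initiated in the epoch, which illustrates why the choice of averaging method matters when comparing against a manual count.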

To ensure temporal alignment between measurements, HR and RR epochs were synchronized across source data devices. For HR, alignment was done using a timestamp in REDCap that was set by the study nurse as HR counting was initiated. Before analysis, this timestamp was synchronized with the same timestamp in the custom Android application. Both the REDCap and Android servers were connected via the internet to a Network Time Protocol (NTP) server. Alignment of RR epochs was based on the Android application timestamp. All RR waveforms were compared visually to further ensure epoch synchronization.

One of the authors (JMA, a pediatric anesthesiologist) reviewed the capnogram tracings and discarded plots with marked variability or a significant duration of an artifact that would have made breaths challenging to count. The remaining plots were provided to two trained observers to independently count all breaths within each epoch using a set of predefined rules created by the investigators (Table 2). The two independent counts were averaged, and if the number of breaths counted by the two observers varied by more than three breaths per epoch, a third trained observer independently counted the plot, and the two closest counts were averaged.
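The reconciliation rule for manual breath counts can be written down compactly. The sketch below is one reading of that rule; the function name is hypothetical, and the tie-breaking behaviour (when the third count is equally distant from both original counts, the lower pair is used) is an assumption that the text does not specify.

```python
def reconcile_counts(count_a, count_b, count_c=None, max_difference=3):
    """Combine independent breath counts for one epoch.

    If the first two counts differ by no more than max_difference, return
    their mean. Otherwise a third count is required and the mean of the two
    closest counts is returned.
    """
    if abs(count_a - count_b) <= max_difference:
        return (count_a + count_b) / 2
    if count_c is None:
        raise ValueError("counts differ by more than the limit; a third count is needed")
    low, mid, high = sorted([count_a, count_b, count_c])
    # After sorting, the closest pair is always an adjacent pair.
    # Assumption: on a tie, the lower pair is chosen.
    if mid - low <= high - mid:
        return (low + mid) / 2
    return (mid + high) / 2


print(reconcile_counts(46, 48))      # within three breaths: mean of the two counts
print(reconcile_counts(40, 48, 47))  # third observer: mean of the two closest counts
```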

Table 2. Rules for identifying breaths based on graphical waveform plots.

1. Count peaks of the waveform that are within the white background. Ignore peaks that are within the grey background on either side of the image.
2. A peak should be counted as a breath when the peak of the waveform is above 15 mmHg, the lower horizontal blue line.
3. If the peak does not reach the lower horizontal blue line at 15 mmHg, to be counted as a breath, the peak should reach at least 50% of the mean peak.
4. The waveform should dip down to the normal baseline (either below 15 mmHg, the lower horizontal blue line, or based on other breaths). If the waveform does not reach below this point, then this is considered part of the same (double) peak and only counted as a breath once.

Measurement repeatability was estimated using linear mixed-effects models based on the between- and within-neonate variability for each data source using R version 4.0.3 [16]. Agreement between data collection methods was assessed using the method described by Bland and Altman for replicated observations and reported as a mean bias with 95% confidence intervals (CIs), 95% upper and lower limits of agreement (LOA), and as a root mean square deviation (RMSD) [17]. The aim was to identify practical threshold limits using data from the clinical reference technology verification process.
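To make the agreement statistics concrete, the sketch below computes a bias, 95% LOA, the spread between the LOA, and the RMSD for paired measurements. It is a simplified illustration: it treats every pair as independent, whereas the analysis described above used the Bland-Altman method for replicated observations, which accounts for repeated measurements within a neonate. Normalizing by the mean of all paired values is also an assumption, since the text does not state the denominator used for the normalized percentages.

```python
import math
from statistics import mean, stdev


def bland_altman(method_a, method_b):
    """Simple (non-replicated) Bland-Altman summary for paired measurements."""
    if len(method_a) != len(method_b) or len(method_a) < 2:
        raise ValueError("need two equally long series with at least two pairs")
    differences = [a - b for a, b in zip(method_a, method_b)]
    bias = mean(differences)
    sd = stdev(differences)  # sample standard deviation of the differences
    lower = bias - 1.96 * sd
    upper = bias + 1.96 * sd
    rmsd = math.sqrt(mean([d * d for d in differences]))
    # Assumed normalization: percentage of the mean of all paired values.
    overall_mean = mean([(a + b) / 2 for a, b in zip(method_a, method_b)])
    return {
        "bias": bias,
        "lower_loa": lower,
        "upper_loa": upper,
        "spread": upper - lower,
        "rmsd": rmsd,
        "spread_pct": 100.0 * (upper - lower) / overall_mean,
    }


# Hypothetical manual vs. algorithm-derived heart rates (bpm) for four epochs.
manual = [130, 140, 150, 160]
algorithm = [132, 141, 149, 163]
for name, value in bland_altman(manual, algorithm).items():
    print(f"{name}: {value:.2f}")
```

A threshold such as a 30% spread between the LOA would then be checked against the `spread_pct` value of a comparison, under the same normalization assumption.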

Sample size

We estimated that 20 neonates with ten replications each would give a 95% CI LOA between two methods of +/-0.76 times the standard deviation (SD) of their differences. Sample size estimates for method comparison studies typically depend on the CI required around the LOA, and sample sizes of 100 to 200 provide tight CIs [17]. We aimed for a sample size of at least 30 neonates to ensure a diverse population and sufficient replications for tight CIs.
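The +/-0.76 SD figure is consistent with the commonly used Bland-Altman approximation in which the standard error of a limit of agreement is about sqrt(3/n) times the SD of the differences. The sketch below applies that approximation; it treats n as the number of independent subjects and ignores replication, so it is only an assumption about how the estimate may have been obtained.

```python
import math


def loa_ci_half_width(n, z=1.96):
    """Approximate 95% CI half-width of a limit of agreement, in SD units.

    Uses the approximation SE(LOA) ~ sqrt(3 / n) * SD of the differences.
    """
    if n < 2:
        raise ValueError("n must be at least 2")
    return z * math.sqrt(3.0 / n)


for n in (20, 30, 100, 200):
    print(n, round(loa_ci_half_width(n), 2))
```

For n = 20 this gives approximately 0.76, and the half-width shrinks with the square root of n, which is why sample sizes of 100 to 200 are described as giving tight CIs.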

Ethical approval

The study was conducted per the International Conference on Harmonisation Good Clinical Practice and the Declaration of Helsinki 2008. The protocol and other relevant study documents were approved by Western Institutional Review Board (20191102; Puyallup, Washington, USA), Aga Khan University Nairobi Research Ethics Committee (2019/REC-02 v2; Nairobi, Kenya), Kenyan Pharmacy and Poisons Board (19/05/02/2019(078)) and Kenyan National Commission for Science, Technology and Innovation (NACOSTI/P/19/68024/30253). Written informed consent was obtained in English or Swahili by trained study staff from each neonate’s caregiver according to a checklist that included ascertainment of caregiver comprehension.

Results

Between June and August 2019, 35 neonates were enrolled, and 297 clinical observations were completed with a mean of 8.4 (SD 1.7) observations per neonate (Table 3; Figure 1) and a median data collection time of 4 hours, 5 minutes (interquartile range (IQR) 3:52-4:45) [18]. The manual HR measurements were found to have a non-normal distribution with skewness of 0.76 and kurtosis of 3.60 (p<0.001). The median manual HR measurement for all observations was 134 (IQR 126-143) beats per minute (bpm).

Table 3. Neonate demographic data.

Sex: Female 22; Male 13; Other 0
Age at participation (days): Median 2 (IQR 0-4)
Gestation at birth (weeks): Median 33 (Range 32-34)
Weight at birth (grams): Median 1500 (IQR 1260-1600)

Figure 1. Recruitment flow chart.

The manual HR demonstrated a negative bias of -2.4 (-1.8%) compared to the median PO-SQI HR, and a marked spread between the 95% LOA of 40.3 (29.6%). The RMSD was 10.5 (7.7%). Removing data from a single outlier neonate resulted in a smaller bias of -1.4 (-1.0%), a tighter spread between the 95% LOA of 24.7 (18.2%), and a lower RMSD of 6.4 (4.7%) (Table 4; Figure 2).

Table 4. Bland-Altman analysis of heart rate (HR) and respiratory rate (RR) methods.

Comparison | Bias (normalized) | 95% lower/upper limits of agreement | Spread of 95% limits of agreement (normalized) | Root-mean-square deviation (normalized)

Heart rate
Manual HR vs median pulse oximetry signal quality index HR | -2.39 (-1.8%) | -22.53/17.74 | 40.27 (29.6%) | 10.5 (7.7%)
Manual HR vs median pulse oximetry signal quality index HR (outlier neonate removed) | -1.4 (-1.0%) | -13.71/10.97 | 24.67 (18.2%) | 6.4 (4.7%)

Respiratory rate
Manual RR vs algorithm-derived median RR | -3.16 (-6.6%) | -12.1/5.8 | 17.9 (37.3%) | 5.5 (11.4%)
Manual RR vs algorithm-derived RR count | -0.52 (-1.1%) | -2.7/1.66 | 4.37 (9.1%) | 1.2 (2.5%)

Figure 2. Bland-Altman plots comparing manual heart rate (HR) vs median pulse oximetry signal quality index (PO-SQI) HR for all epochs (A), modified manual HR vs median PO-SQI HR with PTID9 removed due to significant outliers (B), manual respiratory rate (RR) vs algorithm-derived median RR (C), and manual RR vs algorithm-derived RR count (D).

Moderate repeatability was demonstrated with approximately 62% (95% CI 47%-73%) of the manual HR variability being due to differences between neonates (Table 5, Figure 3A). Since the 95% CI for manual HR crossed 50%, the between- and within-neonate variability appeared to be comparable, with neither causing significantly more variability than the other.

Table 5. Repeatability results for heart rate (HR) and respiratory rate (RR) measurements for all included epochs.

Measurement | Repeatability* (95% confidence interval)

Heart rate (n=297 epochs)
Manual HR | 0.62 (0.47-0.73)
Median pulse oximetry signal quality index HR | 0.75 (0.62-0.83)

Respiratory rate (n=130 epochs)
Manual RR | 0.66 (0.47-0.79)
Algorithm-derived median RR | 0.50 (0.28-0.67)
Algorithm-derived RR count | 0.66 (0.46-0.79)

* Repeatability = (between-neonate variance/(between-neonate variance + within-neonate variance))


Figure 3. Variability plots (vertical for between-neonate variability, horizontal for within-neonate variability).

Manual heart rate (HR) between-neonate variability accounts for 62% of total variability (A); median pulse oximetry signal quality index (PO-SQI) HR between-neonate variability accounts for 75% of total variability (B); manual respiratory rate (RR) between-neonate variability accounts for 66% of total variability (C); algorithm-derived median RR between-neonate variability accounts for 50% of total variability (D); and algorithm-derived RR count between-neonate variability accounts for 66% of total variability (E).

Manual RR measurements from capnograms were found to have a non-normal distribution with skewness of 0.61 and kurtosis of 2.96 (p=0.027). The median manual RR measurement for all observations was 47 (IQR 39-56) breaths per minute. The manual RR compared to the algorithm-derived median RR showed a negative bias of -3.2 (-6.6%) and a marked spread between the 95% LOA of 17.9 (37.3%). The RMSD was 5.5 (11.4%). Comparing the manual RR to the algorithm-derived RR count showed a smaller bias of -0.5 (-1.1%) and a tighter spread between the 95% LOA of 4.4 (9.1%). The RMSD was 1.2 (2.5%).

The repeatability was moderate with approximately 66% (95% CI 47%-79%) of the manual RR variability due to differences between neonates (Table 5, Figure 3C). Since the 95% CI crossed 50%, the amount of between- and within-neonate variability appeared similar, with neither one resulting in significantly more variability than the other.
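The repeatability ratio defined under Table 5 (between-neonate variance divided by the sum of between- and within-neonate variance) can be illustrated with a simple variance-components calculation. The sketch below uses a one-way ANOVA (method-of-moments) estimator for balanced data, which is a swapped-in simplification: the study itself fitted linear mixed-effects models in R, and the function name and example values here are hypothetical.

```python
from statistics import mean


def repeatability(measurements_by_neonate):
    """Between-neonate share of total variance for balanced repeated measures.

    measurements_by_neonate: list of lists, one inner list per neonate,
    all of the same length (at least two measurements each).
    Returns None when there is no variability at all.
    """
    n = len(measurements_by_neonate)
    k = len(measurements_by_neonate[0])
    if n < 2 or k < 2 or any(len(g) != k for g in measurements_by_neonate):
        raise ValueError("need at least two neonates with equal numbers (>= 2) of measurements")
    neonate_means = [mean(g) for g in measurements_by_neonate]
    grand_mean = mean(neonate_means)  # equals the overall mean for balanced data
    ms_between = k * sum((m - grand_mean) ** 2 for m in neonate_means) / (n - 1)
    ms_within = sum(
        (x - m) ** 2 for g, m in zip(measurements_by_neonate, neonate_means) for x in g
    ) / (n * (k - 1))
    var_within = ms_within
    var_between = max((ms_between - ms_within) / k, 0.0)  # truncate negative estimates at zero
    total = var_between + var_within
    return var_between / total if total > 0 else None


# Hypothetical heart rates (bpm): three neonates, three epochs each.
example = [[120, 122, 124], [140, 142, 144], [160, 162, 164]]
print(round(repeatability(example), 3))
```

A value near 1 means most variability comes from differences between neonates, while a value near 0.5, as seen for several measures in Table 5, means within-neonate variability is about as large as between-neonate variability.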

Discussion

This reference technology clinical verification study showed minimal measurement bias with a wide spread of 95% upper and lower LOAs and similar repeatability compared with manual clinical measurements. The agreement results allowed us to identify practical HR and RR thresholds for our subsequent technology comparison evaluation. Specifically, we identified a 30% spread between the 95% upper and lower LOA. These a priori-defined thresholds were based on variability observed ten and sixty minutes apart in the same neonate and considered the natural within-neonate physiologic variability. Variability was found to be more marked in some neonates. In part, the 30% spread between 95% upper and lower LOA was selected based on the idea that thresholds should not be more stringent than the observed physiological variability, and in part, based on results from the different averaging methods (manual RR vs algorithm-derived median RR). Given the large difference in results between the two averaging methods, considerable thought should be given prior to choosing an averaging method. A random selection of real clinical data can provide appropriate guidance for selecting suitable neonatal accuracy thresholds.

Of note, one neonate (PTID9) significantly impacted the LOA for HR. Five of nine of this neonate’s manual HR measurements significantly diverged from the same epoch’s PO-SQI HR values and were significantly lower than their mean PO-SQI HR, despite having acceptable signal quality scores. This irregularity suggests a HR reading or data entry error by the study nurse. Removing this neonate’s data and re-analyzing it resulted in a smaller bias and tighter LOAs (Figure 2B).

Results from this clinical verification highlight the difficulty with existing performance thresholds. Current United States Food and Drug Administration performance thresholds for HR measurement, based on electrocardiogram measurements, may not be applicable for use in neonates or when using photoplethysmography for estimating HR [19]. The current UNICEF target product profile for RR measurement technology recommends a ±2 breaths per minute threshold, which may be too stringent even for use in adults [20,21]. Using a ±2 breaths per minute recommendation with our RR data would result in a LOA spread threshold of no more than 5%, which is half the LOA spread of our best performing RR comparison. Furthermore, a ±2 breaths per minute or 5% spread in LOA is smaller than random and natural within-neonate physiologic variability (11.5% in this study [unpublished data]) and would result in unrealistically stringent thresholds.

Selecting a performance threshold is challenging. The threshold cannot be too restrictive or inflexible, thereby stifling innovation and preventing new single or multi-parameter continuous monitors from reaching the market. However, too lax a threshold could result in an inaccurate representation of the underlying physiological state. One key limitation is that the true underlying HR or RR is unknown, regardless of the measurement method [6,22]. The primary goal of this reference technology verification study was to establish a priori thresholds as the first step of our technology comparison evaluation while at the same time understanding that the true underlying RR and HR cannot be known and also recognizing the marked physiologic variability between and within neonates.

In this study, we did not attempt to define or detect clinically meaningful events; instead, we focused on describing non-random thresholds that fall outside of normal physiological variability. We defined HR and RR thresholds based on the difference between the 95% upper and lower LOA. Additional studies will be required to determine if these thresholds translate into improved clinical outcomes.

Performance thresholds identified using this method are influenced by the characteristics of the neonates studied, the data selection methods, and the number of comparisons. For this reason, the thresholds we identified may not be applicable in different neonate cohorts, such as those receiving mechanical ventilation or immediately following birth, among others. Variability will be influenced by disturbances in the environment such as routine procedures, feeding, noise, and time of day. To minimize variability in our data set, we used only RR epochs that appeared to be regular based on visual inspection. Although these segments were selected based on predefined criteria, a majority (167/297) were discarded as the extreme variability seen in some recordings would have made reproducible manual counting of breaths impossible. We have previously demonstrated acceptable agreement between electrocardiogram (ECG)-derived heart rate variability (HRV) and photoplethysmography (PPG)-derived HRV in children with an appropriate sampling rate of the PPG. This should be validated in neonates using an ECG [23].

Conclusion

Appropriate clinical thresholds should be selected a priori when performing accuracy comparisons for HR and RR. The magnitude and importance of sample size, as well as of within-neonate variability, require further investigation. A larger sample size could allow the development of an error model that more clearly describes the error due to various factors such as the measurement technology, averaging method, the observer, and the natural variability of neonatal HR and RR. We strongly support the creation of international standards for technology comparison studies in neonates. These standards should include thresholds for HR and RR based on the specific neonatal population studied and provide details of the experimental conditions, data selection methods, and analysis methods used. Together, such standards would lay the groundwork for a robust continuous monitor comparison field.

Data availability

Underlying data

Dryad: Identification of thresholds for accuracy comparisons of heart rate and respiratory rate in neonates. https://doi.org/10.5061/dryad.1c59zw3vb [18].

This project contains the following underlying data:

  • Coleman-2021-ETNA-DemographicData.csv
  • Raw data folder (contains all raw capnography and pleth data)
  • Coleman-2021-ETNA-ProcessedPulseValues.csv
  • Coleman-2021-ETNA-ProcessedRespirationValues.csv

Data are available under the terms of the Creative Commons Zero "No rights reserved" data waiver (CC0 1.0 Public domain dedication).

how to cite this article
Coleman J, Ginsburg AS, Macharia WM et al. Identification of thresholds for accuracy comparisons of heart rate and respiratory rate in neonates [version 2; peer review: 2 approved, 1 approved with reservations, 1 not approved]. Gates Open Res 2021, 5:93 (https://doi.org/10.12688/gatesopenres.13237.2)
NOTE: If applicable, it is important to ensure the information in square brackets after the title is included in all citations of this article.