
Preliminary study of machine learning in the screening of proteinuria in rural areas of Shanxi province
Lu Yuanyue (芦园月), Li Ziliang (李子良), Li Wangxin (李旺鑫), Liu Yanqin (刘艳琴), Li Rongshan (李荣山), Zhou Xiaoshuang (周晓霜)
Objective To estimate the prevalence of proteinuria in rural areas of Shanxi province and to construct a machine learning-based risk prediction model for proteinuria.
Methods This was a cross-sectional survey. Using multi-stage stratified sampling, residents aged ≥30 years in rural areas of eight prefecture-level cities of Shanxi province (Taiyuan, Yangquan, Linfen, Yuncheng, Lüliang, Jinzhong, Jincheng and Xinzhou) were screened from April to November 2019, and questionnaire, physical examination and laboratory data were collected. Proteinuria was defined as a urine albumin/creatinine ratio ≥30 mg/g, and its prevalence was calculated. Participants were divided into a proteinuria group and a non-proteinuria group. Binary classification models (proteinuria versus no proteinuria) were built with a stacking-based logistic regression ensemble algorithm (SE-LR), logistic regression, support vector machine, decision tree, random forest and extreme gradient boosting. Model performance was evaluated with the area under the receiver operating characteristic curve (AUC), precision, recall and weighted F1 score. Finally, the importance of the predictive features was ranked for the model with the best overall performance.
Results A total of 8 869 rural residents were included, aged (58.59±9.49) years; 3 872 (43.66%) were male and 4 997 (56.34%) were female. The prevalence of proteinuria was 13.49% (1 196/8 869). Blood pressure, pulse, body mass index, waist circumference, proportion with overweight or obesity, proportion with hypertension, proportion with moderate-to-heavy salt intake, glycated hemoglobin, urine pH, urine specific gravity, proportion positive for urine occult blood, proportion positive for urine glucose, proportion positive for urine ketones, proportion with urine red blood cell count ≥5/μl, proportion with urine white blood cell count ≥10/μl and urinary α1-microglobulin were all higher in the proteinuria group than in the non-proteinuria group, whereas the proportions with lack of exercise and with a drinking history were lower (all P<0.05). Among the models evaluated, SE-LR had the best overall performance: its AUC (0.736, 95% CI 0.719-0.746) was slightly lower than that of the logistic regression model (0.745, 95% CI 0.680-0.762), but its precision (0.844), recall (0.621) and weighted F1 score (0.801) were the highest. In the SE-LR model, the 10 most important features were, in order, urinary α1-microglobulin, urine occult blood, urine glucose, urine pH, smoking, overweight or obesity, body mass index, total cholesterol, glycated hemoglobin and hypertension.
Conclusions The prevalence of proteinuria is high in rural areas of Shanxi province. A risk prediction model built with machine learning can predict the risk of proteinuria and identify its risk factors, and can to some extent provide a scientific basis for disease prevention, intervention and treatment in community and clinical settings.
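The case definition and the headline prevalence figure can be reproduced with a few lines of Python. This is a minimal sketch, not the authors' code: the function names are hypothetical, the input units (albumin in mg/L, creatinine in g/L) are an assumption chosen so that the ratio comes out in mg/g, and the Wilson confidence interval is an addition for illustration that the abstract does not report.

```python
from math import sqrt

ACR_THRESHOLD_MG_PER_G = 30.0  # cut-off stated in the Methods


def has_proteinuria(albumin_mg_per_l, creatinine_g_per_l):
    """Apply the ACR >= 30 mg/g definition (input units are an assumption)."""
    acr = albumin_mg_per_l / creatinine_g_per_l  # mg of albumin per g of creatinine
    return acr >= ACR_THRESHOLD_MG_PER_G


def prevalence(cases, total):
    """Proportion of screened participants who meet the case definition."""
    return cases / total


def wilson_interval(cases, total, z=1.96):
    """Wilson score interval for a proportion (not reported in the abstract)."""
    p = cases / total
    denom = 1 + z * z / total
    centre = (p + z * z / (2 * total)) / denom
    half = z * sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    return centre - half, centre + half


# Counts taken from the Results: 1 196 cases among 8 869 participants.
p = prevalence(1196, 8869)
low, high = wilson_interval(1196, 8869)
print(f"prevalence = {p:.2%}")  # prevalence = 13.49%
print(f"95% Wilson interval = ({low:.4f}, {high:.4f})")
```

The printed prevalence matches the 13.49% given in the Results; the interval is only an indication of sampling uncertainty and ignores the multi-stage stratified design.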
Keywords: Proteinuria / Machine learning / Kidney diseases / Risk factors / Shanxi province
Table 1 Comparison of baseline characteristics between the proteinuria and non-proteinuria groups
Characteristic | Total (n=8 869) | Non-proteinuria group (n=7 673) | Proteinuria group (n=1 196) | Statistic (χ²/t/Z) | P value |
---|---|---|---|---|---|
Female [n (%)] | 4 997(56.34) | 4 317(56.26) | 680(56.86) | 0.148 | 0.700 |
Age (years) | 58.59±9.49 | 58.62±9.50 | 58.40±9.47 | -0.752 | 0.443 |
Systolic blood pressure (mmHg) | 134.00(122.50,147.50) | 133.50(122.00,147.50) | 136.00(124.50,147.63) | -2.277 | 0.023 |
Diastolic blood pressure (mmHg) | 81.5(76.5,89.5) | 81.5(76.5,89.5) | 83.0(77.0,90.0) | -2.875 | 0.004 |
Pulse (beats/min) | 75.0(69.0,82.0) | 75.0(69.0,82.0) | 75.5(70.0,83.0) | -2.294 | 0.022 |
Body mass index (kg/m²) | 24.70(22.66,26.89) | 24.49(22.52,26.67) | 25.95(23.74,28.30) | -12.741 | <0.001 |
Waist circumference (cm) | 85(80,90) | 84(80,90) | 86(80,93) | -5.832 | <0.001 |
Lack of exercise [n (%)] | 5 139(57.94) | 4 485(58.45) | 654(54.68) | 6.033 | 0.014 |
Obesity status [n (%)] | | | | -6.771 | <0.001 |
Normal | 3 646(41.11) | 3 316(43.22) | 330(27.59) | | |
Obese | 1 474(16.62) | 1 152(15.01) | 322(26.92) | | |
Overweight | 3 749(42.27) | 3 205(41.77) | 544(45.48) | | |
Stroke [n (%)] | 1 085(12.23) | 936(12.20) | 149(12.46) | 0.065 | 0.799 |
Hyperglycemia [n (%)] | 411(4.63) | 353(4.60) | 58(4.86) | 0.145 | 0.703 |
Hypertension [n (%)] | 3 903(44.01) | 3 330(43.40) | 573(47.91) | 8.544 | 0.003 |
History of cerebrovascular disease [n (%)] | 554(6.25) | 483(6.29) | 71(5.94) | 0.227 | 0.634 |
Diabetes mellitus [n (%)] | 843(9.51) | 718(9.36) | 125(10.45) | 1.440 | 0.230 |
Atrial fibrillation or valvular heart disease [n (%)] | 51(0.58) | 43(0.56) | 8(0.67) | 0.213 | 0.644 |
Coronary heart disease [n (%)] | 786(8.86) | 681(8.88) | 105(8.78) | -0.109 | 0.913 |
Smoking history [n (%)] | 1 909(21.52) | 1 677(21.86) | 232(19.40) | 3.701 | 0.054 |
Drinking history [n (%)] | 1 171(13.20) | 1 076(14.02) | 95(7.94) | 33.377 | <0.001 |
Dyslipidemia [n (%)] | 4 818(54.32) | 4 173(54.39) | 645(53.93) | 0.087 | 0.769 |
Salt intake [n (%)] | | | | -2.620 | 0.009 |
Light | 2 234(25.19) | 1 913(24.93) | 321(26.84) | | |
Moderate | 5 492(61.92) | 4 738(61.75) | 754(63.04) | | |
Heavy | 1 143(12.89) | 1 022(13.32) | 121(10.12) | | |
Dietary pattern [n (%)] | | | | 2.105 | 0.349 |
Mainly animal-based foods | 424(4.78) | 376(4.90) | 48(4.01) | | |
Balanced animal and plant foods | 5 630(63.48) | 4 873(63.51) | 757(63.29) | | |
Mainly plant-based foods | 2 815(31.74) | 2 424(31.59) | 391(32.69) | | |
Triglycerides (mmol/L) | 1.54(1.10,2.14) | 1.54(1.10,2.15) | 1.51(1.11,2.13) | -0.445 | 0.656 |
Total cholesterol (mmol/L) | 4.35(3.74,5.00) | 4.36(3.74,5.01) | 4.33(3.72,4.97) | -0.923 | 0.356 |
Low-density lipoprotein (mmol/L) | 2.29(1.76,2.86) | 2.29(1.76,2.86) | 2.24(1.75,2.88) | -0.668 | 0.504 |
High-density lipoprotein (mmol/L) | 1.25(1.09,1.45) | 1.25(1.09,1.45) | 1.26(1.08,1.44) | -0.530 | 0.596 |
Homocysteine (μmol/L) | 18.30(13.50,28.10) | 18.30(13.50,27.80) | 18.60(13.48,29.13) | -0.855 | 0.392 |
Fasting blood glucose (mmol/L) | 4.7(4.3,5.3) | 4.7(4.2,5.3) | 4.8(4.3,5.3) | -1.147 | 0.252 |
Glycated hemoglobin (%) | 5.3(5.0,5.7) | 5.3(5.0,5.7) | 5.4(5.0,5.8) | -4.372 | <0.001 |
Urine pH | 6.0(5.0,6.5) | 6.0(5.0,6.5) | 6.0(5.0,6.5) | -4.440 | <0.001 |
Urine specific gravity | 1.02(1.01,1.02) | 1.02(1.01,1.02) | 1.02(1.01,1.02) | 4.620 | <0.001 |
Urine occult blood [n (%)] | | | | -9.089 | <0.001 |
- | 6 195(69.85) | 5 482(71.45) | 713(59.62) | | |
+ | 2 023(22.81) | 1 699(22.14) | 324(27.09) | | |
2+ | 554(6.25) | 431(5.62) | 123(10.28) | | |
3+ | 97(1.09) | 61(0.79) | 36(3.01) | | |
Urine leukocytes [n (%)] | | | | -1.653 | 0.098 |
- | 7 028(79.24) | 6 096(79.45) | 932(77.93) | | |
+ | 687(7.75) | 606(7.90) | 81(6.77) | | |
2+ | 556(6.27) | 486(6.33) | 70(5.85) | | |
3+ | 598(6.74) | 485(6.32) | 113(9.45) | | |
Urine glucose [n (%)] | | | | -14.494 | <0.001 |
- | 8 556(96.47) | 7 488(97.59) | 1 068(89.30) | | |
± | 1(0.01) | 0(0) | 1(0.08) | | |
+ | 93(1.05) | 60(0.78) | 33(2.76) | | |
2+ | 81(0.91) | 48(0.63) | 33(2.76) | | |
3+ | 58(0.65) | 35(0.46) | 23(1.92) | | |
4+ | 43(0.48) | 25(0.33) | 18(1.51) | | |
5+ | 37(0.42) | 17(0.22) | 20(1.67) | | |
Urine ketones [n (%)] | | | | -2.289 | 0.022 |
- | 8 759(98.76) | 7 586(98.87) | 1 173(98.08) | | |
± | 99(1.12) | 77(1.00) | 22(1.84) | | |
+ | 8(0.09) | 7(0.09) | 1(0.08) | | |
2+ | 3(0.03) | 3(0.04) | 0(0) | | |
Urobilinogen [n (%)] | | | | -0.088 | 0.930 |
- | 8 855(99.84) | 7 661(99.84) | 1 194(99.83) | | |
+ | 8(0.09) | 7(0.09) | 1(0.08) | | |
2+ | 3(0.03) | 2(0.03) | 1(0.08) | | |
3+ | 2(0.02) | 2(0.03) | 0(0) | | |
4+ | 1(0.01) | 1(0.01) | 0(0) | | |
Urine red blood cell count ≥5/μl [n (%)] | 416(4.69) | 326(4.25) | 90(7.53) | 24.847 | <0.001 |
Urine white blood cell count ≥10/μl [n (%)] | 471(5.31) | 375(4.89) | 96(8.03) | 20.281 | <0.001 |
Urinary α1-microglobulin (mg/L) | 10.5(5.8,16.9) | 10.0(5.6,15.7) | 15.3(9.1,26.6) | -18.375 | <0.001 |
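For the dichotomous rows of Table 1, the reported χ² values can be checked directly from the group counts. The sketch below assumes an uncorrected Pearson χ² test on a 2×2 table, which the source does not state explicitly; the function names are hypothetical, and the one-degree-of-freedom p-value uses the identity P(χ²₁ > x) = erfc(√(x/2)).

```python
from math import erfc, sqrt


def chi_square_2x2(a, b, c, d):
    """Pearson chi-square (no continuity correction) for the table [[a, b], [c, d]]."""
    n = a + b + c + d
    numerator = n * (a * d - b * c) ** 2
    denominator = (a + b) * (c + d) * (a + c) * (b + d)
    return numerator / denominator


def p_value_1df(chi2):
    """Upper-tail probability of a chi-square variable with 1 degree of freedom."""
    return erfc(sqrt(chi2 / 2))


# Drinking history row of Table 1: 1 076 of 7 673 without proteinuria,
# 95 of 1 196 with proteinuria.
non_prot_yes, non_prot_total = 1076, 7673
prot_yes, prot_total = 95, 1196

chi2 = chi_square_2x2(
    non_prot_yes, non_prot_total - non_prot_yes,
    prot_yes, prot_total - prot_yes,
)
p = p_value_1df(chi2)
print(f"chi2 = {chi2:.3f}, p = {p:.2e}")
```

Under that assumption the statistic for drinking history comes out at about 33.38, in line with the 33.377 reported in the table, and the p-value is far below 0.001.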
Table 2 Classification performance of the machine learning models
Model | AUC | 95% CI | Precision | Recall | Weighted F1 |
---|---|---|---|---|---|
LR | 0.745 | 0.680~0.762 | 0.842 | 0.612 | 0.742 |
SVM | 0.733 | 0.690~0.751 | 0.838 | 0.571 | 0.712 |
DT | 0.654 | 0.634~0.675 | 0.829 | 0.531 | 0.694 |
RF | 0.756 | 0.697~0.800 | 0.839 | 0.094 | 0.365 |
XGBoost | 0.712 | 0.680~0.730 | 0.834 | 0.308 | 0.560 |
SE-LR | 0.736 | 0.719~0.746 | 0.844 | 0.621 | 0.801 |
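The metrics in Table 2 can be illustrated with a small, dependency-free sketch. The source does not say how precision, recall and F1 were averaged across the two classes, so the support-weighted averaging below is an assumption, and the confusion-matrix counts and scores in the example are made up for illustration rather than taken from the study.

```python
def binary_report(tn, fp, fn, tp):
    """Per-class and support-weighted precision, recall and F1 for a 2x2 confusion matrix."""

    def prf(hits, false_alarms, misses):
        precision = hits / (hits + false_alarms) if hits + false_alarms else 0.0
        recall = hits / (hits + misses) if hits + misses else 0.0
        total = precision + recall
        f1 = 2 * precision * recall / total if total else 0.0
        return precision, recall, f1

    negative = prf(tn, fn, fp)  # "no proteinuria" treated as the target class
    positive = prf(tp, fp, fn)  # "proteinuria" treated as the target class
    support_neg, support_pos = tn + fp, fn + tp
    n = support_neg + support_pos
    weighted = tuple(
        (neg * support_neg + pos * support_pos) / n
        for neg, pos in zip(negative, positive)
    )
    return {"negative": negative, "positive": positive, "weighted": weighted}


def auc(pos_scores, neg_scores):
    """AUC as the probability that a random case outranks a random non-case (ties count half)."""
    wins = sum((p > q) + 0.5 * (p == q) for p in pos_scores for q in neg_scores)
    return wins / (len(pos_scores) * len(neg_scores))


# Illustrative counts and scores only; these are NOT the study's data.
report = binary_report(tn=50, fp=10, fn=5, tp=35)
print("weighted precision/recall/F1:", [round(v, 3) for v in report["weighted"]])
print("AUC:", round(auc([0.9, 0.8, 0.4], [0.7, 0.3, 0.2, 0.1]), 3))
```

With imbalanced classes such as the 13.49% prevalence here, support-weighted averages are dominated by the majority class, which is one reason a model's weighted scores and its AUC can rank the models differently.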
http://journal.yiigle.com/LinkIn.do?linkin_type=cma&DOI=10.3760/cma.j.cn441217-20221028-01041
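All authors declare no conflicts of interest.
Lu Yuanyue was responsible for study design and manuscript writing; Li Ziliang for data collection; Li Wangxin and Liu Yanqin for figure and table preparation and data analysis; Li Rongshan for quality control of the manuscript; Zhou Xiaoshuang for quality control of the manuscript and guidance on study design.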