Integrating Large Multi-Modal Models for Automated Powdery Mildew Phenotyping in Grapevines

Published:

Yiyuan Lin1, Zachary Dashner2, Ana Jimenez2, Dustin Wilkerson3, Lance Cadle-Davidson4,5, Summaira Riaz2,*, Yu Jiang5,*

1 School of Electrical and Computer Engineering, College of Engineering, Cornell University, Ithaca, NY 14850, USA
2 San Joaquin Valley Agricultural Sciences Center, United States Department of Agriculture-Agricultural Research Service, Parlier, CA 93648, USA
3 Institute of Biotechnology, Cornell University, Ithaca, NY 14850, USA
4 Grape Genetics Research Unit, United States Department of Agriculture-Agricultural Research Service, Geneva, NY 14456, USA
5 School of Integrative Plant Science, Cornell University, Geneva, NY 14456, USA
* Corresponding authors.

[Paper] [Codebase] [Dataset]

Publication

The paper is currently under review and will be linked here once published.

Codebase

The codebase for this work is open source and publicly available at SAM-CLIP.

Dataset

The dataset for this work is open source and publicly available at PM-SAM-CLIP_AI_in_Ag.


Abstract

Powdery mildew is a major fungal disease of grapevines, yet its field quantification remains constrained by subjective visual assessments, inconsistent scoring protocols, and the high cost of annotated datasets. Large multi-modal models that integrate vision and language offer new opportunities for scalable phenotyping, but applying them to agricultural imagery is challenging due to subtle symptoms and domain shifts across environments. This study presents an automated phenotyping pipeline that combines active illumination imaging with a unified Segment Anything Model-Contrastive Language-Image Pretraining (SAM-CLIP) framework for powdery mildew segmentation and vine-level severity quantification.

The PM-SAM-CLIP model achieved 58.43% image-level mean Intersection over Union (\(mIoU^I\)), substantially outperforming baselines. For canopy segmentation, Canopy-SAM-CLIP reached 95.48% \(mIoU^I\), providing a stable foundation for severity normalization. Applied to independent field datasets without fine-tuning, the pipeline demonstrated strong zero-shot generalization, with over 98% of vines scored within one category of expert ratings. Multi-date analyses further produced spatial and temporal infection maps that captured vineyard-scale heterogeneity and disease progression.
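As a point of reference for the numbers above, image-level mean IoU (\(mIoU^I\)) averages the per-image Intersection over Union between predicted and ground-truth masks. The sketch below is a minimal illustration for binary masks and is not the paper's implementation; the function name and the convention that an image with two empty masks scores 1.0 are assumptions.

```python
import numpy as np

def image_level_miou(pred_masks, gt_masks):
    """Mean of per-image IoU scores for binary segmentation masks.

    pred_masks, gt_masks: sequences of same-shape boolean arrays.
    Assumption: images where both masks are empty count as IoU = 1.0.
    """
    ious = []
    for pred, gt in zip(pred_masks, gt_masks):
        pred = np.asarray(pred, dtype=bool)
        gt = np.asarray(gt, dtype=bool)
        inter = np.logical_and(pred, gt).sum()
        union = np.logical_or(pred, gt).sum()
        ious.append(1.0 if union == 0 else inter / union)
    return float(np.mean(ious))

# Toy example: one perfect match (IoU = 1.0), one half-overlap (IoU = 0.5)
a = np.array([[1, 1], [0, 0]], dtype=bool)
b = np.array([[1, 0], [0, 0]], dtype=bool)
print(image_level_miou([a, a], [a, b]))  # -> 0.75
```

Averaging per image (rather than pooling pixels across the dataset) keeps small, lightly infected leaves from being swamped by large canopies, which matters for subtle powdery mildew symptoms.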

To assess biological relevance, a downstream quantitative trait locus (QTL) analysis was performed using image-derived severity metrics. This analysis identified a major resistance locus on chromosome 13, consistent with QTL mapped from human scouting scores and aligned with previously reported loci, supporting the validity of the automated phenotypes.

Overall, this work demonstrates that large multi-modal models adapted through vision-language integration can generate accurate, reproducible, and scalable phenotypes for plant disease assessment. The proposed pipeline reduces human subjectivity and provides high-quality quantitative traits that support downstream genotype-phenotype analyses in viticulture.

Keywords: Large multi-modal model; foundation model; grape powdery mildew; high-throughput phenotyping; semantic segmentation; SAM; CLIP; QTL analysis; precision viticulture

Citation

TODO