The AI Wellness Check Ins feature allows you to analyze your stress, mood, and energy levels with a simple 20-second voice recording.
This is powered by Canary Speech, a speech and language AI company that has partnered with Galvan to bring this technology to millions of people.
According to Canary Speech, this is how the technology works:
“When we do speech analysis, we look at various features hidden in speech and language. We generally extract 2,548 features per 10 ms of speech and investigate their value distributions and correlations. Since we need to characterize the feature value ranges of the target disease in patients, we want sufficient data to represent the whole population across thousands of features and to find atypicality in the speech biomarkers. The number of subjects we generally suggest is 200. This is informed by previous analyses of related diseases, which used a similar number of features, and the repeated sessions are designed to extract the relevant features while ignoring noise in the data. Our previous experience indicates that a population of 200 is statistically significant for the feature extraction process that Canary Speech performs on a data set.
Voice features fall into two categories: acoustic and linguistic. Acoustic features capture signal-level modulations due to the speaker’s state, while linguistic features capture language-level patterns that may be influenced by the condition.
Acoustic features are calculated on a per-frame basis. Frames are defined as 25 ms sliding windows created every 10 ms. A 41-dimensional supervector of features such as mel-frequency cepstral coefficients (MFCC), perceptual linear prediction (PLP), prosody, and voice-quality features is generated for every frame. Each feature’s delta and delta-delta are concatenated to capture frame-level context. To summarize frames into a response-level feature vector, we apply 19 statistical functions such as mean, median, skewness, kurtosis, quartiles, percentiles, and slope.
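To make the framing and summarization step concrete, here is a minimal Python sketch, assuming librosa and scipy are available. It uses MFCCs and their deltas as a stand-in for the full 41-dimensional supervector, and applies only a handful of the 19 statistical functionals described above:

```python
# Minimal sketch of frame-level acoustic features plus statistical functionals.
# MFCCs + deltas stand in for the full 41-dim supervector; the functional list
# below is a subset of the 19 functions mentioned in the text.
import numpy as np
import librosa
from scipy import stats

def response_level_features(path, sr=16000):
    y, sr = librosa.load(path, sr=sr)
    win = int(0.025 * sr)   # 25 ms analysis window
    hop = int(0.010 * sr)   # new frame every 10 ms
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13,
                                n_fft=win, win_length=win, hop_length=hop)
    # Concatenate deltas and delta-deltas to capture frame-level context.
    frames = np.vstack([mfcc,
                        librosa.feature.delta(mfcc),
                        librosa.feature.delta(mfcc, order=2)])  # (39, n_frames)
    # Summarize each feature track over time with statistical functionals.
    funcs = [np.mean, np.median, np.std, stats.skew, stats.kurtosis,
             lambda x, axis: np.percentile(x, 25, axis=axis),
             lambda x, axis: np.percentile(x, 75, axis=axis),
             # Slope of a least-squares line fit over time.
             lambda x, axis: np.polyfit(np.arange(x.shape[axis]), x.T, 1)[0]]
    return np.concatenate([f(frames, axis=1) for f in funcs])
```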
Language features are based on the results of automatic speech recognition (ASR). We used Canary’s general English model, which is trained on publicly available datasets such as TED-LIUM and LibriSpeech using the time-delay neural network (TDNN) architecture in Kaldi. On top of common features such as the part-of-speech ratio, syllable duration, filler ratio, and word repetition ratio over the total number of spoken words, we extracted a different feature set depending on whether the response is spontaneous or read.
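As an illustration of transcript-level features of this kind, here is a minimal sketch; the filler list and the two ratios are illustrative stand-ins, not Canary’s actual feature definitions:

```python
# Minimal sketch of two language features computed over an ASR transcript:
# filler ratio and word-repetition ratio over the total spoken words.
# The filler vocabulary here is illustrative.
FILLERS = {"um", "umm", "uh", "er", "ah", "like"}

def language_features(transcript: str) -> dict:
    words = transcript.lower().split()
    total = max(len(words), 1)
    fillers = sum(w in FILLERS for w in words)
    # Count immediate repetitions such as "I I went went ..."
    repeats = sum(a == b for a, b in zip(words, words[1:]))
    return {
        "filler_ratio": fillers / total,
        "word_repetition_ratio": repeats / total,
    }

print(language_features("um I I went to the uh the store"))
```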
The feature dimension is around 2.5K. For feature selection, we computed Pearson’s correlation coefficient between the extracted features and the individual self-assessed measurement scores and then selected the top n correlated features for the model. Note that this is done at the response and measurement levels, so different features may be selected from different responses depending on the target measurement.
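The described selection step is straightforward to sketch. Here is a minimal version, with illustrative array names and shapes matching the figures quoted above (200 subjects, roughly 2.5K features):

```python
# Minimal sketch of top-n feature selection by Pearson correlation with a
# self-assessed measurement score. Names and shapes are illustrative.
import numpy as np

def select_top_n(X, scores, n=50):
    """X: (subjects, features); scores: (subjects,). Returns column indices."""
    Xc = X - X.mean(axis=0)
    sc = scores - scores.mean()
    # Per-feature Pearson r against the score vector.
    r = (Xc * sc[:, None]).sum(axis=0) / (
        np.sqrt((Xc ** 2).sum(axis=0)) * np.sqrt((sc ** 2).sum()) + 1e-12)
    return np.argsort(-np.abs(r))[:n]

X = np.random.randn(200, 2548)   # e.g. 200 subjects x ~2.5K features
scores = np.random.randn(200)    # self-assessed measurement scores
top = select_top_n(X, scores, n=50)
```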
Our Approach:
- Biomarkers: We have identified 2,548 biomarkers in speech. These markers are consistent for a disease across a range of individuals.
- Model Development: We train models using linguistic, voice, and spectral features unique to the targeted disease (see the sketch after this list).
- Identifying Human Conditions: We use models to identify human conditions in individuals without the need for pre-obtained speech data.
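Here is a minimal sketch of the train-and-test step on a selected feature set, using scikit-learn as a generic stand-in; Canary’s actual models are not public:

```python
# Minimal sketch of model training and evaluation on selected features.
# The data, label, and model choices here are illustrative placeholders.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X = np.random.randn(200, 50)       # selected features per subject
y = np.random.randint(0, 2, 200)   # condition label (illustrative)

model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
print(cross_val_score(model, X, y, cv=5).mean())
```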
Canary’s technology enables the extraction of a superset of features (“feature” being the machine-learning term closest in meaning to “biomarker”) directly from speech. The majority of Canary’s features are in the acoustic domain, which comprises more than 700 features, such as spectral features (MFCC, PLP) and prosody features (pitch, jitter, shimmer). Acoustic features are augmented by language features, such as the ratio of filler words (“Umm”) and the duration of vowels. The typical process is to perform feature selection to choose a smaller set of relevant features, then perform model training and testing using those features.
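As an example of a prosody feature of the kind listed above, here is a minimal sketch assuming librosa; the jitter value is a rough frame-level approximation of the cycle-to-cycle measure used in practice, and "response.wav" is a placeholder file name:

```python
# Minimal sketch of prosody-style features: pitch via librosa's pYIN tracker,
# plus a rough frame-level approximation of jitter (true jitter is computed
# cycle-to-cycle, e.g. with Praat; this is only a stand-in).
import numpy as np
import librosa

y, sr = librosa.load("response.wav", sr=16000)  # placeholder input file
f0, voiced, _ = librosa.pyin(y, fmin=librosa.note_to_hz("C2"),
                             fmax=librosa.note_to_hz("C7"), sr=sr)
f0 = f0[voiced & ~np.isnan(f0)]                 # keep voiced frames only
periods = 1.0 / f0
jitter_approx = np.mean(np.abs(np.diff(periods))) / np.mean(periods)
print(f"mean pitch: {f0.mean():.1f} Hz, jitter (approx): {jitter_approx:.4f}")
```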
Canary’s scientists have spent decades, in some cases their entire careers, working in speech, language, and machine learning.
The Canary Speech Engine is an extension of a generalized speech recognition engine. Speech recognition has traditionally posed challenging algorithmic problems, with predictive language modeling in particular involving graph theory and finite-state transducer mathematics. The Canary Speech Engine adds feature engineering beyond general speech recognition and optimizes speech processing to run in real time.
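As a toy illustration of the finite-state transducer idea, here is a minimal, unweighted lexicon transducer in pure Python that maps phoneme sequences to words; production engines compose weighted transducers (e.g. Kaldi with OpenFst), which this sketch does not attempt:

```python
# Toy lexicon transducer: walks phoneme-labeled arcs and emits a word when a
# pronunciation completes, loosely in the spirit of the "L" transducer in
# WFST-based recognizers. Unweighted and deterministic, for illustration only.
LEXICON = {("HH", "AH", "L", "OW"): "hello",
           ("W", "ER", "L", "D"): "world"}

def build_fst(lexicon):
    # Each state maps: phoneme -> (next_state, emitted word or None)
    root = {}
    for phonemes, word in lexicon.items():
        state = root
        for i, p in enumerate(phonemes):
            last = i == len(phonemes) - 1
            nxt, _ = state.get(p, ({}, None))
            state[p] = (nxt, word if last else None)
            state = nxt
    return root

def transduce(fst, phonemes):
    words, state = [], fst
    for p in phonemes:
        state, out = state[p]   # KeyError means the input has no path
        if out:
            words.append(out)
            state = fst         # return to the start state for the next word
    return words

print(transduce(build_fst(LEXICON),
                ["HH", "AH", "L", "OW", "W", "ER", "L", "D"]))
```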
Canary’s approach is unique: the use of machine learning in health care is widespread, but the use of speech and language as a non-invasive, objective marker for the detection of disease is an area with few players. Speech and language are ubiquitous and relatively easy to acquire from patients. The potential to improve on current baselines with speech and language solutions is vast.”