People speak in words and pause between them to separate one word from the next. These pauses let both the speaker and the listener understand exactly what is meant.
If we play back the speech, however, we find that the speaker builds each word from a fixed set of 'vyanjanas'. For example, the word "virus" is composed of "vi", "ra", "ss".
According to the ancient Indian Sanskrit approach, there is a fixed
set of about 425-475 vyanjanas that a human being can speak.
Detecting and coding all of them is what is called, in technical
language, the 'User Speech Model'.
In VoiceAction, if you set Sample_processing_resolution to
3000-4000 for a recording frequency setting of 22.5 kHz, you will
find the vyanjanas of your word. Listen very carefully to the
playback of each vyanjana and give it a simple name that is easy
to compare, such as "aa" or "oo".
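
To get a feel for these numbers, the sketch below works out how much speech one processing window would cover. It assumes that Sample_processing_resolution is a count of audio samples per window, which this text does not confirm. The function name window_duration_ms is hypothetical and is not part of VoiceAction.

```python
# Assumption: Sample_processing_resolution is a number of samples per window.
# window_duration_ms is an illustrative helper, not a VoiceAction function.

def window_duration_ms(samples_per_window, recording_rate_hz):
    """Return the length of one window in milliseconds."""
    return 1000.0 * samples_per_window / recording_rate_hz


RECORDING_RATE_HZ = 22_500  # the 22.5 kHz recording frequency setting

for resolution in (3000, 4000):
    duration = window_duration_ms(resolution, RECORDING_RATE_HZ)
    print(f"{resolution} samples -> {duration:.1f} ms")
```

Under that assumption, a setting of 3000-4000 corresponds to windows of roughly 133 to 178 ms, which is on the order of a single short spoken unit such as "vi" or "ra".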