I will start with how the human voice works, as it’s easier to understand and works in much the same way a vocoder does. There are two main parts involved in speaking: the vocal cords and the mouth (specifically the way our tongue and lips move). Our vocal cords really only make a raw buzzing noise, and that noise is unique to each person. We mainly control its pitch (by tensing the cords) and its volume (by pushing more air past them). At this point you could call this a basic waveform. This waveform then travels through the mouth where, like a filter, we sculpt it into the sounds of the language we speak using our tongue and lips.
A vocoder does pretty much the same thing; it uses a carrier signal and a modulator signal in place of the vocal cords and mouth. The carrier, which is commonly a synth, is the signal that will be modulated. The modulator is commonly speech, for the classic ‘robot talking’ sound, but it can be any other signal source depending on the sound you are going for (this is fun to experiment with).
Getting more specific, the carrier goes through multiple band pass filters (I will stick to 8 for this guide). They split the signal into 8 frequency regions before the modulation circuit. The ‘modulator’ or ‘voice’ (I will refer to it as ‘voice’ for the guide) goes through another set of 8 band pass filters, which separate the voice into the same frequency groups as the carrier. Each frequency group of the voice then goes into its own envelope follower, which extracts the overall amplitude of that group as a control voltage. The carrier, in its separated frequency groups, then goes into a series of VCAs (Voltage Controlled Amplifiers), one for each filter group. These 8 VCAs are amplitude modulated by the 8 envelope followers. This means that whatever amplitude the voice has in the lowest frequency group is imposed on the carrier in the lowest frequency group. The same happens in every filter channel.
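To make the envelope follower and VCA step a bit more concrete, here is a minimal Python sketch of a single channel. It assumes both signals have already been band pass filtered. The function names, the attack and release times, the sample rate and the test tones are all my own placeholders for illustration, not values taken from any particular vocoder.

```python
import math

def envelope_follower(signal, sample_rate, attack_ms=5.0, release_ms=50.0):
    """Rectify the signal, then smooth it with a one-pole filter.

    The attack/release times are assumed values chosen for illustration.
    """
    attack = math.exp(-1.0 / (sample_rate * attack_ms / 1000.0))
    release = math.exp(-1.0 / (sample_rate * release_ms / 1000.0))
    level = 0.0
    envelope = []
    for sample in signal:
        rectified = abs(sample)
        # Rise quickly when the input gets louder, fall slowly when it gets quieter.
        coeff = attack if rectified > level else release
        level = coeff * level + (1.0 - coeff) * rectified
        envelope.append(level)
    return envelope

def vca(carrier, control):
    """Scale each carrier sample by the control value (amplitude modulation)."""
    return [c * g for c, g in zip(carrier, control)]

SAMPLE_RATE = 8000          # assumed sample rate
N = SAMPLE_RATE             # one second of audio

# Stand-in for one voice band: a tone that stops halfway through.
voice_band = [math.sin(2 * math.pi * 300 * i / SAMPLE_RATE) if i < N // 2 else 0.0
              for i in range(N)]
# Stand-in for the matching carrier band: a steady tone.
carrier_band = [math.sin(2 * math.pi * 330 * i / SAMPLE_RATE) for i in range(N)]

control = envelope_follower(voice_band, SAMPLE_RATE)
output = vca(carrier_band, control)

loud = max(abs(s) for s in output[N // 4 : N // 2])   # while the voice is present
quiet = max(abs(s) for s in output[-N // 10 :])       # well after the voice stops
print("peak while voice is on:", loud)
print("peak after voice stops:", quiet)
```

The carrier never changes level on its own here; it only gets loud while the voice band is active, which is the amplitude being ‘imposed’ on it.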
After the carrier has been modulated by the voice, those 8 filter groups get summed back into one channel of audio (or two channels if you are using it in stereo). This summed channel is the robot voice created by a vocoder.
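The whole signal path described above can be sketched in a few lines of plain Python. Treat this as a rough software stand-in for the analog circuit, not a definitive implementation: the band pass filter is a standard biquad (my own choice, a hardware vocoder would use analog filters), and the band centres, Q, smoothing time, sample rate and test signals are all assumptions made for the example.

```python
import math

def band_pass(signal, sample_rate, center_hz, q=4.0):
    """Second-order (biquad) band pass filter with unity gain at center_hz."""
    w0 = 2.0 * math.pi * center_hz / sample_rate
    alpha = math.sin(w0) / (2.0 * q)
    a0 = 1.0 + alpha
    b0, b2 = alpha / a0, -alpha / a0          # b1 is zero for this filter type
    a1, a2 = -2.0 * math.cos(w0) / a0, (1.0 - alpha) / a0
    x1 = x2 = y1 = y2 = 0.0
    out = []
    for x in signal:
        y = b0 * x + b2 * x2 - a1 * y1 - a2 * y2
        x2, x1 = x1, x
        y2, y1 = y1, y
        out.append(y)
    return out

def envelope(signal, sample_rate, smooth_ms=10.0):
    """Envelope follower: rectify, then smooth with a one-pole low pass."""
    coeff = math.exp(-1.0 / (sample_rate * smooth_ms / 1000.0))
    level, out = 0.0, []
    for x in signal:
        level = coeff * level + (1.0 - coeff) * abs(x)
        out.append(level)
    return out

def vocode(carrier, voice, sample_rate, centers):
    """Filter both signals into bands, modulate each carrier band, then sum."""
    output = [0.0] * len(carrier)
    for center in centers:
        carrier_band = band_pass(carrier, sample_rate, center)
        voice_env = envelope(band_pass(voice, sample_rate, center), sample_rate)
        for i, (c, e) in enumerate(zip(carrier_band, voice_env)):
            output[i] += c * e                 # VCA, then summing into one channel
    return output

SAMPLE_RATE = 16000                            # assumed sample rate
N = 8000                                       # half a second of audio
# 8 band centres spaced evenly on a log scale from 150 Hz to 5 kHz (assumed range).
CENTERS = [150.0 * (5000.0 / 150.0) ** (k / 7.0) for k in range(8)]

# Carrier: a 110 Hz sawtooth, standing in for a synth.
carrier = [2.0 * ((110.0 * i / SAMPLE_RATE) % 1.0) - 1.0 for i in range(N)]
# 'Voice': a 1 kHz tone that stops halfway through, standing in for speech.
voice = [math.sin(2 * math.pi * 1000 * i / SAMPLE_RATE) if i < N // 2 else 0.0
         for i in range(N)]

output = vocode(carrier, voice, SAMPLE_RATE, CENTERS)
energy_on = sum(y * y for y in output[2000:4000])
energy_off = sum(y * y for y in output[6000:8000])
print("energy while voice is on:", energy_on)
print("energy after voice stops:", energy_off)
```

With real audio you would load the two signals from files instead of generating them, and a stereo version would simply run this once per channel.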
More frequency bands give you finer frequency separation, which makes the resulting voice clearer and easier to understand.
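A quick way to see why is to work out how wide each band ends up for a given band count. The sketch below assumes the bands are spaced evenly on a logarithmic scale between 100 Hz and 8 kHz; both the spacing and the range are my own assumptions for illustration, and real designs vary.

```python
import math

def band_edges(low_hz, high_hz, bands):
    """Return bands + 1 edge frequencies, evenly spaced on a log scale."""
    ratio = (high_hz / low_hz) ** (1.0 / bands)
    return [low_hz * ratio ** k for k in range(bands + 1)]

def octaves_per_band(low_hz, high_hz, bands):
    """Width of each band in octaves for the given band count."""
    return math.log2(high_hz / low_hz) / bands

LOW_HZ, HIGH_HZ = 100.0, 8000.0   # assumed overall range

for bands in (8, 16, 32):
    edges = band_edges(LOW_HZ, HIGH_HZ, bands)
    width = octaves_per_band(LOW_HZ, HIGH_HZ, bands)
    print(bands, "bands:", round(width, 3), "octaves each,",
          "lowest band", round(edges[0]), "to", round(edges[1]), "Hz")
```

Doubling the number of bands halves the width of each one, so each envelope follower tracks a narrower slice of the voice.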
For those who can read schematics, here's a simple vocoder example.

So yeah, it's pretty simple really, not a whole lot of 'magic' going on.