Without ever having seen one? Or without even having seen pictures of one?
It's true, though, that we can generalise from descriptions and recognise the real thing from them. If you describe an elephant as a big grey animal with big ears and a trunk it can use to grab things, then an adult (though probably not a two-year-old) seeing one for the first time will recognise it from that description.
When we see a cartoonish drawing of one, we can still distill the defining characteristics from it and use them to write a description or recognise the real thing, and we can recognise even a very crude childish drawing by looking for those same characteristics. A lot of additional knowledge feeds into our image recognition. Once we have a big toolbox for recognising tons of different objects, we don't really need to train to recognise a new one anymore, because we distill its identifying characteristics the first time we see it.
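Recognising something from a verbal description, as described above, resembles what the machine-learning literature calls attribute-based zero-shot classification. A minimal sketch (the attribute sets and class names are invented for illustration, and there is no real image pipeline here): each class is a bag of described attributes, and an observation is matched to the description it overlaps most.

```python
# Hypothetical attribute descriptions, standing in for "a big grey animal
# with big ears and a trunk". No training on elephant images is needed.
DESCRIPTIONS = {
    "elephant": {"big", "grey", "big ears", "trunk"},
    "mouse":    {"small", "grey", "big ears", "tail"},
    "giraffe":  {"big", "spotted", "long neck"},
}

def recognise(observed: set) -> str:
    """Return the described class whose attributes best match the observation."""
    def overlap(cls):
        desc = DESCRIPTIONS[cls]
        # Jaccard similarity: shared attributes / all attributes mentioned
        return len(desc & observed) / len(desc | observed)
    return max(DESCRIPTIONS, key=overlap)

# An adult seeing the real thing for the first time, armed only with the
# description above:
print(recognise({"big", "grey", "trunk", "big ears", "wrinkly"}))  # -> elephant
```

The point of the sketch is only that a description alone can be enough to classify a never-before-seen instance, which is the generalisation step the comment describes.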
Some pattern recognition seems to be innate: chicks just hours old can classify the shadow of a flying bird as either harmless (long neck, short tail) or a predator (short neck, long tail).
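The chick's innate response can be caricatured as a hard-wired rule rather than anything learned. A toy sketch (the thresholds and silhouette measurements are invented, not taken from the ethology literature):

```python
def silhouette_threat(neck_length: float, tail_length: float) -> str:
    """Hard-coded 'innate' rule: classify an overhead bird silhouette.

    Short neck + long tail reads as hawk-like (predator); long neck +
    short tail reads as goose-like (harmless).
    """
    if neck_length < tail_length:
        return "predator"   # short neck, long tail
    return "harmless"       # long neck, short tail

print(silhouette_threat(neck_length=1.0, tail_length=3.0))  # -> predator
print(silhouette_threat(neck_length=3.0, tail_length=1.0))  # -> harmless
```

The contrast with the previous sketch is the point: here the classifier ships fully formed, with no descriptions or examples needed at all.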
So, while this particular pattern may not apply to humans (does it really not?), many animals have ready-to-use pattern recognition when they are just hours or days old.
Those are instincts bred by evolution because they're vital for survival. Recognising an elephant is not like that for us. Instead, we get a generic toolbox for quickly learning to recognise very different objects, from animals to machines to abstract shapes. We're not born able to recognise them, but we can pick them up very quickly, even from rough descriptions.
True, but if we're exploring what can be done with artificial neural networks, I see no reason to limit our models to the human brain.