I've been thinking a lot about nearly these exact same things. We desperately need better ways to deal with derived textual data. Why do we make the programmer guess which data structure will work best for a particular task, when we could easily try it each way and record the performance, and pick the best? A big part of the reason must be that we have no good strategy for storing that data and making that choice in an ongoing and living way, inside of a code repository. We suck at dealing with derived data on the meta layer above our programming languages.
Email me at glmitchell[at]gmail if you want to chat more about this.
Email me at glmitchell[at]gmail if you want to chat more about this.