There's a field called Interpretability (sometimes "Mechanistic Interpretability") that researches how the weights inside a neural network actually function. From what I can tell, Anthropic has the largest team working on this [0]. OpenAI has a small team inside their Superalignment org working on this. Alphabet has at least one team on it (not sure whether it sits in DeepMind, Google DeepMind, or just Google). There are also a handful of professors, PhD students, and independent researchers working on this (myself included), plus a few small labs.
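To give a rough flavor of what that work looks like in practice, here's a minimal sketch of inspecting a model's internals: capture a layer's activations with a forward hook and see which hidden units respond to an input. The tiny untrained MLP is purely illustrative (real work targets trained transformers), but the mechanics are the same.

    # Toy illustration: capture hidden activations with a forward hook.
    # The model here is hypothetical; only the inspection mechanics matter.
    import torch
    import torch.nn as nn

    model = nn.Sequential(
        nn.Linear(4, 8),   # hidden layer whose activations we want to see
        nn.ReLU(),
        nn.Linear(8, 2),
    )

    captured = {}

    def save_activations(module, inputs, output):
        # Store the post-ReLU activations so we can inspect them afterwards.
        captured["hidden"] = output.detach()

    hook = model[1].register_forward_hook(save_activations)  # hook the ReLU

    x = torch.randn(1, 4)
    logits = model(x)
    hook.remove()

    print("hidden activations:", captured["hidden"])
    print("most active unit:", captured["hidden"].argmax().item())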
At least half of this interest overlaps with Effective Altruism's fears that AI could one day cause considerable harm to the human race. Some researchers and labs are funded by EA charities such as Long Term Future Fund and Open Philanthropy.
There is the occasional hackathon on Interpretability [1].
Here's an overview talk by one of the best-known researchers in the field [2].
Some people (namely the EAs) care because they don't want AI to kill us.
Another reason is to understand how our models make important decisions. If we one day use models to help make medical diagnoses or loan decisions, we'd like to know why a given decision was made so we can check it for accuracy and fairness.
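As a sketch of what "why was this decision made?" can look like in code, here's a simple input-times-gradient attribution on a toy, untrained stand-in for a loan classifier. The feature names and model are hypothetical, and real audits use more careful methods, but the basic question of which inputs drove the output looks like this.

    # Minimal input-times-gradient attribution on a hypothetical loan model.
    import torch
    import torch.nn as nn

    features = ["income", "debt", "age", "credit_history"]  # made-up inputs
    model = nn.Sequential(nn.Linear(4, 1), nn.Sigmoid())    # stand-in classifier

    x = torch.tensor([[55_000.0, 12_000.0, 34.0, 7.0]], requires_grad=True)
    score = model(x).squeeze()   # "probability of approving the loan"
    score.backward()

    # Which features pushed the score up or down for this applicant?
    attribution = (x.grad * x).detach().squeeze()
    for name, value in zip(features, attribution.tolist()):
        print(f"{name}: {value:+.4f}")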
Others care because understanding models could allow us to build better models.
[0] https://transformer-circuits.pub/2021/framework/index.html
[1] https://alignmentjam.com/jam/interpretability
[2] https://drive.google.com/file/d/1hwjAK3lWnDRBtbk3yLFL2DCK1Dg...