*Recorded at Perimeter Institute, as part of a monthly webinar series jointly hosted by Perimeter, IVADO, and Institut Courtois: https://pirsa.org/24090157.*

In causal inference, we imagine different “what-if” scenarios—what if an intervention happens, what if it doesn’t—and study the potential outcomes that might result from each scenario to understand the cause-effect relationship.

Consider two luminaries from the 1740s: David Hume and James Lind. David Hume was a *methods* person, sitting in his armchair, like a philosopher does, trying to figure out a good definition of causality. James Lind was an *applications* person, running the first ever randomized trials to find the cure for scurvy.

These two people exemplify the two approaches to causal inference: **top down** and **bottom up**.

Modern causal inference is a partnership between **methods people** and **applications people**. You need both because if you’re an applications person and you don’t have rigorous theory, you can do the wrong thing. And if you’re a theory person, without applications, you can go off in directions that aren’t connected to any real problems.

In causal inference, we want to understand the effect of a particular action or intervention (the *cause*) on an outcome (the *effect*).

We conceptualize cause-effect relationships by means of (*random variable*) responses to hypothetical interventions (also known as *counterfactual random variables* or *potential outcomes*).

There are various notations for this, one of which is \(Y(a)\).

Note that it is very different from \(Y \vert a\), which reads “\(Y\), given that we observed \(A\) to have value \(a\)”. This distinction is a mathematical reflection of the familiar saying that “correlation is not causation”.

The way people think about cause-effect relationships is in terms of hypothetical interventions that set a cause whose effect we’re interested in to a particular value (hypothetically), set it to a different value (hypothetically), and compare the outcomes. **This is useful in real-life scenarios, where you are unable to do randomized trials.** So you do this kind of special massaging of the data, which we call *adjusting for confounders*.
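To make this concrete, here is a toy simulation (my own illustration, not from the talk) in which a naive comparison of treated and untreated outcomes is biased by a confounder, while adjusting for the confounder recovers the true effect:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000

# Confounder C affects both treatment A and outcome Y.
c = rng.binomial(1, 0.5, n)
a = rng.binomial(1, np.where(c == 1, 0.8, 0.2))     # treatment more likely when C = 1
y = 2.0 * a + 3.0 * c + rng.normal(0, 1, n)         # true causal effect of A is 2.0

# Naive contrast E[Y | A=1] - E[Y | A=0] is biased by C.
naive = y[a == 1].mean() - y[a == 0].mean()

# Adjusting for the confounder: average the within-stratum
# contrasts, weighted by the marginal distribution of C.
adjusted = sum(
    (y[(a == 1) & (c == v)].mean() - y[(a == 0) & (c == v)].mean()) * (c == v).mean()
    for v in (0, 1)
)

print(round(naive, 2), round(adjusted, 2))  # naive ≈ 3.8, adjusted ≈ 2.0
```

The naive contrast is inflated because treated units disproportionately have \(C = 1\), which raises \(Y\) on its own; the stratified comparison removes that distortion.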

There are at least two schools of thought on causal modelling: *potential outcomes* and *causal graphs*. Potential outcomes specifies causal assumptions algebraically, emphasizes (potentially counterfactual) random variables, and is common in statistics and public health. Causal graphs specifies causal assumptions graphically, emphasizes operators (eg the `do()` operator) and structural equations, and is common in computer science (and at CMU). *Single World Intervention Graphs* (SWIGs) bridge the two schools.

*(From Ilya’s POV)*

Start by talking to somebody who does applications, come up with a cause-effect question of interest, then try to express it mathematically as a causal parameter, usually in terms of a hypothetical experiment. A common example is the *average causal effect* (also called the *average treatment effect*), which measures the difference in mean outcomes between units assigned to the treatment and units assigned to the control (source: Wikipedia).
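In potential-outcome notation, for a binary treatment \(A\), the average treatment effect compares the two counterfactual means:

```latex
\mathrm{ATE} = \mathbb{E}[Y(1)] - \mathbb{E}[Y(0)]
```

Note that both terms involve counterfactual variables, so neither is, by itself, a quantity you can read off the observed data.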

Elicit a causal model—which encodes assumptions we’re willing to make about the problem—from an expert, or possibly learn the model from data.

Do *identification*, which means asking whether the causal parameter is uniquely expressible in terms of the available data, given the causal model. This is basically checking that the observed data likelihood we have in the problem actually has information that can be brought to bear on what we’re asking. That’s very important.
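A standard example of identification (my gloss, not from the talk): if \(C\) is a set of observed covariates sufficient to control for confounding (no unmeasured confounding, plus positivity), then the counterfactual mean is identified by the adjustment formula, sometimes called the g-formula:

```latex
\mathbb{E}[Y(a)] = \sum_{c} \mathbb{E}[Y \mid A = a, C = c] \, p(C = c)
```

Every quantity on the right-hand side is a functional of the observed-data distribution, which is exactly what identification requires.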

If the parameter can be expressed in terms of observed data, we think about estimation, or statistical inference, or learning (if we’re a machine learning person). This means we ask how we can construct a procedure that makes a good guess about our identified parameter from data. We might be concerned with doing it efficiently or doing it robustly.

Causal parameters are generally *counterfactual*. Because of this, it’s very difficult to validate whether we did a good job estimating them (unlike in machine learning and supervised learning problems, where validation can be accomplished using, eg, holdout data).

Quantifying uncertainty is important, which can be done by confidence intervals or credible intervals.

Causal inference is a version of a missing data problem. Vice versa, you can think about missing data problems causally.

In causal inference, the fundamental object is:

What would be the outcome if, hypothetically, I set the treatment to some value, possibly contrary to fact.

In missing data, we are interested in:

What would be the outcome if we could hypothetically observe it, even though in reality it may be missing.

The principled methods from missing data have a strong relationship to principled methods from causal inference. They are kind of sister disciplines.
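To make the kinship concrete, here is a toy sketch (my illustration, not from the talk): when the outcome is missing at random given an observed covariate, the complete-case mean is biased, but reweighting the observed cases by the inverse probability of being observed recovers the full-data mean. This is the same inverse-probability-weighting idea used for causal adjustment:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 200_000

# Covariate C drives both the outcome Y and whether Y is observed (R = 1).
c = rng.binomial(1, 0.5, n)
y = 3.0 * c + rng.normal(0, 1, n)                   # true mean of Y is 1.5
r = rng.binomial(1, np.where(c == 1, 0.9, 0.3))     # Y is missing more often when C = 0

complete_case = y[r == 1].mean()                    # biased toward the C = 1 stratum

# Inverse probability weighting: weight each observed Y by 1 / P(R=1 | C).
p_obs = np.array([r[c == v].mean() for v in (0, 1)])[c]
ipw = (r * y / p_obs).mean()

print(round(complete_case, 2), round(ipw, 2))  # complete-case ≈ 2.25, IPW ≈ 1.5
```

Swap “observed vs missing” for “treated vs untreated” and this is exactly the weighting used to estimate counterfactual means.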

Machine learning (ML) is a very optimistic field. They have a “let’s just build it and see if it works” attitude, even if they can’t prove it will always work, which is admirable. They publish quickly and often, and the progress is clear, as the recently-developed large language models (LLMs) have shown.

Causal inference (CI) people, on the other hand, are conservative and constructively pessimistic. Since it’s difficult for them to validate their results, they assume that they are wrong by default. They spend a lot of time thinking about: robust methods; validation; sensitivity analysis (checking if violating their assumptions still gives them the answer they expect). This is a valuable attitude in applied problems, where things don’t work the way you would expect.

ML people tend to think about finite sample results. CI people tend to think about asymptotic identification and estimation results. These are good complements to each other.

People in ML emphasize tasks and validation of results, which is very useful. On the other hand, people in CI aren’t able to validate their findings (you can’t observe things that didn’t happen), so they try to be very transparent about their assumptions and put them upfront. Both are very important to how we should do science.

Causal inference people know that often the best way to approach problems with infinite-dimensional parameters is to use semi-parametric theory. A lot of folks in ML aren’t aware of this. Ilya believes that semi-parametric theory is the correct approach for many problems that arise in ML, such as problems in model-based reinforcement learning.

In saying that, people in ML have thought about predictive modelling really hard for the last 50 years, and they have excellent predictive performance. This can be very helpful as a subroutine for semi-parametric inference.

**ML definitions:** A fully parametric model is one where the number of parameters is fixed (eg a linear regression model). A non-parametric model is one where the number of parameters grows with how much data you have (eg a random forest). A semi-parametric model is in between.

**Formal statistical theory definitions:** Statistical models are all about the tangent space. The tangent space is the space of scores, where a score is the derivative of the log-likelihood wrt the parameters. A lot of the behaviour of learning has to do with the tangent space. A parametric model is one where the tangent space is Euclidean, because its dimension equals the (fixed) number of parameters. Semi-parametric models have infinite-dimensional tangent spaces with equality restrictions. If you want to do ML, you are in the semi-parametric regime. There you have very wiggly, flexible surfaces, and you cannot do linear regression; you have to use big neural networks.
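Not from the talk, but to make “ML as a subroutine for semi-parametric inference” concrete, here is a minimal sketch of the augmented inverse-probability-weighted (AIPW) estimator on a toy binary-confounder example. The nuisance models below are saturated stratum means; in a real problem you would plug in flexible ML regressors for \(\mathbb{E}[Y \mid A, C]\) and the propensity score:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100_000

# Toy setup: binary confounder C, treatment A, outcome Y; true effect is 2.0.
c = rng.binomial(1, 0.5, n)
a = rng.binomial(1, np.where(c == 1, 0.8, 0.2))
y = 2.0 * a + 3.0 * c + rng.normal(0, 1, n)

# Nuisance estimates (saturated, so both are correct here):
# propensity score e(c) = P(A=1 | C=c) and outcome regressions m_a(c) = E[Y | A=a, C=c].
e = np.array([a[c == v].mean() for v in (0, 1)])[c]
m1 = np.array([y[(a == 1) & (c == v)].mean() for v in (0, 1)])[c]
m0 = np.array([y[(a == 0) & (c == v)].mean() for v in (0, 1)])[c]

# AIPW (doubly robust) estimate of E[Y(1)] - E[Y(0)].
psi1 = m1 + a * (y - m1) / e
psi0 = m0 + (1 - a) * (y - m0) / (1 - e)
ate = (psi1 - psi0).mean()
print(round(ate, 2))  # close to the true effect 2.0
```

The estimator is “doubly robust”: it remains consistent if either the outcome model or the propensity model is correctly specified, which is what makes black-box ML predictions safe to use as plug-ins.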

People in CI worry about parameter identifiability, which is the very important problem of determining if your data has information about your problem. In ML, there are some model fragility issues related to the fact that parameters in ML are not identified because the models are over-parameterized (deep learning systems have billions of parameters).

CI isn’t a supervised problem, so CI people have developed principled ways of thinking about problems that are unsupervised. Many problems in ML also go beyond the traditional supervised framework, even though they are sometimes approached as if they were purely supervised. For example, in a prediction task, if some outcomes are unobserved, this becomes a missing data issue, which shifts the problem away from a strictly supervised setting. Acknowledging these nuances can help both fields refine their approaches.

Causal language is very helpful when thinking about model stability, invariance, fairness, interpretability. This has been noticed in a lot of ML papers that use CI.

CI and missing data are sister disciplines, so causal methods give you principled approaches for thinking about missing data. This is particularly important since most applied analyses will inevitably encounter some degree of missing data. In practice, it’s common to see quick fixes, such as using a standard imputation package in Python or R. But it’s crucial to go beyond thinking of missing data as a data cleaning issue. Different types of missingness require different strategies.

People in ML have powerful optimization methods which are likely helpful for solving estimating equations that arise in semi-parametric inference.

A big lesson from ML is that surprisingly many problems in life are regression problems, which neural networks are well-suited to handle.

But it’s important to acknowledge that not every problem can be approached as a regression problem. While LLMs have generated excitement due to their impressive capabilities, there’s a growing realization that they have limitations. For example, LLMs can sometimes produce misleading or overly optimistic responses and may struggle with highly structured problems, such as proofs, chess, or tasks involving causal inference. This isn’t surprising, as different types of problems—like planning, logical reasoning, and causal inference—have specialized methods that are better suited to their unique demands. LLMs, despite their advanced transformer architectures, are not designed to excel at every type of problem, and it’s important to apply the right tools to the right challenges.

Causal inference can be used to emulate randomized control trials in cases where it’s either unethical (like in medicine) or too expensive (like in economics).

But it’s not like you get this for free. You get this by making assumptions about the causal model. Those assumptions might not be exactly right so you have to be upfront about your assumptions and pitch your result as: if you believe this list of assumptions, then this is our conclusion, but if you do not, then you should believe it less.

*(not because it’s any less interesting, just not where my head is at right now)*

- The state of the art in causal inference
- An example of the causal inference pipeline on a project he is working on with cardiac surgeons
- Open problems in causal inference
- State of the art with “post-selected data”, which means something different in causal inference and quantum information communities (during question time)
- Areas for collaboration between ML & CI (during question time)
- How causal language lets you assess stability (during question time)

It’s a really nice talk. I recommend it. You can watch it here.

My first thought after finishing this book was: I wonder how hard it would be to create a website called `shouldhavebeenablogpost.com`\(^1\) whose only purpose is for users to vote on whether a book should have been a blog post or not.

The Geek Way by Andrew McAfee should have been a blog post.

In saying that, there were a couple of absolute gems that really stood out for me, which made the book worthwhile. But before we get to those, what is the book about?

The book is about the ways in which the Silicon Valley “geeks” run their modern-tech-era companies, and why those ways are better than the ways in which non-Silicon-Valley non-geeks run their industrial-era companies.

The author distills those ways into four principles, which he calls the **Four Geek Mantras**:

- Science: Argue about evidence
- Ownership: Align, then unleash
- Speed: Iterate with feedback
- Openness: Reflect, don’t defend

Principles 1 and 3 are pretty obvious to anyone with some interest in Silicon Valley tech companies (Principle 1 says use data to make decisions while Principle 3 says use agile over waterfall).

Principle 4 says that people respond to peer pressure, so make sure the peers in your company pressure each other to do things that are aligned with the values of the company.

Principle 2 is interesting though. It’s about bureaucracy and how it sucks. That’s no surprise to anyone. But McAfee’s explanation for where it comes from is gold:

The ultimate explanation is that dense bureaucracy is the result of status seeking by us status-obsessed [humans].

We invent work so that we can be part of it. We strive to be consulted on lots of decisions, and if possible have veto power over them. Excess bureaucracy is a bug for anyone who wants a company to run efficiently, but it’s a feature for the [humans] who seek opportunities to gain status in the organization. \(^2\)

I mean, intuitively I knew this, but I had never seen it spelled out so eloquently.

But it gets better, because McAfee points out something I *didn’t* know, not even on an intuitive level. The way to prevent this kind of thing from happening is to **create** silos within the organization. Excuse me, what? My whole professional life, I had thought that the key to fixing organizations was to **break down** silos…to encourage communication. But McAfee makes a very good case for why the opposite is true: “cross-team communication can be harmful because it often turns into a soft form of bureaucracy.”

But if you create silos and cut off communications between teams, how will people figure out what to do? You make sure that the company’s high-level vision and strategy are clear and that everyone in the chain of command knows how to, and is incentivized to, translate that vision and strategy into clear team-level objectives and key results.

I think we all know that clear vision and clear objectives would make our lives easier, but the point here is that they are absolutely critical if you want to avoid the *soft bureaucracy of over-communication*.

The other thing in the book that stood out for me was in the discussion of Principle 4, the one about peer pressure. Specifically: how should we define a peer group?

St. Augustine says *love*: “a people is an assemblage of reasonable beings bound together by a common agreement as to the objects of their love.”

Anton Chekhov says *hate*: “love, friendship, respect do not unite people as much as common hatred for something.”

Andrew McAfee says *norms*: “whether we love the people around us or hate them, what unites them and us into a coherent group is what we’ve collectively decided to punish with painful social rejection—with the threat or reality of ostracism from the group. A big part of what unites us, in other words, is our norms.”

This piqued my interest in a different context from the one in which it was presented.

I talk with a lot of physics academics who are considering, or already pursuing, a career change from academia to industry. Due to the strong norms within academia, and the fear of ostracism, this kind of transition can be really traumatic for people. But then, after they leave and align themselves with a new group, it’s hard to remember what all the fuss was about. The discussion in this book shed some light on how this happens.

I felt that most of the book was pretty obvious and self-evident, and that most of the extended discussion of the four principles didn’t add much texture or depth to the high-level points.

In saying that, there were a couple of gems that really stood out for me.

So I think it was worth reading, but if I knew then what I know now, I would have read the one-page chapter summaries first and only dug into the parts of the book that seemed novel to me.

What do you think?

\(^1\) I checked and the domain was registered recently. I wonder what book triggered the owner to buy it!

\(^2\) In the book, McAfee coined the term *Homo Ultrasocialis* which he uses to refer to humans, but I didn’t want to go into that here.

But it didn’t take long to notice that a lot of what I read about startups didn’t seem to apply to what I was seeing first-hand.

Eventually, I figured out what was up: quantum computing startups are not *regular* startups. Quantum computing startups are *deep tech* startups.

So what’s the difference between a regular startup and a deep tech startup?

Before diving into that question, we first need to understand how a startup is different from a small business:

- A small business is designed to grow *organically* while a startup is designed to grow *inorganically*.
- A small business is designed to *earn revenue and spend it* while a startup is designed to *raise capital and burn it*.

(For a great discussion of this, check out the first episode of The Startup Podcast with Yaniv Bernstein & Chris Saad.)

Another way to think about it is that for a small business, there is a fairly linear relationship between capital investment and profit, while for a startup, the relationship is highly nonlinear. If things go well for a startup, the nonlinearity will exhibit an inflection point where profit takes off (a hockey-stick curve).

This distinction is important because it affects how you should run your business. If you’re a small business and you operate like a startup (or vice versa), you’re going to have a bad time.

So what does it mean to “burn capital”? Is it just a fancy way to say “lose money”? Not at all. Burning capital is a conscious decision to spend money on growing your business *inorganically* until it reaches the point where it can generate profit.

Here are two scenarios in which this makes sense.

The first is when your product is built using software and relies on a huge number of users to generate profit. With software, the more people you serve, the cheaper it is to serve each person and you’re able to create outsized returns at scale. But while you grow your user base, you may need to offer your product at a loss. So you have to burn capital until you “reach scale”…that is, until you have enough users to generate profit.

The second is when your product relies on technology that is extremely complex, takes a long time and a lot of money to develop and commercialize, but will be very valuable once it exists. You might not need scale effects to generate profit, but you do need a product. So you have to burn capital while you develop and commercialize the underlying technology.

The first approach is how most regular (aka traditional, aka silicon-valley style, aka shallow-tech) startups work. The second approach is how deep tech startups work.

So why does this distinction matter?

In the (regular) startup world, the recommended approach is to build your company around a problem that needs to be solved, not around a product or a technology. There are sound reasons for taking this approach, and it’s completely feasible to do so if the solutions to your problem use well-developed technology like software. Once you’ve identified a problem to solve, you have a plethora of software tools and experts to develop the solution, and test it on users, in a reasonable amount of time.

This isn’t the case for the kinds of technologies being developed by deep tech startups. The underlying technology is usually developed during one or more people’s PhDs, possibly based on years of prior work in the PI’s lab, then developed even further outside of the lab on its way towards commercialization. It’s usually not practical to start from a problem and then ask someone to spend 20 years developing a deep-tech solution to it. Instead, deep tech startups start with the technology, *then* commercialize it (i.e. develop the product and move it to market). The terminology for this is that deep-tech startups are “technology first”.

These are drastically different approaches to product development: having a problem and looking for a solution vs. having a solution and looking for a problem.

The fact that deep tech startups are technology-first has several important flow-on effects, such as longer time scales, higher and different kinds of risk, and the kinds of investors they should work with. You can read more about all of this in the following articles:

- Commercializing deep tech startups (Tech Crunch) by Vin Lingathoti
- Deep Tech vs General Tech (LinkedIn Pulse) by Fateh Ali
- What is different about deep tech startups? (MIT orbit) by Elaine Chen

All of this brings up some interesting questions:

- Is every quantum computing startup a deep tech startup? (If they’re building hardware, probably. If they’re building software, it depends. If they’re providing services, probably not.)
- Does it make sense for quantum software companies to use the kinds of agile methodologies used by regular software companies if the time-scales aren’t short enough to iterate quickly? Or should they revert to traditional project management techniques common in engineering, like waterfall? Or are there other approaches that make more sense?
- What metrics should quantum computing startups (and quantum computing orgs) use to gauge whether they’re on the right track? For regular businesses, revenue growth is a good metric. For regular startups that haven’t reached scale yet, user growth is a good metric. But what’s a good metric for deep tech startups, specifically quantum computing startups?

What do you think?

There’s a lot of material out there about how to successfully run a startup, but most of it is aimed at regular software startups. Many quantum computing startups, on the other hand, are deep-tech startups which are “technology first”. In any kind of startup, *product-market fit is critical for success*, but the approaches can be different: regular startups can (and should) start by identifying a problem, then work on finding a solution for that problem, while deep tech startups are forced to start with the solution (the technology), and must find a problem that’s solved by that technology. This has some interesting flow-on effects and raises some interesting questions. If you’re running or working for a quantum computing startup (or organization), it’s important to know the difference so you don’t apply the wrong lessons from regular startups.