MapApp

General

Inbox

Aim to split this into multiple linked files once a good way to chop it up is clear.

Lovina: showing how to reload RP006 groups…

  • go to GCP
  • select rp006-prod project
  • go to cloud storage
  • select l3 cache bucket
  • when cache manager runs, it drops groups from the redis cache and writes the new ones
    • it writes all these groups to the l3 cache bucket
    • we can delete all the folders via the UI by selecting them all and clicking delete
  • see the “How To > Access MapApp Prod Data” document for how to access redis-proxy in GCP
    • may need to look up the new external endpoint IP address
    • connect to redis locally
      • have redis cli
      • log in via terminal with command in document
      • the proxy passes along redis commands
      • the host in the command is the external endpoint address
      • look up access token in vault > secret > phenoapp > managed_redis
      • that opens a prompt talking to the redis GCP instance
      • use redis-cli commands to do things
      • redis is a key-value store, basically
      • there’s a key that stores the names of all the available groups:
        • looking in the caches.py, you can see the name of that key is “available-groups”
        • run zrange available-groups 0 -1
          • Output shows all groups available in the bucket
        • to delete all of them, run flushall (careful: this wipes every key in the Redis instance, not just the group list)
    • To repopulate the cache, run the cron job manually:
      • Use kubectl, Lens, k8s etc
      • confirm I’m in the correct cluster
        • e.g. the rp006 app cache manager cron job is actually in the prod-cluster, not rp006-prod
      • in workloads, find the dash-phenoapp-v2-cache-manager-rxrx-cronjob
      • trigger it (starts a manual deployment of the cron job)
      • view the cron job in the list of running workload cron jobs
      • In RP006 Argo:
        • look up rp006-neuro-cache-manager-rxrx-cronjob-config
        • compare its image tag (near top of page under “images” after clicking on a pod) to the latest in eng-infra
      • verify success by looking at the list of successful jobs (after
      • if there’s an issue, check out the GCP logs explorer + look for an error + read its stack trace
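
A minimal sketch of the two scriptable steps above: listing the groups and kicking off the cron job by hand. The key name and cron job name come from the notes; the job name, the kubectl context name, and the idea of passing in a redis-py style client are assumptions for illustration. A namespace flag may also be needed, which the notes don't mention.

```python
import shlex

# Key name taken from the notes above (defined in caches.py).
AVAILABLE_GROUPS_KEY = "available-groups"


def list_available_groups(client):
    """Mirror `zrange available-groups 0 -1` using any redis-py style client."""
    raw = client.zrange(AVAILABLE_GROUPS_KEY, 0, -1)
    # redis-py returns bytes unless decode_responses=True, so normalise to str.
    return [name.decode() if isinstance(name, bytes) else name for name in raw]


def manual_cronjob_command(cronjob, job_name, context):
    """Build the kubectl call that starts a one-off Job from an existing CronJob."""
    return [
        "kubectl", "--context", context,
        "create", "job", job_name, f"--from=cronjob/{cronjob}",
    ]


if __name__ == "__main__":
    cmd = manual_cronjob_command(
        cronjob="dash-phenoapp-v2-cache-manager-rxrx-cronjob",  # name from the notes
        job_name="cache-manager-manual-run",  # hypothetical job name
        context="prod-cluster",  # assumes the kubectl context is named after the cluster
    )
    # Print rather than execute, so the command can be reviewed first.
    print(shlex.join(cmd))
```

Printing the command instead of running it keeps the "confirm I'm in the correct cluster" check as a manual step.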

Denton & Michael Haines AMA on March 29, 2023…

  • Kafka: event published when group created
    • in phenoservice-api
    • think of Kafka as another API you talk to through consumers
  • Auth:
  • Biggest advice = use the logs
  • Phenoreader = practice downloading and aggregating group data
    • Practice looking at production groups
    • Write a quick script that downloads data from the cache
    • First step to debugging a “why is pert X missing” = see if it’s in the cache data
      • Maybe it’s not in the cache
      • Maybe it’s their app settings
    • Not all in README
    • Denton has a branch that has an example of how to import cache-manager and use it to download data to your local
  • BioHive = practice accessing it
    • it’s a supercomputer
    • you can run things faster using it than using our laptops
    • it can be used remotely
    • it has a dedicated fast network connection to Google Cloud, so downloads are WAY faster to BioHive than over my home network
    • a huge amount of the time spent investigating group data is just waiting for things to download
    • see https://github.com/recursionpharma/data-science-onboarding for setup instructions
    • You ssh into it and are in a brand new linux env
      • Need to set up my git credentials, install pyenv, etc
    • port forwarding to BioHive lets me use my laptop browser while Jupyter Lab runs on BioHive: sft ssh bh-login001 -L localhost:8888:localhost:8888
  • Troubleshooting scenarios:
    • why does X map look funny?
    • why can’t I access X map?
    • why is X pert/group not in MapApp?
  • Auth:
    • Google groups vs okta groups
    • Pomerium only uses Okta groups
    • We’re just a customer of that system; the security team manages it
    • Ask Ram 🙂
    • Pomerium used for rp006
      • currently no pomerium ingress for the mapapp because it uses catalyst
  • GCP
    • principal cluster
      • used to be named primary cluster
      • where the mapapp is deployed
      • started segmenting things more:
    • rp006-prod cluster
      • includes Pomerium
      • intended for rp006 stuff
    • prod-cluster
      • includes Pomerium
      • new things should go here
      • intended for internal stuff

Science background Qs

  • Who uses the MapApp?
    • used by inference scientists
  • What do they use the MapApp for?
    • investigating monogenic diseases (diseases involving one gene)
      • referred to by gene
      • experiments use cells where that gene has been turned off by one method or another
        • mimicking disease via gene editing
        • e.g. by CRISPR
      • images are taken of how the phenome (appearance) of the cells changes before and after applying the compounds
        • training a deep learning classifier to recognize healthy vs diseased cells (the cell’s “phenoprint”)
        • images are a cheaper way to find promising compounds and fail faster on the rest
      • those images are translated into vectors (arrays of floats)
        • vectors cluster by similar diseases
      • those vectors are translated into cosine similarity scores
    • mapping cosine similarity (cosine of the angle between two vectors, shown as a %)
      • red = similar
      • blue = opposite
    • looking for compounds that successfully counteract those diseases
    • Recursion’s product will eventually be drugs, but currently it’s the analysis of drugs, which is done in part via the PhenoApp

Technical background Qs

Building new groups

  • Cron job: how does the cron job that builds the map work?
    • The cron job is defined in [eng-infrastructure/kube/principal/dash-phenoapp-v2/cache-manager.yaml](https://github.com/recursionpharma/eng-infrastructure/blob/trunk/kube/principal/dash-phenoapp-v2/cache-manager.yaml)
    • It runs every Thursday evening at 6pm MT (1am GMT) and finishes around 7pm
    • The configome.group-auto-loader.prod key defines:
      • transformations: which post-embedding transformations are run on the new data
        • each transformation outputs a new group
        • here’s an example PR adding the _prox_bias_reduced transformation to the list
        • the last transformation in the list will become the first group in the group dropdown
  • Cron job: how to do a dry run (e.g. after updating it)?
    1. Point your configome.yaml to the prod phenoservice API
    2. Make sure that dry run is set to True there
    3. Make sure the transformations & dl model match the production version’s config (in the eng-infrastructure repo).
    4. Then you should just be able to run the script like python phenoapp/group_auto_loader.py
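
Steps 2 and 3 above are easy to get wrong by eye, so here is a rough pre-flight check. It assumes both configs have already been loaded into plain dicts; the key names `dry_run`, `transformations`, and `dl_model` are hypothetical stand-ins for whatever the real configome keys are called.

```python
def dry_run_problems(local_cfg, prod_cfg):
    """Return a list of reasons the local config is not ready for a dry run."""
    problems = []
    # Step 2: dry run must be explicitly switched on.
    if local_cfg.get("dry_run") is not True:
        problems.append("dry_run must be True")
    # Step 3: transformations and DL model must match production.
    for key in ("transformations", "dl_model"):  # hypothetical key names
        if local_cfg.get(key) != prod_cfg.get(key):
            problems.append(
                f"{key} differs from production: "
                f"{local_cfg.get(key)!r} != {prod_cfg.get(key)!r}"
            )
    return problems


if __name__ == "__main__":
    # Made-up values; only _prox_bias_reduced is a name from the notes.
    prod = {"transformations": ["_prox_bias_reduced"], "dl_model": "model-x"}
    local = {"dry_run": True, "transformations": [], "dl_model": "model-x"}
    for problem in dry_run_problems(local, prod):
        print(problem)
```

An empty list means the local config is at least consistent with production on these three points; it says nothing about whether the API endpoint in step 1 is set correctly.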

Caching groups

  • Why cache group data?

    • Groups are HUGE — like, ~2 GB or so
    • Requesting
  • L1 cache: what is it + how is it populated?

  • L2 cache: what is it + how is it populated?

  • L3 cache: what is it + how is it populated?

  • How do the MapApp’s different cache layers work?

    L1 cache

    • in-memory python cache
    • 6 dataframes
    • all from default group
    • we have logic to determine most used groups as well, but currently it’s pointless since we don’t have room to add them (since each group includes 8 dataframes when you include split by + normalized variants)

    L2 cache

    • Redis cache
    • most recent 3 built groups
    • group built last becomes default group
      • Conor on team that updates the building logic
    • corresponds to top 3 groups in the Group Label dropdown

    L3 cache

  • How is the MapApp Pandas DataFrame structured?
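
To picture the L1/L2/L3 layering described above, here is a toy lookup that checks the fastest layer first and falls through to slower ones. Plain dicts stand in for the in-memory cache, Redis, and the GCS bucket. Whether the real app falls through on a miss like this is an assumption, and the class and group names are made up.

```python
class TieredGroupCache:
    """Toy model of the three cache layers; each layer is any dict-like object."""

    def __init__(self, l1, l2, l3):
        # Ordered fastest to slowest.
        self.layers = (("L1", l1), ("L2", l2), ("L3", l3))

    def get(self, group):
        """Return (data, layer_name) from the first layer that holds the group."""
        for name, layer in self.layers:
            if group in layer:
                return layer[group], name
        raise KeyError(group)


if __name__ == "__main__":
    all_groups = {f"group-{i}": f"data-{i}" for i in range(5)}  # L3: everything
    recent = {k: all_groups[k] for k in ("group-4", "group-3", "group-2")}  # L2: last 3 built
    default = {"group-4": all_groups["group-4"]}  # L1: default group only
    cache = TieredGroupCache(default, recent, all_groups)
    for group in ("group-4", "group-2", "group-0"):
        print(group, "served from", cache.get(group)[1])
```

The sketch only reads; it does not promote a group into a faster layer after a miss, since the notes say population is done by the cache manager cron job and pod restarts.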

Adding the right groups to the L1 cache

  • Manually restart pods after cron job completes…
  • After each cron job run adding new groups to the map, we manually restart the pod

General

  • local redis cache
  • Benchmark database
    • no local access without configome changes
    • “structure” endpoints from ci-report
  • Side panel
    • GO (Gene Ontology) terms
  • App updates generated + cached weekly on Monday nights
  • psycopg2 = a PostgreSQL database adapter (driver) for Python; it runs SQL directly and is not an ORM
  • we want the app to be useful for hypothesis generation
    • help scientists narrow from 2.2 trillion inferences to the most useful few
    • we want the app to present an intuitive workflow for narrowing down to these hypotheses
  • we also want to surface novel insights (the Recursion Advantage; things only we’ve discovered) from known insights
    • there’s no competitive advantage to exploring biological relationships our competitors also know about (and may be exploring)

Heatmap

  • Images → Vectors (arrays of 128 floats representing 128 dimensions) → Cosine similarity (cosine of the angle between two vectors)
  • Vectors
    • normalized to a magnitude (length) of one (to make them comparable)
  • Cosine similarity
    • more similar = red
    • more opposite = blue
  • X-axis = query perturbations, including all concentrations
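
A small pure-Python sketch of the vector → cosine similarity step, using 3 dimensions instead of 128 so it is readable. The function names are mine, not the app's. Once both vectors are scaled to length one, the cosine similarity is just their dot product.

```python
import math


def normalize(vector):
    """Scale a vector to magnitude 1 (raises ZeroDivisionError for a zero vector)."""
    length = math.sqrt(sum(x * x for x in vector))
    return [x / length for x in vector]


def cosine_similarity(a, b):
    """Cosine of the angle between a and b: 1 = same direction, -1 = opposite."""
    a_hat, b_hat = normalize(a), normalize(b)
    return sum(x * y for x, y in zip(a_hat, b_hat))


if __name__ == "__main__":
    gene = [1.0, 2.0, 3.0]
    similar = [2.0, 4.0, 6.0]       # same direction, so it would show as red
    opposite = [-1.0, -2.0, -3.0]   # opposite direction, so it would show as blue
    unrelated = [3.0, 0.0, -1.0]    # orthogonal to gene, so similarity is near 0
    for label, vec in (("similar", similar), ("opposite", opposite), ("unrelated", unrelated)):
        print(label, round(cosine_similarity(gene, vec), 3))
```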

Projection/Rejection (in right sidebar)

  • select one target perturbation
    • looking for a target gene edit or compound (compounds also include a concentration)
  • graph shows
    • (0, 1) = control (target)
    • looking for lines angled down and to the right (aiming at the target)
      • means a phenosimilar result between target + compound at that concentration (or between target + that gene edit)

Roche partnership

  • RP006 app = app copy for Roche partners that will live on a separate URL
    • same codebase, though
  • one partnership on neuro
    • going to take longer
    • neurons are finicky
  • another partnership on GI-ONC (gastro-intestinal oncology)
    • no map commitment
    • new map ready end of July
  • will need onboarding, docs
  • 9 users
  • no need to support non-Chrome browsers

Bayer partnership

  • maybe an external app
  • currently optimizing

History

  • in 2020, had to use brute force to find promising compounds (couldn’t infer)

Hypothesis generation

  1. Do I trust the gene’s phenoprint?
  • Test by splitting by experiment (to see how consistent the replicates are), viewed with pairwise display
  2. If it’s a bad phenoprint (I don’t trust it), can I find a second gene that’s phenosimilar to the first gene (“in the same pathway”)?
  • If I were to target the second gene with a compound, would it improve the symptoms of the first gene edit?
  3. Using the result of either (1) or (2), can I find a phenoopposite compound that reverses the gene perturbation?
  • Lovina/Summer/Michael good people to ask about cache logic updates
  • Summer/Michael: familiar with the cron job that builds the map

Grafana Loki queries for debugging

Conversation with MH on Sep 14, 2022: