5 Data wrangling

Once data have been spike sorted, we are ready to begin working in R. To get to a point where meaningful preliminary plots can be produced, a few things need to be addressed:

  1. Labeling the time series of spike & photodiode data based on the stimulus that appears on screen (i.e., matching the log file to the data file). This includes labeling phases (like “blank”, “stationary”, and “moving”) along with experimental metadata such as spatial frequency (SF), temporal frequency (TF), and stimulus orientation (direction).

  2. Re-organizing the spike & photodiode data so that separate replicates of a stimulus run are uniquely labeled and then arranged together.

  3. Binning the data into 10- and 100-ms data sets, and then exporting CSVs of the unbinned, 10-ms-binned, and 100-ms-binned data. This will be highly useful for situations where you are handling multiple data files (e.g., from different recording days), in which case it is likely that your machine will not be able to store all objects in RAM.

Before proceeding any further, please ensure you have installed and loaded all the necessary R packages as detailed in the Preface section of this guide.
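
As a point of reference, here is a minimal sketch of the setup this chapter assumes; the authoritative package list lives in the Preface. Note in particular that the quality-control checks below use a %not_in% operator, which is assumed to be defined (in the Preface) as the negation of %in%:

## Minimal setup sketch; see the Preface for the full package list
library(tidyverse) ## tibble, dplyr, stringr, readr, tidyr, purrr, etc.
library(R.matlab)  ## readMat(), for importing .mat files
library(zoo)       ## na.locf(), for carrying metadata forward

## Negation of %in%, used in the quality-control checks below
`%not_in%` <- function(x, table) !(x %in% table)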

5.1 Import example file and metadata

We will use a recently collected data file and its corresponding metadata file to showcase the fundamentals of wrangling ephys data into a more easily plottable format.

5.1.1 Identify files to import

The following code assumes that:

  1. Your files are stored in a directory named ./data

  2. The basename of each file (i.e., the name of the file, excluding the file extension) is identical for each set of spike sorted data and corresponding metadata log file (e.g., 2023-02-16_001.mat and 2023-02-16_001.csv have the same basename, which is 2023-02-16_001).
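
In other words, the directory layout assumed here looks like this (using the example basename above):

./data/2023-02-16_001.csv   ## stimulus log (metadata)
./data/2023-02-16_001.mat   ## spike sorted data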

## List all files of each file type
## (anchor the patterns so only true .csv/.mat file extensions match)
csv_files <-
  list.files("./data", pattern = "\\.csv$",
             full.names = TRUE)
mat_files <-
  list.files("./data", pattern = "\\.mat$",
             full.names = TRUE)

## Generate metadata tibbles for each file type
csv_file_info <-
  tibble(
    csv_files = csv_files,
    ## Extract the basename by removing the file extension
    basename = basename(csv_files) %>% str_remove("\\.csv$"),
    ## NOTE: PLEASE ADJUST THE FOLLOWING LINE TO EXTRACT THE DATE
    ## BASED ON YOUR NAMING CONVENTION
    basedate = basename(csv_files) %>% str_sub(start = 1, end = 12)
  )
mat_file_info <-
  tibble(
    mat_files = mat_files,
    ## Extract the basename by removing the file extension
    basename = basename(mat_files) %>% str_remove("\\.mat$"),
    ## NOTE: AGAIN, PLEASE ADJUST THE FOLLOWING LINE TO EXTRACT THE DATE
    ## BASED ON YOUR NAMING CONVENTION
    basedate = basename(mat_files) %>% str_sub(start = 1, end = 12)
  )

## Matchmake between .MAT data and .CSV log files
csv_mat_filejoin <-
  inner_join(csv_file_info, mat_file_info, by = "basename") %>%
  ## OPTIONAL STEP: remove any rows where either the .MAT or .CSV is missing
  drop_na()

## Store a vector of basenames in the environment. This will become useful later
base_names <- csv_mat_filejoin$basename

## Your end products from this code block should look something like:
csv_mat_filejoin
## # A tibble: 1 × 5
##   csv_files                 basename       basedate.x   mat_files        based…¹
##   <chr>                     <chr>          <chr>        <chr>            <chr>  
## 1 ./data/2023-02-16_001.csv 2023-02-16_001 2023-02-16_0 ./data/2023-02-… 2023-0…
## # … with abbreviated variable name ¹​basedate.y
## and:
base_names
## [1] "2023-02-16_001"

5.1.2 Data import and preliminary labeling

We will now use the R.matlab package to import the .mat file into R and then label the spike and photodiode time series based on the information in the .csv log file.

Because .mat files can be large, data import can take several minutes.

Please see the in-line comments for further guidance.

## Set up empty objects that will collect the sets of replicates we will be
## splitting up
metadata_sets <- NULL
meta_splits <- NULL
data_splits <- NULL
gc()
##           used  (Mb) gc trigger  (Mb) max used  (Mb)
## Ncells 2454829 131.2    3878522 207.2  3878522 207.2
## Vcells 4234160  32.4   10146329  77.5  6629589  50.6
starttime <- Sys.time() ## Optional, will help you assess run time
for (i in 1:nrow(csv_mat_filejoin)) {

  ## Which file # are we working on?
  print(i)

  ## Set up temporary objects in which to eventually write data
  csv_data_sets <- NULL
  mat_data_sets <- NULL
  joined_data_sets <- NULL

  ## Import the matlab file. This may take some time
  mat_import <-
    ## as.character() because indexing a tibble returns a 1x1 tibble,
    ## not a plain string
    R.matlab::readMat(as.character(csv_mat_filejoin[i, "mat_files"]))

  ## Read in the corresponding csv log file
  csv_data_sets[[i]] <-
    read_csv(as.character(csv_mat_filejoin[i,"csv_files"]),
             show_col_types = FALSE) %>%
    ## Rename columns for convenience
    rename(
      Spatial_Frequency = `Spatial Frequency`,
      Temporal_Frequency = `Temporal Frequency`,
      Cycles_Per_Pixel = `Cycles Per Pixel`
    )

  ## Determine if spatial frequencies need to be transformed from
  ## cycles/pixel to cycles/degree
  sfs <- csv_data_sets[[i]]$Spatial_Frequency %>% unique() %>% sort()
  cpps <- c(0.000668, 0.001336, 0.002670, 0.005300, 0.010600, 0.021200)
  ## Guard on length first so files with a different SF set don't trigger
  ## recycling issues in the comparison
  if (length(sfs) == length(cpps) && all(sfs == cpps)) {
    ## If true, convert to cpd using the following mapping:
    csv_data_sets[[i]]$Spatial_Frequency[csv_data_sets[[i]]$Spatial_Frequency == 0.000668] <-
      2^-6
    csv_data_sets[[i]]$Spatial_Frequency[csv_data_sets[[i]]$Spatial_Frequency == 0.001336] <-
      2^-5
    csv_data_sets[[i]]$Spatial_Frequency[csv_data_sets[[i]]$Spatial_Frequency == 0.00267]  <-
      2^-4
    csv_data_sets[[i]]$Spatial_Frequency[csv_data_sets[[i]]$Spatial_Frequency == 0.0053]   <-
      2^-3
    csv_data_sets[[i]]$Spatial_Frequency[csv_data_sets[[i]]$Spatial_Frequency == 0.0106]   <-
      2^-2
    csv_data_sets[[i]]$Spatial_Frequency[csv_data_sets[[i]]$Spatial_Frequency == 0.0212]   <-
      2^-1
  }

  ## The log file does not have time = 0, so set up a separate tibble to
  ## add this info in later. Some of the metadata will just be filler for now.
  initial <- tibble(
    Trial = "initialization",
    Spatial_Frequency =  csv_data_sets[[i]]$Spatial_Frequency[1],
    Cycles_Per_Pixel  =  csv_data_sets[[i]]$Cycles_Per_Pixel[1],
    Temporal_Frequency = csv_data_sets[[i]]$Temporal_Frequency[1],
    Direction  = csv_data_sets[[i]]$Direction[1],
    Time = 0.000
  )

  ## Find photodiode
  ## It is almost always in channel 2, but we should be sure to check before
  ## extracting automatically
  photod_default_channel <-
    mat_import[[stringr::str_which(names(mat_import), "Ch2")[1]]]
  if (!photod_default_channel[1][[1]][1] == "waveform") {
    warning("File ", i,": Photodiode channel identity uncertain")
  }

  ## Find spikes
  ## Similarly, spikes are almost always in channel 5, but we should check
  spikes_default_channel <-
    mat_import[[stringr::str_which(names(mat_import), "Ch5")[1]]]
  if("codes" %not_in% attributes(spikes_default_channel)$dimnames[[1]]) {
    warning("File ", i,": Sorted spikes channel identity uncertain")
  }
  ## If that worked, see if we can automatically determine the "times" and
  ## "codes" slot numbers
  times_location <-
    which(attributes(spikes_default_channel)$dimnames[[1]] == "times")
  codes_location <-
    which(attributes(spikes_default_channel)$dimnames[[1]] == "codes")

  ## Find matlab's stimulus change log
  stim_change_channel <-
    mat_import[[stringr::str_which(names(mat_import), "Ch3")[1]]]
  ## Each sweep should be 5 secs. We'll check that the median is 5
  ## If this results in an error, then the channel identity could be wrong, or
  ## there may have been an issue with sweep duration during the recording
  ## process
  if(!median(round(diff(stim_change_channel[[5]][,1]),0)) == 5) {
    warning("File ", i,": stim change channel identity uncertain")
  }

  ## Determine when the onset of motion occurred according to matlab
  first_moving_mat <-
    stim_change_channel[[5]][,1][1]
  ## Find the first "moving" phase in the log file
  first_moving_csv <-
    csv_data_sets[[i]] %>%
    filter(Trial == "moving") %>%
    select(Time) %>%
    slice(1) %>%
    as.numeric()
  ## Find the first "blank" phase in the log file
  first_blank <-
    csv_data_sets[[i]] %>%
    filter(Trial == "blank") %>%
    select(Time) %>%
    slice(1) %>%
    as.numeric()
  ## Compute the difference between these two
  first_mvbl_diff <- first_moving_csv - first_blank

  ## Check to see if the final row of the metadata is "moving" and truncate
  ## if not
  ## This can effectively be done by truncating after the final "moving" phase
  max_moving_sweep <-
    max(which(csv_data_sets[[i]]$Trial == "moving"))

  first_csv_tmp <-
    bind_rows(initial, csv_data_sets[[i]]) %>%
    ## Add 1 to max moving sweep since we tacked on "initial" in previous step
    ## Then slice to restrict any extraneous partial sweeps
    slice_head(n = (max_moving_sweep + 1)) %>%
    ## Add the first event time to "Time" and subtract first_mvbl_diff (~2 secs)
    ## What this does is shift the log csv's time stamping to match the matlab
    ## file's stim change channel's time stamping
    mutate(Time = Time + first_moving_mat - first_mvbl_diff - first_blank) %>%
    ## Make character version of Time for joining later
    ## This will be crucial for _join functions
    mutate(Time_char = as.character(round(Time,3)))

  ## Duplicate the initialization for ease of setting T0
  inception <-
    initial %>%
    mutate(Time_char = as.character(round(Time,3)))
  inception$Trial[1] <- "inception"

  ## Bind the initialization rows
  first_csv <-
    bind_rows(inception, first_csv_tmp)
  ## Compute stimulus end times
  first_csv$Stim_end <- c(first_csv$Time[-1], max(first_csv$Time) + 3)

  ## Get final time
  final_time <- first_csv$Stim_end[nrow(first_csv)]

  ## Extract photodiode data
  ## First generate a time sequence to match to the photodiode trace
  Time_vec <- seq(
    from = 0.0000,
    by = 1 / 25000,
    length.out = length(photod_default_channel[9][[1]][, 1])
  )
  ## The key thing is to get a character version of time from this
  Time_char_vec <- as.character(round(Time_vec, 3))

  ## Grab the photodiode data
  photod_full <-
    tibble(Photod =
             photod_default_channel[9][[1]][, 1])
  ## Add numeric time (the same sequence as Time_vec above)
  photod_full$Time <- Time_vec
  options(scipen = 999)
  photod_full <-
    photod_full %>%
    ## Add the character time
    add_column(Time_char = Time_char_vec) %>%
    ## Use the character time to define a group
    group_by(Time_char) %>%
    ## Then average the photodiode within
    summarise(Photod = mean(Photod)) %>%
    ungroup() %>%
    mutate(Time = round(as.numeric(Time_char), 3)) %>%
    arrange(Time) %>%
    filter(Time <= final_time)

  ## Extract all spike data
  all_spike_dat <-
    tibble(
      Time =
        spikes_default_channel[times_location][[1]][, 1],
      code =
        spikes_default_channel[codes_location][[1]][, 1]) %>%
    ## Characterize time, for purposes of joining later
    mutate(Time_char = as.character(round(Time, 3)))

  ## How many distinct neurons are there?
  cell_ids <- sort(unique(all_spike_dat$code))
  n_cells <- 1:length(cell_ids)

  if(length(n_cells) > 1) { ## if there's more than one distinct neuron
    all_spike_dat_tmp <-
      all_spike_dat %>%
      ## Group by identity of spiking neuron
      group_by(code) %>%
      ## Split into separate dfs, one per neuron
      group_split()

    ## Additional cells are labeled as "Spikes_n"
    all_cells <- NULL
    for (j in n_cells) {
      #print(j)

      new_name <- paste0("Spikes_", cell_ids[j])
      all_cells[[j]] <-
        all_spike_dat_tmp[[j]]

      ## Consolidate to 3 decimal places
      all_cells[[j]] <-
        all_cells[[j]] %>%
        group_by(Time_char) %>%
        summarise(code = mean(code)) %>%
        mutate(code = ceiling(code)) %>%
        ungroup() %>%
        mutate(Time = round(as.numeric(Time_char), 3)) %>%
        arrange(Time) %>%
        filter(Time <= final_time)

      names(all_cells[[j]])[match("code", names(all_cells[[j]]))] <-
        new_name
      ## Replace "j" with 1 to indicate presence/absence of spike rather than
      ## cell identity
      all_cells[[j]][new_name] <- 1

      ## If the identity is 1, replace "Spikes_1" with just "Spikes"
      if (new_name == "Spikes_1") {
        names(all_cells[[j]])[match(new_name, names(all_cells[[j]]))] <-
          "Spikes"
      }
    }

    ## Consolidate
    all_spike_dat <-
      all_cells %>%
      ## Tack on additional spike columns
      reduce(full_join, by = "Time_char") %>%
      arrange(Time_char) %>%
      ## Remove time.n columns but
      ## Do not remove Time_char
      select(-contains("Time.")) %>%
      ## Regenerate numeric time
      mutate(
        Time = as.numeric(Time_char)
      ) %>%
      select(Time, Time_char, Spikes, everything()) %>%
      filter(Time <= final_time)

  } else { ## If there's only 1 neuron
    all_spike_dat <-
      all_spike_dat %>%
      group_by(Time_char) %>%
      summarise(code = mean(code)) %>%
      mutate(code = ceiling(code)) %>%
      ungroup() %>%
      rename(Spikes = code) %>%
      mutate(Time = round(as.numeric(Time_char), 3)) %>%
      arrange(Time) %>%
      filter(Time <= final_time) %>%
      select(Time, Time_char, everything())
  }

  options(scipen = 999)
  mat_data_sets[[i]] <-
    ## Generate a time sequence from 0 to final_time
    tibble(
      Time = seq(from = 0, to = final_time, by = 0.001)
    ) %>%
    ## Character-ize it, rounding to 3 decimals to match the join keys
    ## created above
    mutate(Time_char = as.character(round(Time, 3))) %>%
    ## Join in the photodiode data
    left_join(photod_full, by = "Time_char") %>%
    select(-Time.y) %>%
    rename(Time = Time.x) %>%
    arrange(Time) %>%
    ## Join in the spike data
    left_join(all_spike_dat, by = "Time_char") %>%
    select(-Time.y) %>%
    rename(Time = Time.x) %>%
    arrange(Time) %>%
    filter(Time <= final_time) %>%
    ## Replace NAs with 0 in the Spike columns only
    mutate(
      across(starts_with("Spikes"), ~replace_na(.x, 0))
    )

  ## Merge the matlab data with the metadata
  joined_one_full <-
    mat_data_sets[[i]] %>%
    ## Join by the character version of time NOT the numerical!!
    full_join(first_csv, by = "Time_char") %>%
    ## Rename columns for clarity of reference
    rename(Time_mat = Time.x,
           Time_csv = Time.y) %>%
    ## Convert character time to numeric time
    mutate(Time = round(as.numeric(Time_char), 3)) %>%
    ## Carry metadata forward
    mutate(
      Trial = zoo::na.locf(Trial, type = "locf"),
      Spatial_Frequency = zoo::na.locf(Spatial_Frequency, type = "locf"),
      Cycles_Per_Pixel  = zoo::na.locf(Cycles_Per_Pixel, type = "locf"),
      Temporal_Frequency = zoo::na.locf(Temporal_Frequency, type = "locf"),
      Direction = zoo::na.locf(Direction, type = "locf"),
      Time_csv = zoo::na.locf(Time_csv, type = "locf"),
      Stim_end = zoo::na.locf(Stim_end, type = "locf")
    ) %>%
    ## Calculate velocity
    mutate(
      Speed = round(Temporal_Frequency/Spatial_Frequency, 0),
      Log2_Speed = log2(Speed)
    )

  ## Add info to metadata
  metadata_one_full <-
    first_csv %>%
    mutate(
      Speed = round(Temporal_Frequency/Spatial_Frequency, 0),
      Log2_Speed = log2(Speed),
      Stim_end_diff = c(0, diff(Stim_end))
    )

  ## Some quality control checks
  ## What are the stim time differences?
  stimtime_diffs <- round(metadata_one_full$Stim_end_diff)[-c(1:2)]
  ## How many total reps were recorded?
  stimtime_reps <- length(stimtime_diffs)/3
  ## What do we expect the overall structure to look like?
  stimtime_expectation <- rep(c(1,1,3), stimtime_reps)
  ## Does reality match our expectations?
  if (!all(stimtime_diffs == stimtime_expectation)) {
    ## If you get this warning, investigate the file further and determine
    ## what went wrong
    warning("File ", i, ": stimtime issue; investigate")
  }

  ## Sometimes the final sweep gets carried for an indefinite amount of time
  ## before the investigator manually shuts things off. The following block
  ## truncates accordingly
  mark_for_removal <-
    which(round(metadata_one_full$Stim_end_diff) %not_in% c(1, 3))
  if (any(mark_for_removal == 1 | mark_for_removal == 2)) {
    mark_for_removal <- mark_for_removal[mark_for_removal > 2]
  }
  if (length(mark_for_removal) > 0) {
    metadata_sets[[i]] <-
      metadata_one_full %>%
      filter(Time < metadata_one_full[mark_for_removal[1],]$Time)
    joined_data_sets[[i]] <-
      joined_one_full %>%
      filter(Time_mat < metadata_one_full[mark_for_removal[1],]$Time)
  } else {
    metadata_sets[[i]] <-
      metadata_one_full
    joined_data_sets[[i]] <-
      joined_one_full
  }

  ## Organize the metadata for export in the R environment
  meta_splits[[i]] <-
    metadata_sets[[i]] %>%
    ## Get rid of the non-sweep info
    filter(!Trial == "inception") %>%
    filter(!Trial == "initialization") %>%
    ## Group by trial
    #group_by(Spatial_Frequency, Temporal_Frequency, Direction) %>%
    ## Split by trial group
    group_split(Spatial_Frequency, Temporal_Frequency, Direction)

  data_splits[[i]] <-
    joined_data_sets[[i]] %>%
    ## Get rid of the non-sweep info
    filter(!Trial == "inception") %>%
    filter(!Trial == "initialization") %>%
    ## Group by trial
    group_by(Spatial_Frequency, Temporal_Frequency, Direction) %>%
    ## Split by trial group
    group_split()

  ## Do some cleanup so large objects don't linger in memory
  ## (removal is done via intersect() with ls() because single-neuron files
  ## never create all_spike_dat_tmp or all_cells, and a plain rm() would
  ## warn about them)
  rm(list = intersect(
    c("first_csv", "inception", "initial", "mat_import", "first_csv_tmp",
      "photod_default_channel", "spikes_default_channel", "photod_full",
      "all_spike_dat", "all_spike_dat_tmp", "all_cells",
      "metadata_one_full", "joined_one_full", "joined_data_sets",
      "csv_data_sets", "mat_data_sets"),
    ls()
  ))
  message("File ", i, ": ", csv_mat_filejoin[i,"basename"], " imported")
  gc()
}
## [1] 1
endtime <- Sys.time()
endtime - starttime ## Total elapsed time
## Time difference of 2.67673 mins
## Tidy up how R has been using RAM by running garbage collection
gc()
##             used   (Mb) gc trigger   (Mb)   max used    (Mb)
## Ncells   5596189  298.9   16196008  865.0   25306262  1351.6
## Vcells 204523381 1560.4  915557392 6985.2 1430558425 10914.3
## Name each data set according to the basename of the file
names(metadata_sets) <- csv_mat_filejoin$basename #base_names
names(meta_splits)   <- csv_mat_filejoin$basename #base_names
names(data_splits)   <- csv_mat_filejoin$basename #base_names

## Get organized lists of stimuli that were used
## This will ultimately be used for arranging data by stimuli in a sensible
## order
metadata_combos <- NULL
for (i in 1:length(metadata_sets)) {
  metadata_combos[[i]] <-
    metadata_sets[[i]] %>%
    ## Get unique stimulus parameters
    distinct(Spatial_Frequency, Temporal_Frequency, Speed, Direction) %>%
    ## Sort by SF (largest to smallest), breaking ties by Direction
    arrange(desc(Spatial_Frequency), Direction) %>%
    mutate(
      ## Set a "plot_order" which provides a running order of stimuli
      plot_order = row_number(),
      name = paste(Direction, "Deg,",  Speed, "Deg/s")
    )
}
names(metadata_combos) <- csv_mat_filejoin$basename #base_names

What we now have is the following:

  • metadata_sets: contains a list of tibbles (one per imported file), each of which comprises the stimulus log
  • metadata_combos: contains a list of tibbles (one per imported file), each of which comprises the stimulus log re-written in a preferred plotting order
  • meta_splits: a list of lists. The first level of the list corresponds to the file identities. Within each file’s list, there is an additional set of lists, each of which contains a tibble with the log info for a specific stimulus combination. Essentially the same as metadata_sets but split up on a per-stimulus basis.
  • data_splits: another list of lists, arranged in the same hierarchical order as meta_splits. Each tibble herein contains the spike and photodiode data from the matlab file on a per-stimulus basis.

Note that since we are only using one example file, each of these lists should only be 1 element long (with the latter two having additional elements within the first element).
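
A few quick ways to sanity-check these objects before moving on (a minimal sketch; the exact values will depend on your own files):

## Top level of each object = one entry per imported file
length(metadata_sets)    ## 1, since we imported a single file
names(data_splits)       ## "2023-02-16_001"
## Within a file, one entry per distinct stimulus combination
length(meta_splits[[1]])
## The log rows for the first stimulus combination
meta_splits[[1]][[1]]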

5.2 Organizing replicates (required) and binning (optional)

It is common to record several different sweeps of a stimulus to collect (somewhat) independent replicates of neural responses. The primary task of this section will be to use the lists from the previous section to gather & label replicates of the same stimulus.

A secondary task is to deal with binning, if desired (highly recommended). Depending on the goals of plotting and/or analyses, it may be wise to work with either unbinned spike data or to bin the data at a convenient and sensible interval. I generally choose to work at the following levels:

  1. Unbinned
  2. Bin size = 10 ms
  3. Bin size = 100 ms

Important note: Rather than provide separate code for each of these 3 scenarios, I will provide one example. Bin size must be declared beforehand, so please pay attention.

## Set bin size here
## Units are in ms (e.g., 10 = 10 ms)
bin_size <- 10 ## 10 or 100 or 1 (1 = "unbinned")

## slice_size is the number of bins per replicate: each 5-s replicate
## (1-s blank + 1-s stationary + 3-s moving) spans about 5001 samples on
## the 1-ms time grid (both endpoints included), so 10-ms bins give
## ceiling(5001/10) = 501 bins and 100-ms bins give ceiling(5001/100) = 51.
## slicemin and slicemax are set here for use later on.
slice_size <- NULL
slicemin <- NULL
slicemax <- NULL
condition <- NULL
if (bin_size == 10) {
  slice_size <- 501
  slicemin <- 202
  slicemax <- 498
  condition <- "_binsize10"
} else if (bin_size == 100) {
  slice_size <- 51
  slicemin <- 21
  slicemax <- 49
  condition <- "_binsize100"
} else if (bin_size == 1) {
  slice_size <- NULL
  slicemin <- NULL
  slicemax <- NULL
  condition <- "_unbinned"
} else {
  stop("bin_size is non-standard")
}


all_replicate_data_reorganized <-
  vector(mode = "list", length = length(meta_splits))
name_sets <-
  vector(mode = "list", length = length(meta_splits))
gc()
##             used   (Mb) gc trigger   (Mb)   max used    (Mb)
## Ncells   5596004  298.9   16196008  865.0   25306262  1351.6
## Vcells 204513787 1560.4  732445914 5588.2 1430558425 10914.3
starttime <- Sys.time()
for (i in 1:length(meta_splits)){

  ## i = file number
  print(i)

  ## We'll need to collect data at a per-stimulus level and on a per-replicate
  ## level within the per-stimulus level
  ## "j" will be used to designate a unique stimulus
  ## We'll first create an empty object in which to collect stimulus-specific
  ## data
  replicate_data_reorganized <- NULL
  ## For each of j unique stimuli...
  for (j in 1:length(meta_splits[[i]])) { # j = {SF, TF, direction}
    ## Isolate the j-th data
    d <- data_splits[[i]][[j]]
    ## And the j-th log data
    m <- meta_splits[[i]][[j]] %>%
      group_by(Trial) %>%
      ## Label separate replicates
      mutate(Replicate = row_number())

    ## Extract a stimulus label to a name_set that will be used later
    name_sets[[i]][[j]] <-
      paste(m$Direction[1], "Deg,",  m$Speed[1], "Deg/s")

    ## Set up a temporary object to deal with per-replicate data
    replicates_ordered <- NULL
    ## "k" will be used to designate replicate number
    for (k in 1:max(m$Replicate)){
      tmp <-
        m %>%
        filter(Replicate == k)

      ## If you have a complete replicate (i.e., blank, stationary, moving)
      if (nrow(tmp) == 3 ) {
        ## Grab the specific data
        doot <-
          d %>%
          filter(Time >= min(tmp$Time)) %>%
          filter(Time <= max(tmp$Stim_end))
        ## Add bin information
        doot$bin <-
          rep(1:ceiling(nrow(doot)/bin_size), each = bin_size)[1:nrow(doot)]

        if (bin_size == 1) {
          ## IF YOU ARE NOT BINNING, RUN THIS:
          replicates_ordered[[k]] <-
            doot %>%
            mutate(
              ## Construct a standardized time within the sweep
              Time_stand = Time_mat - min(Time_mat),
              ## When does the sweep begin
              Time_begin = min(Time_mat),
              ## When does the sweep end
              Time_end = max(Time_mat),
              ## Delineate the end of the blank phase
              Blank_end = tmp$Stim_end[1] - min(Time_mat),
              ## Delineate the end of the stationary phase
              Static_end = tmp$Stim_end[2] - min(Time_mat),
              ## Label the replicate number
              Replicate = k
            ) %>%
            ## Bring stim info to first few columns
            select(Speed, Spatial_Frequency, Temporal_Frequency, Direction,
                   everything()) %>%
            ## Just in case there is some overhang
            filter(Time_stand >= 0)
        } else { ## IF YOU ARE BINNING, RUN THIS:

          ## First grab time and meta info
          time_and_meta <-
            doot %>%
            ## WITHIN EACH BIN:
            group_by(bin) %>%
            summarise(
              ## Label the trial
              Trial = first(Trial),
              ## Midpoint of bin
              Time_bin_mid = mean(Time_mat),
              ## Bin beginning
              Time_bin_begin = min(Time_mat),
              ## Bin end
              Time_bin_end = max(Time_mat),
              ## Spike rate = sum of spikes divided by elapsed time
              Spike_rate = sum(Spikes) / (max(Time_mat) - min(Time_mat)),
              Photod_mean = mean(Photod)
            )

          ## Now deal with Spike_N columns
          hold_spike_n <-
            doot %>%
            select(starts_with("Spikes_")) %>%
            add_column(bin = doot$bin) %>%
            add_column(Time_mat = doot$Time_mat) %>%
            group_by(bin) %>%
            summarise(across(starts_with("Spikes_"),
                             ~ sum(.x) / (max(Time_mat) - min(Time_mat))))

          ## Put them together
          replicates_ordered[[k]] <-
            time_and_meta %>%
            left_join(hold_spike_n, by = "bin") %>%
            mutate(
              ## Add in metadata (following same definitions above)
              Time_stand = Time_bin_mid - min(Time_bin_mid),
              Blank_end = tmp$Stim_end[1] - min(Time_bin_mid),
              Static_end = tmp$Stim_end[2] - min(Time_bin_mid),
              Spatial_Frequency = m$Spatial_Frequency[1],
              Temporal_Frequency = m$Temporal_Frequency[1],
              Speed = m$Speed[1],
              Direction = m$Direction[1],
              Replicate = k
            ) %>%
            ## Bring stim info to first few columns
            select(Speed, Spatial_Frequency, Temporal_Frequency, Direction,
                   everything()) %>%
            ## Just in case there is some overhang
            filter(Time_stand >= 0) %>%
            filter(bin < slice_size + 1)

          rm(time_and_meta, hold_spike_n)
        }
      }
    }

    ## Now insert it within the collector of per-stimulus data
    replicate_data_reorganized[[j]] <-
      replicates_ordered %>%
      bind_rows()

    ## Now insert it within the overall data collector
    all_replicate_data_reorganized[[i]][[j]] <-
      replicate_data_reorganized[[j]]

    ## Toss out temporary objects and clean up
    rm(replicates_ordered, d, m, tmp)
    gc()
  }
}
## [1] 1
endtime <- Sys.time()
endtime - starttime
## Time difference of 2.527167 mins
gc()
##             used   (Mb) gc trigger   (Mb)   max used    (Mb)
## Ncells   5600940  299.2   16196008  865.0   25306262  1351.6
## Vcells 208863355 1593.6  585956732 4470.5 1430558425 10914.3
for (i in 1:length(all_replicate_data_reorganized)) {
  for (j in 1:length(all_replicate_data_reorganized[[i]])) {
    names(all_replicate_data_reorganized[[i]])[[j]] <- name_sets[[i]][[j]]
  }
}
names(all_replicate_data_reorganized) <- csv_mat_filejoin$basename #base_names

At the end of this process, here is how all_replicate_data_reorganized[[1]] should look:

## How long is this list?
length(all_replicate_data_reorganized[[1]])
## [1] 80
## This object contains 80 individual tibbles. Here's an example of one
## pulled at random
all_replicate_data_reorganized[[1]][sample(1:80, size = 1)]
## $`45 Deg, 128 Deg/s`
## # A tibble: 3,507 × 16
##    Speed Spatial_F…¹ Tempo…² Direc…³   bin Trial Time_…⁴ Time_…⁵ Time_…⁶ Spike…⁷
##    <dbl>       <dbl>   <dbl>   <dbl> <int> <chr>   <dbl>   <dbl>   <dbl>   <dbl>
##  1   128      0.0442    5.66      45     1 blank    99.3    99.3    99.3      0 
##  2   128      0.0442    5.66      45     2 blank    99.4    99.4    99.4      0 
##  3   128      0.0442    5.66      45     3 blank    99.4    99.4    99.4      0 
##  4   128      0.0442    5.66      45     4 blank    99.4    99.4    99.4    111.
##  5   128      0.0442    5.66      45     5 blank    99.4    99.4    99.4      0 
##  6   128      0.0442    5.66      45     6 blank    99.4    99.4    99.4      0 
##  7   128      0.0442    5.66      45     7 blank    99.4    99.4    99.4      0 
##  8   128      0.0442    5.66      45     8 blank    99.4    99.4    99.4      0 
##  9   128      0.0442    5.66      45     9 blank    99.4    99.4    99.4      0 
## 10   128      0.0442    5.66      45    10 blank    99.4    99.4    99.4      0 
## # … with 3,497 more rows, 6 more variables: Photod_mean <dbl>, Spikes_0 <dbl>,
## #   Time_stand <dbl>, Blank_end <dbl>, Static_end <dbl>, Replicate <int>, and
## #   abbreviated variable names ¹​Spatial_Frequency, ²​Temporal_Frequency,
## #   ³​Direction, ⁴​Time_bin_mid, ⁵​Time_bin_begin, ⁶​Time_bin_end, ⁷​Spike_rate
## To consolidate this, you can use bind_rows()
all_replicate_data_reorganized[[1]] %>% bind_rows
## # A tibble: 284,067 × 16
##    Speed Spatial_F…¹ Tempo…² Direc…³   bin Trial Time_…⁴ Time_…⁵ Time_…⁶ Spike…⁷
##    <dbl>       <dbl>   <dbl>   <dbl> <int> <chr>   <dbl>   <dbl>   <dbl>   <dbl>
##  1  1024      0.0156      16       0     1 blank    181.    181.    181.      0 
##  2  1024      0.0156      16       0     2 blank    181.    181.    181.      0 
##  3  1024      0.0156      16       0     3 blank    181.    181.    181.      0 
##  4  1024      0.0156      16       0     4 blank    181.    181.    181.      0 
##  5  1024      0.0156      16       0     5 blank    181.    181.    181.      0 
##  6  1024      0.0156      16       0     6 blank    181.    181.    181.      0 
##  7  1024      0.0156      16       0     7 blank    181.    181.    181.      0 
##  8  1024      0.0156      16       0     8 blank    181.    181.    181.    222.
##  9  1024      0.0156      16       0     9 blank    181.    181.    181.    222.
## 10  1024      0.0156      16       0    10 blank    181.    181.    181.    111.
## # … with 284,057 more rows, 6 more variables: Photod_mean <dbl>,
## #   Spikes_0 <dbl>, Time_stand <dbl>, Blank_end <dbl>, Static_end <dbl>,
## #   Replicate <int>, and abbreviated variable names ¹​Spatial_Frequency,
## #   ²​Temporal_Frequency, ³​Direction, ⁴​Time_bin_mid, ⁵​Time_bin_begin,
## #   ⁶​Time_bin_end, ⁷​Spike_rate

Bear in mind that I am specifically describing all_replicate_data_reorganized[[1]], NOT all_replicate_data_reorganized. This is because all_replicate_data_reorganized will itself be a list that is as long as the number of distinct data files you feed into all of the above. Since there is only 1 example file, all_replicate_data_reorganized is a list that is only 1 entry long at its top level, and within that entry is a set of 80 lists (one per distinct stimulus), each of which contains a stim-specific tibble of data.
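
Because both levels of the list are named (by file basename and by stimulus label, respectively), a specific tibble can also be pulled out by name rather than by index; for example, using the names generated above:

## Extract one stimulus-specific tibble by name
all_replicate_data_reorganized[["2023-02-16_001"]][["45 Deg, 128 Deg/s"]]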

5.3 Data export

It is highly recommended that you export all_replicate_data_reorganized. I generally export it as a separate csv file for each of the following conditions:

  1. Unbinned
  2. Bin size = 10 ms
  3. Bin size = 100 ms

This section will provide code to export each of the above, assuming you used those bin sizes in the previous section. Please be sure to change file naming conventions and other parameters in the event you chose a different bin_size in the previous section.

## Declare export destination
export_path <- "./data/"
## The "condition" will be appended to the file name.

## Export each tibble within all_replicate_data_reorganized
for (i in 1:length(all_replicate_data_reorganized)) {
  print(i)
  dat <-
    all_replicate_data_reorganized[[i]] %>%
    bind_rows()
  write_csv(
    dat,
    file =
      paste0(
        export_path,
        names(all_replicate_data_reorganized)[i],
        condition,
        ".csv"
      )
  )
  rm(dat)
}

Re-run the above chunk of code for each condition you wish to export. For me, this process creates 3 files: 2023-02-16_001_unbinned.csv, 2023-02-16_001_binsize10.csv, and 2023-02-16_001_binsize100.csv.
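
These exports can then be re-imported individually in later sessions, which is what makes the multi-file workflow feasible on machines with limited RAM; a minimal sketch:

## Read one exported condition back in later
## (adjust the basename to match your own files)
dat_10ms <- read_csv("./data/2023-02-16_001_binsize10.csv",
                     show_col_types = FALSE)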

🐢