Extract data points from LONI UCF, in R


Objective: Quickly extract the 3D coordinate tuple and the fourth-dimension value of each point from a LONI UCF.

Notes: Briefly, UCF coordinate point data is stored as a floating-point 3-tuple. UCFs can also accommodate 4-tuples, where the fourth value (the "4D" data point) carries some descriptive measure for the point, such as cortical thickness, a p-value, or a beta value. Reading and writing UCFs is typically done in Java. However, for quick analyses, particularly of the 4D data points, R may be more convenient. The code snippet below extracts these data points from a UCF.

The LONI UCF format is described as follows:

NAME
       ucf - LONI Universal Contour File format

SYNOPSIS
       #include "/usr/local/lib/loni/ucf.h"

DESCRIPTION
       Files  with  the  ucf extension are LONI Universal Contour
       Files.  These files contain structure outline  information
       with the following features.

       o      Outlines are divided into planes called levels.

       o      Levels  may  contain  any  number  of  closed loops
              called contours.

       o      Each contour contains a list of consecutive points.

       o      Each point is a floating point 3-tuple.

FIELDS
       The fields in a ucf, normally in this order, are

            <width=>
            image_width_in_pixels
            of the image from which the outlines were made.
            <height=>
            image_height_in_pixels
            of the image from which the outlines were made.  Nei-
            ther width nor height  should  be  needed,  but  some
            older software expects them.
            <xrange=>
            xlo xhi
            the real space coordinates of the extent in x for the
            volume used to draw the ucf.   Normally  this  is  in
            microns.
            <yrange=>
            ylo yhi
            the real space coordinates of the extent in y for the
            volume used to draw the ucf.   Normally  this  is  in
            microns.
            <zrange=>
            zlo zhi
            the real space coordinates of the extent in z for the
            volume used to draw the ucf.   Normally  this  is  in
            microns.
            <levels>
            number_of_levels
            contained in the ucf.
            <level_number=>
            index_of_level
            Normally  the  distance between the sampling plane of
            the level and the origin.  The first declaration  for
            starting a new level.
            <point_num=>
            number_of_points
            the  number  of  points  in the ensuing contour.  The
            first declaration for starting a new  contour  within
            the current level
            <contour_data=>
            first point x, y, z
            second point x, y, z
            in all, list of point_num points as set be the previ-
            ous declaration.
            <end of level>
            last line of a level.
            <end>
            last line of the ucf.

EXAMPLES
       This is an example of a ucf output by  the  program  maud.
       The outlines were made in the plane of two original images
       from the volume.  One image  had  two  closed  loops,  the
       other had one.

       <width=>
       512
       <height=>
       512
       <xrange=>
       0.000000 185000.000000
       <yrange=>
       0.000000 185000.000000
       <zrange=>
       1200.000000 165613.390625
       <levels>
       2
       <level number=>
       83400.000000
       <point_num=>
       794
       <contour_data=>
       61498.046875 86935.546875 83400.000000
       61895.507813 86935.546875 83400.000000
       [ 791 lines deleted ]
       62292.968750 87333.007813 83400.000000
       <end of level>
       <level number=>
       141600.000000
       <point_num=>
       303
       <contour_data=>
       88127.929688 85743.164063 141600.000000
       88525.390625 85345.703125 141600.000000
       [ 300 lines deleted ]
       90512.695313 90512.695313 141600.000000
       <point_num=>
       292
       <contour_data=>
       96474.609375 79383.789063 141600.000000
       96474.609375 79781.250000 141600.000000
       [ 289 lines deleted ]
       96474.609375 78191.406250 141600.000000
       <end of level>
       <end>

AUTHOR
       Brad Payne
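
Since every field in the description above is a tag line followed by a value line, the header can be read with a simple lookup. The following is a minimal sketch, not part of the original snippet: the helper name ucf_field is made up for illustration, and it assumes the value sits on the line directly after the tag, as in the example file.

```r
# Hypothetical helper (not from the original post): return the numeric
# values on the line that follows a given UCF tag, e.g. "<xrange=>".
ucf_field <- function(ucf_lines, tag) {
  ucf_lines <- trimws(ucf_lines)
  pos <- which(ucf_lines == tag)
  if (length(pos) == 0) return(NULL)
  # Assumption: the value line immediately follows the first tag occurrence.
  as.numeric(strsplit(ucf_lines[pos[1] + 1], " +")[[1]])
}

# Header lines copied from the example in the format description.
header <- c("<width=>", "512", "<height=>", "512",
            "<xrange=>", "0.000000 185000.000000",
            "<yrange=>", "0.000000 185000.000000",
            "<zrange=>", "1200.000000 165613.390625",
            "<levels>", "2")

xrange   <- ucf_field(header, "<xrange=>")
n_levels <- ucf_field(header, "<levels>")
```

Here xrange holds the low and high x extents as a numeric vector of length two, and a tag that does not occur in the file yields NULL.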

Code

# References
# 1. http://stat.ethz.ch/R-manual/R-devel/library/base/html/regex.html
#    Apparently R's regex engine is slightly different from what I'm used to?
# 2. http://biostat.mc.vanderbilt.edu/wiki/pub/Main/SvetlanaEdenRFiles/regExprTalk.pdf
# 3. http://en.wikibooks.org/wiki/R_Programming/Text_Processing#How_can_I_extract_a_pattern_in_a_string_.3F
# 4. http://stackoverflow.com/questions/5237557/extracting-every-nth-element-of-a-vector
# 5. http://heather.cs.ucdavis.edu/~matloff/132/NSPpart.pdf
# 6. http://stackoverflow.com/questions/8865633/r-data-frame-how-to-control-the-conversion-of-matrix-containing-scientific-nota

pmap <- scan("sample.ucf", character(0), sep = "\n")

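# Extract lines with 4D data (ignore metadata, etc.)
# Here, "pmap" contains 67606 character elements. "lines" contains 66049.
# The pattern matches a value in scientific notation followed by a space.
# (A regex cannot start with "?", so the stray leading "?" has been dropped.)
linepos <- grep("[0-9][.][0-9]+E-?[0-9]{2} ", pmap)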
lines <- pmap[linepos]

# strsplit is vectorized, so it can split every line on spaces in one call;
#   no helper function or intermediate data frame is needed.
# The result is an unwieldy list of 66049 character vectors, each containing
#   four elements -- our x, y, z, and p.
# We proceed to unlist, which produces a vector containing 264196 character
#   elements (or 66049 * 4).
vals <- strsplit(lines, split = " ")
unvals <- unlist(vals)

# Now we convert this vector into a 66049x4 character matrix.
m <- matrix(unvals, ncol = 4, byrow = TRUE)

# However, to work with the values, we want them to be numeric, not character.
# Switching the matrix's mode to numeric parses the scientific-notation strings
#   into ordinary numbers. Converting the matrix into a data frame using
#   "as.data.frame" completes this exercise.
mode(m) <- "numeric"
m <- as.data.frame(m)
colnames(m) <- c("x", "y", "z", "p")
str(m)
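
To check the steps above without a real UCF at hand, here is a condensed sketch that runs the same pipeline on a few in-memory lines. The input lines are made up for illustration (they only mimic the scientific-notation layout the regex expects), the name pts is used instead of "lines", and the 0.05 cutoff is an arbitrary example threshold.

```r
# Made-up input for illustration: two tag lines, one range line, and three
# point lines in the scientific-notation layout assumed by the snippet.
pmap <- c("<xrange=>",
          "0.000000 185000.000000",
          "<contour_data=>",
          "6.1498E04 8.6936E04 8.3400E04 1.2000E-02",
          "6.1896E04 8.6936E04 8.3400E04 4.5000E-01",
          "6.2293E04 8.7333E04 8.3400E04 3.0000E-03")

# Keep only lines that contain a value in scientific notation.
pts <- pmap[grep("[0-9][.][0-9]+E-?[0-9]{2} ", pmap)]

# Split on spaces, convert to numeric, and reshape to one row per point.
m <- matrix(as.numeric(unlist(strsplit(pts, " "))), ncol = 4, byrow = TRUE)
m <- as.data.frame(m)
colnames(m) <- c("x", "y", "z", "p")

# Example follow-up: keep the points whose 4th value is below a threshold.
sig <- m[m$p < 0.05, ]
```

Once the values are in a data frame, the usual tools apply, for example summary(m$p) or hist(m$p).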

Notes:

The code above assumes the UCF data to be formatted in scientific notation. If the data is not in scientific notation, use:

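# Extract lines with 4D data (ignore metadata, etc.)
# Here, "pmap" contains 67606 character elements. "lines" contains 66049.
# The pattern matches two decimal numbers separated by a space, so it also
#   catches the three range lines in the header; these are dropped afterwards.
linepos <- grep("[0-9]*[.][0-9]+ -?[0-9]*[.][0-9]+", pmap, perl = FALSE)
lines <- pmap[linepos]
lines <- lines[-(1:3)]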

I haven't found a cleaner working regex yet, hence the need to delete the first three elements of the vector (which hold the x, y, and z range information from the UCF header).
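
One possible way to avoid the manual deletion is to anchor the pattern so that it only matches lines made up of exactly four decimal numbers. This is a sketch under assumptions rather than a tested replacement: it assumes the values are space-separated, plain decimals with a decimal point, and the sample lines below are made up (the fourth column is invented for illustration).

```r
# Made-up input: the three range lines from the header, a level number,
# and two point lines with a fourth value appended.
pmap <- c("<xrange=>", "0.000000 185000.000000",
          "<yrange=>", "0.000000 185000.000000",
          "<zrange=>", "1200.000000 165613.390625",
          "<level number=>", "83400.000000",
          "<contour_data=>",
          "61498.046875 86935.546875 83400.000000 0.012000",
          "61895.507813 86935.546875 83400.000000 0.450000")

# One decimal number, optionally negative.
num <- "-?[0-9]*[.][0-9]+"
# Exactly four such numbers separated by spaces, spanning the whole line.
four <- paste0("^ *", num, "( +", num, "){3} *$")

pts <- pmap[grep(four, pmap)]
```

Because the range lines carry only two numbers and the level line only one, none of them match, so nothing has to be removed by position afterwards.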
