Extract data points from LONI UCF, in R

Objective: Quickly extract the 3D coordinate tuple and 4th-dimensional value from a LONI UCF

Notes: Briefly, UCF coordinate point data is stored as a floating-point 3-tuple. UCFs can also accommodate 4-tuples, where the fourth value carries some descriptive measure for the point, such as cortical thickness, a p-value, or a beta value. Reading and writing UCFs is typically done in Java. However, for quick analyses, particularly of the fourth-dimension values, R may be more convenient. The code snippet below extracts these data points from UCFs.

The LONI UCF format is described as follows:

       ucf - LONI Universal Contour File format

       #include "/usr/local/lib/loni/ucf.h"

       Files  with  the  ucf extension are LONI Universal Contour
       Files.  These files contain structure outline  information
       with the following features.

       o      Outlines are divided into planes called levels.

       o      Levels  may  contain  any  number  of  closed loops
              called contours.

       o      Each contour contains a list of consecutive points.

       o      Each point is a floating point 3-tuple.

       The fields in a ucf, normally in this order, are

            of the image from which the outlines were made.
            of the image from which the outlines were made.  Nei-
            ther width nor height  should  be  needed,  but  some
            older software expects them.
            xlo xhi
            the real space coordinates of the extent in x for the
            volume used to draw the ucf.   Normally  this  is  in
            ylo yhi
            the real space coordinates of the extent in y for the
            volume used to draw the ucf.   Normally  this  is  in
            zlo zhi
            the real space coordinates of the extent in z for the
            volume used to draw the ucf.   Normally  this  is  in
            contained in the ucf.
            Normally  the  distance between the sampling plane of
            the level and the origin.  The first declaration  for
            starting a new level.
            the  number  of  points  in the ensuing contour.  The
            first declaration for starting a new  contour  within
            the current level
            first point x, y, z
            second point x, y, z
            in all, list of point_num points as set be the previ-
            ous declaration.
            <end of level>
            last line of a level.
            last line of the ucf.

       This is an example of a ucf output by  the  program  maud.
       The outlines were made in the plane of two original images
       from the volume.  One image  had  two  closed  loops,  the
       other had one.

       0.000000 185000.000000
       0.000000 185000.000000
       1200.000000 165613.390625
       <level number=>
       61498.046875 86935.546875 83400.000000
       61895.507813 86935.546875 83400.000000
       [ 791 lines deleted ]
       62292.968750 87333.007813 83400.000000
       <end of level>
       <level number=>
       88127.929688 85743.164063 141600.000000
       88525.390625 85345.703125 141600.000000
       [ 300 lines deleted ]
       90512.695313 90512.695313 141600.000000
       96474.609375 79383.789063 141600.000000
       96474.609375 79781.250000 141600.000000
       [ 289 lines deleted ]
       96474.609375 78191.406250 141600.000000
       <end of level>

       Brad Payne
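
The level structure described above can also be recovered in R. The following is a minimal sketch, assuming the point lines sit between "<level number=>" and "<end of level>" markers exactly as in the example; the vector "ucf_lines" is a hypothetical miniature file whose coordinates are copied from that example, and the names "level_id", "is_point", and "pts" are made up for illustration.

```r
# Hypothetical miniature UCF body modeled on the man page example
# (coordinates copied from it; the exact layout is an assumption).
ucf_lines <- c(
  "<level number=>",
  "61498.046875 86935.546875 83400.000000",
  "61895.507813 86935.546875 83400.000000",
  "<end of level>",
  "<level number=>",
  "88127.929688 85743.164063 141600.000000",
  "<end of level>"
)

# Each "<level number=>" line opens a new level, so a running count of
# those markers labels every line with the index of its level.
level_id <- cumsum(grepl("<level number=>", ucf_lines, fixed = TRUE))

# Keep only lines made of exactly three whitespace-separated numbers.
is_point <- grepl("^\\s*(-?[0-9.]+\\s+){2}-?[0-9.]+\\s*$", ucf_lines, perl = TRUE)

# Parse the point lines and attach the level index to each point.
pts <- read.table(text = ucf_lines[is_point], col.names = c("x", "y", "z"))
pts$level <- level_id[is_point]

# Number of points found in each level
table(pts$level)
```

Counting the opening markers with cumsum avoids an explicit loop over the file, at the cost of assuming that every level starts with the same marker text.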


# References
# 1. http://stat.ethz.ch/R-manual/R-devel/library/base/html/regex.html
#    Apparently R's regex engine is slightly different from what I'm used to?
# 2. http://biostat.mc.vanderbilt.edu/wiki/pub/Main/SvetlanaEdenRFiles/regExprTalk.pdf
# 3. http://en.wikibooks.org/wiki/R_Programming/Text_Processing#How_can_I_extract_a_pattern_in_a_string_.3F
# 4. http://stackoverflow.com/questions/5237557/extracting-every-nth-element-of-a-vector
# 5. http://heather.cs.ucdavis.edu/~matloff/132/NSPpart.pdf
# 6. http://stackoverflow.com/questions/8865633/r-data-frame-how-to-control-the-conversion-of-matrix-containing-scientific-nota

pmap <- scan("sample.ucf", character(0), sep = "\n")

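# Extract lines with 4D data (ignore metadata, etc.)
# Here, "pmap" contains 67606 character elements. "lines" contains 66049.
# The pattern looks for a number in scientific notation (a digit, a decimal
#   point, more digits, "E", an optional sign, two exponent digits) followed
#   by a space.
linepos <- grep("[0-9][.][0-9]+E[-+]?[0-9]{2} ", pmap)
lines <- pmap[linepos]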

# Split each string using a space as the delimiter. strsplit is vectorized, so
#   it can process every line in a single call.
# This creates a list of 66049 character vectors, each containing four
#   elements -- our x, y, z, and p.
# We proceed to unlist, which produces a vector containing 264196 character
#   elements (or 66049 * 4).
vals <- strsplit(lines, split = " ", fixed = TRUE)
unvals <- unlist(vals)

# Now we convert this vector into a 66049x4 character matrix.
m <- matrix(unvals, ncol = 4, byrow = TRUE)

# However, to work with the values, we want them to be numeric, not character.
# Switching the matrix's mode to numeric parses the scientific-notation strings
#   into ordinary numbers. Converting the matrix into a data frame using
#   "as.data.frame" completes this exercise.
mode(m) <- "numeric"
m <- as.data.frame(m)
colnames(m) <- c("x", "y", "z", "p")
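
As a possible shortcut for the split, reshape, and convert steps, base R's read.table can parse a character vector directly through its "text" argument. This is only a sketch: the two records in "lines" below are made up for illustration and are not taken from a real UCF, and it assumes every record holds four whitespace-separated values.

```r
# Stand-in for the "lines" vector; these two records are invented
# for illustration only.
lines <- c("6.1498E04 8.6935E04 8.3400E04 1.2000E-02",
           "6.1895E04 8.6935E04 8.3400E04 4.5000E-01")

# read.table() splits on whitespace and converts each column to numeric,
# including values written in scientific notation.
m <- read.table(text = lines, col.names = c("x", "y", "z", "p"))

str(m)        # structure of the resulting data frame
summary(m$p)  # quick look at the 4th-dimension values
```

The trade-off is that read.table stops with an error if a line has a different number of fields, so it is less forgiving than the manual split when stray lines slip through the grep step.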


The code above assumes the UCF data to be formatted in scientific notation. If the data is not in scientific notation, use:

# Extract lines with 4D data (ignore metadata, etc.)
# Here, "pmap" contains 67606 character elements. "lines" contains 66049.
linepos <- grep("[0-9]*[.][0-9]+[ ]?[0-9]*[.][0-9]+", pmap, perl = FALSE)
lines <- pmap[linepos]
# Drop the first three matches
lines <- lines[-(1:3)]

I haven't found a cleaner regex that works yet, hence the need to delete the first three elements of the vector (which hold the x, y, z range info from the UCF header).
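
One possible way around the manual deletion is to anchor the pattern to the whole line and require exactly four numeric fields, so that the two-value range lines never match. This is an untested-on-real-data sketch: the "pmap" vector below is a made-up sample (range lines copied from the man page example, 4D records invented), and the helper names "num" and "four_fields" are hypothetical.

```r
# Made-up sample: three header-style range lines, a tag line, and
# two invented 4D records.
pmap <- c("0.000000 185000.000000",
          "0.000000 185000.000000",
          "1200.000000 165613.390625",
          "<level number=>",
          "61498.046875 86935.546875 83400.000000 0.012000",
          "61895.507813 86935.546875 83400.000000 0.450000")

# One decimal number, optionally signed
num <- "-?[0-9]*[.][0-9]+"

# A full line consisting of exactly four such numbers
four_fields <- paste0("^\\s*", num, "(\\s+", num, "){3}\\s*$")

# value = TRUE returns the matching lines instead of their positions
lines <- grep(four_fields, pmap, perl = TRUE, value = TRUE)
```

Because the pattern counts fields instead of relying on position, it should keep working if the header grows or shrinks, provided no header line happens to carry four decimal values.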
