Extract a number from a string with base R

This problem came across my desk yesterday, and I thought it was interesting and straightforward.

The problem is extracting a “group” number from the middle of a string. The strings looked like:

strings = 
c("Group_1.Pooled", "Group_1.Sample_1", "Group_1.Sample_2", "Group_1.Sample_3", 
"Group_2.Pooled", "Group_2.Sample_1", "Group_2.Sample_2", "Group_2.Sample_3", 
"Group_3.Pooled", "Group_3.Sample_1", "Group_3.Sample_2", "Group_3.Sample_3", 
"Group_4.Pooled", "Group_4.Sample_1", "Group_4.Sample_2", "Group_4.Sample_3", 
"Group_5.Pooled", "Group_5.Sample_1", "Group_5.Sample_2", "Group_5.Sample_3"
)

So, what is the “base R” way to do this? Turns out there are many…

substring()

Since the strings all start with the same bit, and our number of interest is always in the same position (position #7), we can use substring():

substring(strings, 7, 7)
 [1] "1" "1" "1" "1" "2" "2" "2" "2" "3" "3" "3" "3" "4" "4" "4" "4" "5" "5" "5"
[20] "5"
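Note that substring() returns character values. If the group is needed as a number, or as a grouping key, a small follow-up step does it. This is only a sketch: the vector x below is a shortened stand-in for the strings above, and grp and by_group are names made up for illustration.

```r
# Shortened stand-in for the post's `strings` vector
x <- c("Group_1.Pooled", "Group_1.Sample_1",
       "Group_2.Pooled", "Group_2.Sample_1")

# substring() gives "1", "1", "2", "2"; as.integer() makes them numeric
grp <- as.integer(substring(x, 7, 7))

# Use the extracted number as a grouping key
by_group <- split(x, grp)
by_group[["2"]]
```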

strsplit()

The strings all have separators surrounding the number, so we can split each string and then extract the part we are interested in:

sapply(strsplit(strings, "_|\\."), "[[", 2)

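Notice that we wanted to split on a literal ., and an unescaped . in a regular expression matches any character, so we escaped it with \\. (the backslash is doubled because R string literals use up one level of escaping).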
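A slightly stricter variant of the same idea, as a sketch: inside a character class the dot is already literal, so "[_.]" needs no escaping, and vapply() checks that every piece really is a single string. The vector x is a small stand-in for the post's strings; parts and group are illustrative names.

```r
# Small stand-in for the post's `strings` vector
x <- c("Group_1.Pooled", "Group_2.Sample_3")

# Inside [ ], "." is a literal dot, so no backslashes are needed
parts <- strsplit(x, "[_.]")

# vapply() is like sapply(), but errors if a result is not one string
group <- vapply(parts, `[[`, character(1), 2)
group
```

The trade-off is a little more typing in exchange for a guaranteed character vector, which can matter when the extraction sits inside a function.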

regmatches()

The two above rely on the strings being in a set pattern. If the strings were not so consistent, we could use regmatches(). Since the number we are interested in is always the first number in the string, we can use regexpr() to find its position (versus gregexpr(), which would find all the matches, not just the first one).

regmatches(strings, regexpr("[0-9]", strings))
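Two details are worth seeing in action here. The pattern "[0-9]" matches exactly one digit, so "[0-9]+" is the safer choice if a group number could have more than one digit. Also, regmatches() combined with regexpr() silently drops elements that have no match. The inputs below are hypothetical, chosen only to show both effects.

```r
# Hypothetical inputs: a two-digit group, and one string with no number
x <- c("Group_12.Sample_3", "Group_4.Pooled")
y <- c("Group_1.Pooled", "Control")

# First run of digits in each string
first_num <- regmatches(x, regexpr("[0-9]+", x))

# Every run of digits in each string, returned as a list
all_nums <- regmatches(x, gregexpr("[0-9]+", x))

# "Control" has no digits, so the result is shorter than the input
only_matches <- regmatches(y, regexpr("[0-9]+", y))
length(only_matches)
```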

gsub()

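We can just gsub() away all the parts we are not interested in, i.e., the part up to and including the first _, and the part from the literal . to the end. The first alternative uses [^_]+ rather than .+ because .+ is greedy: it would run to the last _ and return the sample number instead of the group number for strings like "Group_2.Sample_1".

gsub("^[^_]+_|\\..+$", "", strings)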

gsub() w/capture group

We can also use capture groups in gsub() to extract the part that we want:

gsub(".*_([0-9])\\..*$", "\\1", strings)

Notice that I specify that the number is surrounded by _ and a literal . - this might be overkill. However, this is the solution I offered because knowing about capture groups together with gsub() is crazy powerful.
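One caveat with the capture-group approach: when the pattern does not match, the string comes back unchanged rather than as a missing value. A minimal sketch of guarding against that, assuming a hypothetical "Control.Pooled" entry and using "[0-9]+" so multi-digit groups are allowed; sub() is used because only one replacement per string is needed.

```r
# Hypothetical inputs: one well-formed string, one without a group number
x <- c("Group_12.Sample_3", "Control.Pooled")
pattern <- ".*_([0-9]+)\\..*$"

# Without a guard, the non-matching string is returned as-is
raw <- sub(pattern, "\\1", x)

# With a guard, non-matching strings become NA
matched <- grepl(pattern, x)
group <- ifelse(matched, sub(pattern, "\\1", x), NA_character_)
as.integer(group)
```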

So why are there so many different ways to do this?

Can’t base R get its act together and just provide a single function? Preferably with a verb name.

Actually, each of the above does something slightly different, and one might be better suited than another, depending on the situation.
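To make the differences concrete, here is a sketch that runs four of the calls from this post, unchanged, on one hypothetical string with a two-digit group number. The variable names are just for illustration.

```r
# Hypothetical string: same layout as before, but with a two-digit group
x <- "Group_12.Sample_3"

by_position <- substring(x, 7, 7)                     # reads only position 7
by_split    <- sapply(strsplit(x, "_|\\."), "[[", 2)  # second piece after splitting
by_match    <- regmatches(x, regexpr("[0-9]", x))     # first single digit
by_capture  <- gsub(".*_([0-9])\\..*$", "\\1", x)     # needs _digit. exactly

c(by_position, by_split, by_match, by_capture)
```

Only the strsplit() version returns "12" here. The positional and single-digit versions return "1", and the capture-group pattern does not match at all, so the input comes back untouched.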

Also, I find it kinda fun to know that I can manipulate strings in so many ways. If one does not work, I can use another.
