Using SED to Extract Info out of Data
#############################

We will use sed, a common search-and-replace tool, to extract information. Sed is very powerful because its search and replace can also use variables. So you can search for a part, save it to a variable, and return that variable. Using that simple technique and working line by line (as sed does by default), you can "extract data" out of text. So we will be constructing a sed string; it's like an equation for sed that tells it what to find, and what to print instead of what it found.

First let's cover the basics:

If you have used sed, you're probably used to simple sed strings like this, for simple search+replace:

s/what to find/new/

Find the first occurrence of "what to find" and replace it with "new". Sed works one line at a time, and this says to run search+replace once per line (so only the first occurrence on each line is replaced).

s/what to find/new/g

Find every occurrence of "what to find" and replace it with "new". Sed still works one line at a time, but the g tells it to run through the whole line doing search+replace (so it will replace every occurrence of "what to find" on the line).
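
To make the difference concrete, here is a small sketch. The sample line and the variable names are made up for illustration; only the two sed strings come from the text above.

```shell
line='one fish two fish red fish'

# Without g: only the first "fish" on the line is replaced.
first_only=$(printf '%s\n' "$line" | sed 's/fish/cat/')

# With g: every "fish" on the line is replaced.
all=$(printf '%s\n' "$line" | sed 's/fish/cat/g')

printf '%s\n' "$first_only" "$all"
```

The first printed line should still contain two "fish", the second none.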

Extracting is just a weird form of search and replace. We search the string for "what we want to extract", we set a variable to what we want (the backslash 1 variable), then we replace the whole string with that variable, so the string becomes "what we want to extract".

For extracting info we will use the p flag.
p means we want to print the result.

s/what to find/what to print/p
\(what to select\)
The above parentheses are used as "selectors" to set the variable \1 (backslash one).
\1 can then be used in the "new" part.
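
As a minimal sketch of the selector idea, assume a made-up input line of the form "name: alice" (both the line and the variable names are hypothetical):

```shell
line='name: alice'

# \(.*\) saves whatever follows "name: " into \1.
# The pattern matches the whole line, so the whole line is replaced by \1.
captured=$(printf '%s\n' "$line" | sed -n 's/^name: \(.*\)$/\1/p')

printf '%s\n' "$captured"
```

Because the pattern covers the line from ^ to $, nothing but the selected part is left over.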

Useful tools (they select or do things in sed)

Legend: Tool – description

\(\) - used to set variable 1, this is used in the s/this//p part of the sed string
\1 - used to call upon variable 1, this is used in the s//this/p part of the sed string
^ - beginning of the line
$ - end of the line
* - any number of occurrences (including none) of the preceding character
. - any single character
[0-9] - any digit
[a-z] - any lowercase letter
[a-zA-Z] - any letter
[A-Z] - any uppercase letter

So this:

.*
means any number of characters
That will match anything
This will also match anything, specifically any line:
^.*$
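
Combining a few of these tools, a rough sketch might look like the following. The input line "order 4521 shipped" is invented for the example, and the character classes assume an ordinary ASCII-like locale.

```shell
line='order 4521 shipped'

# ^[a-z]*   a run of lowercase letters at the start of the line, then a space
# \([0-9]*\) a run of digits, saved into \1
#  .*$      a space and everything up to the end of the line
number=$(printf '%s\n' "$line" | sed -n 's/^[a-z]* \([0-9]*\) .*$/\1/p')

printf '%s\n' "$number"
```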

 

s/"using the above tools, identify the part of the string we don't want, and set the part that we do want to variable \1 using \(\)"/"we return variable \1, so just put \1"/p

Format

# sed -n 's/^stuff\(stuff\)stuff$/\1/p'

UPDATE: the -n argument makes sed print nothing by default (normally it prints every line of the input, with the changes applied). So now that it prints nothing, the changes we make are useless, right? Nope. The p at the end of the s command asks sed to print the line whenever the substitution succeeded. The end result: we only see the lines that matched. A line that doesn't match the sed statement isn't printed (without -n it would have been). If you used p but not -n, meaning you let sed print everything and also asked it to print on every match, then everything would be printed, and every line that matched would be printed twice. So in these data extraction scenarios, where we only want the data that matches, we start off by printing nothing with the -n argument, and finish off by asking sed to print only what applies with p.
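
The three combinations can be compared side by side. This is only a sketch; the three-line input (id=1, noise, id=2) and the variable names are made up.

```shell
input=$(printf 'id=1\nnoise\nid=2')

# No -n, no p: every line is printed, whether it was changed or not.
plain=$(printf '%s\n' "$input" | sed 's/^id=\(.*\)$/\1/')

# -n with p: only the lines where the substitution happened are printed.
quiet_p=$(printf '%s\n' "$input" | sed -n 's/^id=\(.*\)$/\1/p')

# p without -n: matching lines are printed twice, the others once.
loud_p=$(printf '%s\n' "$input" | sed 's/^id=\(.*\)$/\1/p')

printf -- '--- plain ---\n%s\n--- -n and p ---\n%s\n--- p only ---\n%s\n' \
    "$plain" "$quiet_p" "$loud_p"
```

Only the middle variant drops the "noise" line, which is why the extraction commands in this article all use -n together with p.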

Example 1
##########

Here we show how to extract info at the end of a line (if it were in the middle of the line it wouldn't be much different).

Imagine the following command's output:

# btrfs subvolume list /data

ID 8364 gen 255976 top level 5 path ._share/Religion/.snapshot/c_1395201632
ID 8365 gen 255979 top level 5 path ._share/readydrop/.snapshot/c_1395201634
ID 8380 gen 256756 top level 5 path ._share/Autocars/.snapshot/c_1395288039
ID 8381 gen 256759 top level 5 path ._share/Backup/.snapshot/c_1395288045
ID 8382 gen 256762 top level 5 path ._share/Religion/.snapshot/c_1395288047
ID 8577 gen 266077 top level 5 path Pictures

Below, where you see "<anything>" or 'anything', I mean that it can be replaced with any characters (aside from an ENTER/newline, so it can't be a newline char, but it can be a tab or space or a number or a letter).

Notice that the pattern is
<anything> path <anything>
If we wanted everything after the word path, we would just want the last <anything>

If we only wanted the paths with snapshot in them
<anything> path <anything>snap<anything>
Or it can be (note: there is often more than one way to construct a correct sed expression)
<anything> level 5 path <anything>snap<anything>

And we want the part that is <anything>snap<anything>
Let's say we need all of the names of the paths.

Here we say: find any line that starts with any chars ('anything') and then has space path space (" path ") followed by any chars ('anything'). Record the 'anything' that comes after the " path " into placeholder 1, then take the whole line, replace it with placeholder 1, and print it out.

# btrfs subvolume list /data | sed -n 's/^.* path \(.*\).*$/\1/p'

._share/Religion/.snapshot/c_1395201632
._share/readydrop/.snapshot/c_1395201634
._share/Autocars/.snapshot/c_1395288039
._share/Backup/.snapshot/c_1395288045
._share/Religion/.snapshot/c_1395288047
Pictures

Let's say we need the names of only the paths that are snapshots.

Here we say: find any line that starts with any chars ('anything') and then has space path space (" path ") followed by any chars ('anything') surrounding the word 'snap'. Record everything after the " path " (the 'anything' surrounding 'snap', including 'snap' itself) into placeholder 1, then take the whole line, replace it with placeholder 1, and print it out.

# btrfs subvolume list /data | sed -n 's/^.* path \(.*snap.*\).*$/\1/p'

._share/Religion/.snapshot/c_1395201632
._share/readydrop/.snapshot/c_1395201634
._share/Autocars/.snapshot/c_1395288039
._share/Backup/.snapshot/c_1395288045
._share/Religion/.snapshot/c_1395288047

Or, likewise, take the first command and filter its output with grep:

# btrfs subvolume list /data | sed -n 's/^.* path \(.*\).*$/\1/p' | grep snap

._share/Religion/.snapshot/c_1395201632
._share/readydrop/.snapshot/c_1395201634
._share/Autocars/.snapshot/c_1395288039
._share/Backup/.snapshot/c_1395288045
._share/Religion/.snapshot/c_1395288047
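
If you want to try this without a btrfs volume at hand, the following sketch feeds two lines copied from the listing above through sed. The variable names are made up. It also shows one more hedged variant: sed can restrict a command to lines matching an address such as /snap/, which does the filtering without grep.

```shell
# Two lines copied from the sample listing stand in for the real btrfs output.
sample='ID 8364 gen 255976 top level 5 path ._share/Religion/.snapshot/c_1395201632
ID 8577 gen 266077 top level 5 path Pictures'

# All paths: everything after " path " is saved into \1.
all_paths=$(printf '%s\n' "$sample" | sed -n 's/^.* path \(.*\)$/\1/p')

# Snapshot paths only: the /snap/ address limits the s command to lines containing "snap".
snap_paths=$(printf '%s\n' "$sample" | sed -n '/snap/s/^.* path \(.*\)$/\1/p')

printf '%s\n' "$all_paths"
printf '%s\n' "$snap_paths"
```

Note that the /snap/ address looks at the whole line, not just the path, so it would also select a line where "snap" appeared before " path ".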

 Another way to extract data:

Imagine the text:

# cat input.txt
here comes the sun 123456 dun dun dun /dev/sda dadada
here comes the sun 21 dun dun dun /dev/sdb dadada
here comes the sun 74 dun dun dun /dev/sdc dadada
here comes the sun 433 dun dun dun /dev/sd1 dadada

How can we extract the numbers 123456, 21, 74 and 433? One way is to use the above method: look for the numbers, save them into a variable, and print the variable.

Another way is to use sed for one of the main features it's made for (search and replace): search for the text that is NOT what we want and replace it with nothing, so that what's left is what we want.

So my trick here will be to look for a line that starts with "here comes the sun" and clear out "here comes the sun". Then it will be to look for "dun dun dun /dev/sd. dadada" and replace that with nothing.
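
For completeness, the variable method applied to this input could look roughly like this. The temporary file stands in for input.txt, and the variable names are made up.

```shell
# Recreate the sample input in a temporary file.
tmp=$(mktemp)
cat > "$tmp" <<'EOF'
here comes the sun 123456 dun dun dun /dev/sda dadada
here comes the sun 21 dun dun dun /dev/sdb dadada
here comes the sun 74 dun dun dun /dev/sdc dadada
here comes the sun 433 dun dun dun /dev/sd1 dadada
EOF

# [0-9][0-9]* means "one digit followed by any number of digits"; it is saved into \1.
numbers=$(sed -n 's/^here comes the sun \([0-9][0-9]*\) .*$/\1/p' "$tmp")

printf '%s\n' "$numbers"
rm -f "$tmp"
```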

# one way (simple generic method that does the trick for our input)
cat input.txt | sed -n 's/here comes the sun //p' | sed -n 's/ dun dun dun \/dev\/sd.* dadada//p'

# another way (2 changes. 1st change: be specific that "here comes the sun" is at the beginning of the line by using "^", so that if "here comes the sun" also shows up in the part we keep it won't be cleared out, and be specific in a similar manner about the end of the line by using "$". 2nd change: I want to write "/dev/sd" instead of "\/dev\/sd" because escaping is annoying, so change the sed delimiter to something else like "|"; then you can use "s|change-this|to-this|p" instead of the usual "s/change-this/to-this/p")
cat input.txt | sed -n 's/^here comes the sun //p' | sed -n 's| dun dun dun /dev/sd.* dadada$||p'

# notice the extra space after "sun " and before the first " dun" so that the end results don't have trailing/leading spaces.

# --- side note below - showing something that will not work --- #

# weird thing: sed can combine several seds that are chained via pipes into one (give sed two or more expressions, separated by ";"). ex: sed -n "s/change/tothis/g;s/hi/bye/g". So in our case it would look like this:
cat input.txt | sed -n 's/^here comes the sun //p;s| dun dun dun /dev/sd.* dadada$||p'
# or this way:
cat input.txt | sed -n 's/here comes the sun //p;s/ dun dun dun \/dev\/sd.* dadada//p'
# HOWEVER combining the expressions this way will NOT WORK here. Why?
# each s command has its own p, so every line is printed twice: once after the first substitution (with the tail still attached) and once after the second
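
One possible fix, offered as a sketch rather than the only answer, is to keep the p on the last s command only. The two-line input and the variable names below are made up for the demonstration.

```shell
input='here comes the sun 123456 dun dun dun /dev/sda dadada
here comes the sun 21 dun dun dun /dev/sdb dadada'

# p on both s commands: each line is printed after the first substitution
# and again after the second, so two input lines give four output lines.
doubled=$(printf '%s\n' "$input" | sed -n 's/^here comes the sun //p;s| dun dun dun /dev/sd.* dadada$||p')

# p only on the last s command: each line is printed once, fully trimmed.
fixed=$(printf '%s\n' "$input" | sed -n 's/^here comes the sun //;s| dun dun dun /dev/sd.* dadada$||p')

printf '%s\n' "$doubled"
printf '%s\n' "$fixed"
```

Keep in mind that with this variant a line is printed whenever the second substitution succeeds, even if the first one did not match, so it is not an exact drop-in for the piped version on every input.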

End result:

# both commands - this one:
cat input.txt | sed -n 's/^here comes the sun //p' | sed -n 's| dun dun dun /dev/sd.* dadada$||p'
# or this one:
cat input.txt | sed -n 's/here comes the sun //p' | sed -n 's/ dun dun dun \/dev\/sd.* dadada//p'
# will have this output:
123456
21
74
433

# if we want the output all on one line, space delimited, use the "echo" trick (or any other method), like this:
# first command:
echo $(cat input.txt | sed -n 's/^here comes the sun //p' | sed -n 's| dun dun dun /dev/sd.* dadada$||p')
# or this one:
echo $(cat input.txt | sed -n 's/here comes the sun //p' | sed -n 's/ dun dun dun \/dev\/sd.* dadada//p')
# the output:
123456 21 74 433
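
As one example of the "any other method" mentioned above, here is a sketch using paste, which joins lines with a chosen delimiter. The numbers are supplied with printf here just to keep the example self-contained; in practice they would come from the sed pipeline.

```shell
# Stand-in for the output of the sed pipeline (one number per line).
numbers=$(printf '%s\n' 123456 21 74 433)

# paste -s joins all input lines into one; -d ' ' uses a space as the delimiter.
one_line=$(printf '%s\n' "$numbers" | paste -s -d ' ' -)

printf '%s\n' "$one_line"
```

Unlike the unquoted echo trick, this does not rely on the shell splitting the words, so it should also behave if the extracted values ever contain characters such as "*".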
