
I’m continuing to play with the new ConvertFrom-String cmdlet (available in the latest WMF 5.0 September preview, released yesterday), which makes parsing really easy for both simple and complex output.

This cmdlet supports two modes: Basic Delimited Parsing (see yesterday’s post) and Auto-Generated Example-Driven Parsing, which I will cover in this post.

The Auto-Generated Example-Driven Parsing mode is based on the FlashExtract work from Microsoft Research.

Important: This post is based on the September 2014 preview release of WMF 5.0. This is pre-release software, so this information may change.

The research core of FlashExtract comes from Sumit Gulwani and Vu Le: "FlashExtract: A Framework for Data Extraction by Examples", Vu Le and Sumit Gulwani, PLDI 2014.

Abstract: Various document types that combine model and view (e.g., text files, webpages, spreadsheets) make it easy to organize (possibly hierarchical) data, but make it difficult to extract raw data for any further manipulation or querying. We present a general framework FlashExtract to extract relevant data from semi-structured documents using examples. It includes: (a) an interaction model that allows end-users to give examples to extract various fields and to relate them in a hierarchical organization using structure and sequence constructs. (b) an inductive synthesis algorithm to synthesize the intended program from few examples in any underlying domain-specific language for data extraction that has been built using our specified algebra of few core operators (map, filter, merge, and pair). We describe instantiation of our framework to three different domains: text files, webpages, and spreadsheets. On our benchmark comprising 75 documents, FlashExtract is able to extract intended data using an average of 2.36 examples in 0.84 seconds per field.

NetStat.exe -na

Once again, I will work with the NetStat.exe command line to demo ConvertFrom-String with the TemplateFile parameter. Here is the default output.

Creating the Template File

The -TemplateFile parameter allows us to specify a file that contains the data structure pattern of the information we want to automatically extract.

This file simply needs to have curly braces around the data that you want to extract, along with a property name of your choice.

Asterisk (*): In some cases the property you define can appear multiple times, and you will need to add an asterisk (*) after the property name to indicate that this results in multiple records.

Example: Consider the following line from netstat -na:

  TCP              LISTENING

You would translate it to the following (don’t forget the whitespace):

  {Protocol*:TCP}    {LocalAddress:}          {ForeignAddress:}              {State:LISTENING}

Missing property: In the previous example I defined 4 properties: Protocol, LocalAddress, ForeignAddress and State. However, what if a line does not contain any State information, like this one:

  UDP            *:*

If I do the following:

  {Protocol*:UDP}    {LocalAddress:}          {ForeignAddress:*:*}

State will show the same value as ForeignAddress :-/

This can be solved by adding the State property anyway, with the whitespace regex metacharacter \s as its value:

  {Protocol*:UDP}    {LocalAddress:}            {ForeignAddress:*:*}{State:\s}
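
To try these template rules on a small scale, here is a minimal sketch. It assumes the -TemplateContent parameter of ConvertFrom-String, which takes the template as a string instead of a file, and it reuses two IPv6 lines from the netstat output as examples. The variable names $template and $sample are only illustrative, and because the cmdlet learns its extraction program from the examples, results may vary and more example lines may be needed.

```powershell
# Template: each line is one example record. {Name:value} marks a field,
# and the asterisk on Protocol marks the property that starts a new record.
# The UDP line has no State, so {State:\s} is added as described above.
$template = @'
  {Protocol*:TCP}    {LocalAddress:[::]:135}               {ForeignAddress:[::]:0}                 {State:LISTENING}
  {Protocol*:UDP}    {LocalAddress:[::]:3389}              {ForeignAddress:*:*}{State:\s}
'@

# Sample input (illustrative): lines in the same shape as netstat -na output.
$sample = @'
  TCP    [::]:135               [::]:0                 LISTENING
  UDP    [::]:3389              *:*
  UDP    [::1]:1900             *:*
'@

# ConvertFrom-String ships with Windows PowerShell 5.x; guard the call so the
# sketch does not fail in a session where the cmdlet is missing.
if (Get-Command ConvertFrom-String -ErrorAction SilentlyContinue) {
    $sample |
        ConvertFrom-String -TemplateContent $template |
        Select-Object -Property Protocol, LocalAddress, ForeignAddress, State
}
else {
    Write-Warning 'ConvertFrom-String is not available in this PowerShell session.'
}
```

If a line is not parsed the way you expect, adding it to the template as another example is usually the first thing to try.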

Different types of lines in NetStat -na: Looking at the output of NetStat -na, we can see some very different types of lines: IPv4, IPv6, with and without local/foreign ports, and some without a State property.

You have to identify those possible cases in your template so the cmdlet knows what to do with each of them.

  TCP              LISTENING
  TCP    [::]:135               [::]:0                 LISTENING
  UDP            *:*
  UDP    [::]:3389              *:*
  UDP    [::1]:1900             *:*
  UDP    [fe80::98b9:6db4:216a:2f9f%18]:1900  *:*


Given all the previous elements, here is the TemplateFile:

  {Protocol*:TCP}    {LocalAddress:}          {ForeignAddress:}              {State:LISTENING}
  {Protocol*:TCP}    {LocalAddress:}     {ForeignAddress:}       {State:ESTABLISHED}
  {Protocol*:TCP}    {LocalAddress:[::]:135}               {ForeignAddress:[::]:0}                 {State:LISTENING}
  {Protocol*:UDP}    {LocalAddress:}            {ForeignAddress:*:*}{State:\s}
  {Protocol*:UDP}    {LocalAddress:[::]:3389}              {ForeignAddress:*:*}{State:\s}
  {Protocol*:UDP}    {LocalAddress:[::1]:1900}              {ForeignAddress:*:*}{State:\s}
  {Protocol*:UDP}    {LocalAddress:[fe80::98b9:6db4:216a:2f9f%18]:1900}  {ForeignAddress:*:*}{State:\s}

netstat -na |
    ConvertFrom-String -TemplateFile .\netstat_template.txt |
    Select-Object -Property Protocol, LocalAddress, ForeignAddress, State

This is super cool !!

Extra: Retrieving the ports too !

Now we might want to split the information in LocalAddress so that we have one property for the IP and another for the port, and do the same for ForeignAddress.

The address and the port are separated by a colon (:), so the template needs a separate property on each side of that colon. For example, {LocalAddress:[::]:135} becomes {LocalAddress:[::]}:{LocalPort:135}.

And here is the final Template.

  {Protocol*:TCP}    {LocalAddress:}:{LocalPort:57037}          {ForeignAddress:}:{ForeignPort:0}              {State:LISTENING}
  {Protocol*:TCP}    {LocalAddress:}:{LocalPort:3389}       {ForeignAddress:}:{ForeignPort:51992}     {State:ESTABLISHED}
  {Protocol*:TCP}    {LocalAddress:[::]}:{LocalPort:80}                {ForeignAddress:[::]}:{ForeignPort:0}                 {State:LISTENING}
  {Protocol*:UDP}    {LocalAddress:[::]}:{LocalPort:123}               {ForeignAddress:*}:{ForeignPort:*}                    {State:\s}
  {Protocol*:UDP}    {LocalAddress:[fe80::98b9:6db4:216a:2f9f%18]}:{LocalPort:1900}  {ForeignAddress:*}:{ForeignPort:*}{State:\s}
  {Protocol*:UDP}    {LocalAddress:[fe80::98b9:6db4:216a:2f9f%18]}:{LocalPort:59108}  {ForeignAddress:*}:{ForeignPort:*}{State:\s}


netstat -na |
    ConvertFrom-String -TemplateFile .\netstat_template_with_ports.txt |
    Select-Object -Property Protocol, LocalAddress, LocalPort, ForeignAddress, ForeignPort, State
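
As an alternative to splitting inside the template, the address and port can also be separated after parsing. The sketch below is not the template-based approach used in this post: it is a plain regex split on the last colon of an endpoint string. Split-Endpoint is a hypothetical helper name introduced only for this illustration.

```powershell
# Hypothetical helper (not part of ConvertFrom-String): split an
# "address:port" endpoint string on its LAST colon.
function Split-Endpoint {
    param(
        [Parameter(Mandatory)]
        [string]$Endpoint
    )

    # The greedy (.+) consumes everything up to the last colon, so the colons
    # inside a bracketed IPv6 address stay in the Address group.
    if ($Endpoint -match '^(?<Address>.+):(?<Port>[^:]+)$') {
        [pscustomobject]@{
            Address = $Matches['Address']
            Port    = $Matches['Port']
        }
    }
}

# Endpoint values taken from the netstat lines shown earlier.
Split-Endpoint '[::]:135'
Split-Endpoint '[fe80::98b9:6db4:216a:2f9f%18]:1900'
Split-Endpoint '*:*'
```

Port is kept as a string here because UDP lines report * instead of a number.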
