Parsing PowerShell – A Follow-Up

Previously, I wrote an article on using AST and a specific ‘gotcha’ that I encountered, as well as how to work around it. If you haven’t read that article, I’d encourage you to do so.

The Trouble with AST in PowerShell

Today, I’m back with a new wrinkle, another workaround, and a bit of a side-tip. For those who have not taken the time to read the article linked above, I’m essentially working on writing a new module that will include ‘extension’ support that ‘sort of’ leverages other modules. As part of this effort, I have various use cases around needing to parse PSM1 content in a meaningful way. The Abstract Syntax Tree, or ‘AST’, is the best method for going about this because it’s how PowerShell itself breaks down files into meaningful blocks, and it’s more reliable than trying to use a bunch of regular expressions. Don’t get me wrong, when I can figure them out, regular expressions are awesome for finding pieces of content in string data or files, but with scripts, modules, and manifests, there are just too many scenarios to be able to catch them all with RegEx, and there is no reason to try when someone else has already solved the problem.

The ‘New’ Gotcha

In the last article, I talked about using AST specifically to look at Comment-Based Help details as part of performing some validation tests. Since then, I’ve moved on to trying to leverage AST to parse a full PSM1 file in an effort to pull out just the functions that are being exported. The ultimate goal of this exercise is to get the details on the parameters, both to ensure that certain ‘required’ parameters are included and to surface the other parameters elsewhere. If you read the last article, you’ll know that I typically keep my various functions separated into their own individual PS1 files, and this caused me some difficulty because of how PowerShell categorizes a ‘Function’ versus a ‘Script’. We solved that challenge by forcing the content we retrieved into a ScriptBlock format, which allowed us to successfully leverage the ‘GetHelpContent’ method to parse the help content into a useful object.

The challenge in this article is a bit different, because the reference ‘extension’ module that I’m dealing with is not split out. Instead, each extension module consists of a single PSM1 that theoretically has only a single exported function defined. Within the parent module, I have a cmdlet defined that has a standard set of parameters that all of the extension modules are supposed to support as a minimum. Extension authors will also be able to add their own parameters, which the main cmdlet also needs to be aware of. In order to accomplish this, we need to efficiently parse the content of the PSM1 to retrieve the details we need, which is a perfect use case for AST.

Theoretically this is a rather simple exercise with only a few steps:

  • Get the raw content of the PSM1
  • Use the System.Management.Automation.Language.Parser class to access the base AST
  • Use the AST FindAll method to retrieve all of the ‘FunctionDefinitionAst’ items
  • Filter the list to our one command and get the ‘ParameterAst’ to get our parameter details

Unfortunately for me, when I tried to do this, I ended up getting InvalidOperation exceptions indicating that it couldn’t find the ‘FunctionDefinitionAst’ type. This was, once again, a case of not having things in a scriptblock type format, though the solution is different this time. In the previous article, we needed to force a scriptblock to even get to our AST in a way that was usable, but this time I was able to use the ‘GetScriptBlock’ method on the parsed AST. Using this allowed me to access the FindAll method to retrieve all of the ‘FunctionDefinitionAst’ objects. Once I filtered the results of this down to the single named function I needed, I then performed one last find at the FunctionDefinition scope to retrieve the ‘ParameterAst’ objects.
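As a rough sketch of the steps described above (not my exact module code), the snippet below parses an in-memory PSM1 string, goes through ‘GetScriptBlock’, finds the function definitions, and pulls the ‘ParameterAst’ objects for one function. The module text and the function names ‘Invoke-MyExtension’ and ‘Get-InternalHelper’ are made up for illustration.

```powershell
# Hypothetical module text standing in for the raw content of an extension PSM1.
$psm1Content = @'
function Invoke-MyExtension {
    [CmdletBinding()]
    param(
        [Parameter(Mandatory)]
        [string]$Name,

        [ValidateSet('Low', 'High')]
        [string]$Priority = 'Low'
    )
    "Running $Name at $Priority priority"
}

function Get-InternalHelper { param([int]$Count) $Count }

Export-ModuleMember -Function Invoke-MyExtension
'@

# Parse the text; nothing in the module is executed by the parser.
$tokens = $null
$parseErrors = $null
$ast = [System.Management.Automation.Language.Parser]::ParseInput(
    $psm1Content, [ref]$tokens, [ref]$parseErrors)

# Convert to a script block, then search its AST for every function definition.
$scriptBlock = $ast.GetScriptBlock()
$functions = $scriptBlock.Ast.FindAll(
    { $args[0] -is [System.Management.Automation.Language.FunctionDefinitionAst] }, $true)

# Narrow down to the one function we care about (name is an assumption).
$target = $functions | Where-Object { $_.Name -eq 'Invoke-MyExtension' }

# Search within that function for its parameters. Passing $true also descends
# into nested script blocks, which is fine here because there are none.
$cFuncParams = @($target.FindAll(
    { $args[0] -is [System.Management.Automation.Language.ParameterAst] }, $true))

$names = $cFuncParams.Name.VariablePath.UserPath
$names
```

Wrapping the FindAll result in @() turns the lazy enumeration into an array, which makes it easier to index and count later on.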

One quick note here before I move on. If you end up looking through the System.Management.Automation.Language documentation page, you’ll notice that there are several items related to parameters: ParamBlockAst, ParameterAst, ParameterBindingResult, and ParameterToken. I’ll be honest in that I have not yet had time to explore all of these, but I wanted to point out that there is a reason I opted to get the parameters in the way that I did. I could have just grabbed the whole ParamBlockAst and navigated from there to retrieve the parameters. The problem with this approach is the substantial number of levels you’ll end up having to navigate downward, though going up seems to be relatively easy.

To illustrate my point, let’s say that we have a set of parameters bound to ‘$cFuncParams’. In order to get to a simple string that contains just the name of each parameter, you need an expression like this: ‘($cFuncParams).Name.VariablePath.UserPath’. This is because nearly everything is layers upon layers of objects. When you start getting into things like retrieving the details on the decorations for each parameter, such as ‘[Parameter()]’ or ‘[ValidateSet()]’, or even just the type designation, it can quickly get overwhelming if you aren’t used to it.
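To give a feel for that layering, here is a small sketch that flattens each ‘ParameterAst’ into a simple object with its name, type, attribute names, and any ‘ValidateSet’ values. The sample function ‘Invoke-Sample’ and the output property names are assumptions for illustration only; the properties being read (‘StaticType’, ‘Attributes’, ‘TypeName’, ‘PositionalArguments’) come from the AST classes themselves.

```powershell
# Hypothetical function text used only to have something to parse.
$functionText = @'
function Invoke-Sample {
    param(
        [Parameter(Mandatory)]
        [string]$Name,

        [ValidateSet('Low', 'High')]
        [string]$Priority = 'Low'
    )
}
'@

$ast = [System.Management.Automation.Language.Parser]::ParseInput(
    $functionText, [ref]$null, [ref]$null)
$cFuncParams = @($ast.FindAll(
    { $args[0] -is [System.Management.Automation.Language.ParameterAst] }, $true))

$paramDetails = foreach ($p in $cFuncParams) {
    # Attributes holds both real attributes ([Parameter()], [ValidateSet()])
    # and type constraints ([string]); keep only the real attributes here.
    $attrs = @($p.Attributes | Where-Object {
        $_ -is [System.Management.Automation.Language.AttributeAst] })
    $validateSet = $attrs | Where-Object { $_.TypeName.Name -eq 'ValidateSet' }

    [pscustomobject]@{
        Name        = $p.Name.VariablePath.UserPath
        Type        = $p.StaticType.FullName
        Attributes  = @($attrs | ForEach-Object { $_.TypeName.Name })
        ValidValues = @($validateSet | ForEach-Object { $_.PositionalArguments.Value })
    }
}

$paramDetails | Format-List
```

Flattening once like this keeps the deep property chains in one place instead of scattering them through the rest of the code.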

The ‘Side-Trick’ – Parsing the PSD1

There are obviously any number of ways to import PSD1 files as meaningful objects. If you are dealing with a generic, non-manifest PSD1, you could use the ‘Import-LocalizedData’ cmdlet. If you are dealing with a more complex PSD1 format, you might opt for the ‘Import-PowerShellDataFile’ cmdlet. For a module manifest, the easiest option is to just use the ‘Test-ModuleManifest’ cmdlet, which results in a very nicely formatted ‘System.Management.Automation.PSModuleInfo’ object.

Unfortunately, there is one tiny little potential issue with all of these cmdlets, which is that they will only process a file path. What this amounts to is that, if you want to use an existing cmdlet and you have your PSD1 data in memory, you’re going to have to write the content out to a file, read it back in, and then clean things up when you’re done. Technically speaking, none of that is overly difficult, but it slows things down to interact with the file system when you don’t really have to. This has even been the subject of an issue submitted to the PowerShell GitHub repository, though it’s not something that appears to be getting much attention as of yet.
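For comparison, the file-based round trip looks roughly like the sketch below. The manifest text and the temp file name are made up for illustration; the point is only to show the extra write, read, and cleanup steps that the cmdlets force on in-memory data.

```powershell
# Hypothetical manifest text that is already in memory.
$psdContent = @'
@{
    ModuleVersion = '1.0.0'
    Description   = 'Hypothetical extension manifest'
}
'@

# Build a throwaway .psd1 path in the temp directory.
$tempPath = Join-Path ([System.IO.Path]::GetTempPath()) ('{0}.psd1' -f [guid]::NewGuid())

try {
    # Write the text out, then read it back through the cmdlet.
    Set-Content -LiteralPath $tempPath -Value $psdContent
    $fromFile = Import-PowerShellDataFile -LiteralPath $tempPath
}
finally {
    # Clean up whether or not the import succeeded.
    Remove-Item -LiteralPath $tempPath -ErrorAction SilentlyContinue
}

$fromFile.ModuleVersion
```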

Now, you might find yourself wondering why it is that I even needed to know this. Another article of recent posting, Read Text Content (w/o Unzipping), essentially outlines the challenge.

Since my little project is leveraging pseudo-modules to extend functionality, and since I’m requiring both a PSM1 and PSD1, plus any supplemental data files one might need, redistribution was a factor. As part of an effort to make this less of a barrier to people writing extensions to the module, I’ve written a set of cmdlets to make things easier. I have one for standing up a scaffold with all the required bits, one for ‘installing’ and registering an extension with the main module, one for removing, and also one for packing a module up for redistribution. The packing one is nothing overly special. All it does is perform validation that all the required bits are present, that the structure is good, and that all the ‘rules’ are being followed, before wrapping everything up into a Zip file. The trick, in this case, involves the installation and registration process.

My goal with the cmdlet was to provide as much flexibility as I could, so I wanted to support the ability to provide a directory, a file item, or a path string, and to be able to handle a Zip file as well. While I could have just extracted the Zip and moved on with my life, I was concerned that someone might be able to leverage that to inject something malicious into the context of the main module. To offset this, though not solve it yet, I perform an additional verification at install time to make sure everything is valid, as well as to get the required metadata to ‘register’ the extension. To do this, I needed an easy way to consistently parse the PSD1 file without having to branch off in different directions based on the input, and AST ended up being the answer.

Just as with the PS1/PSM1 files, you start off by engaging the parser, and then you find an instance of the ‘HashtableAst’, as demonstrated below.

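# Read the manifest as one string (-Raw) rather than an array of lines
$psdContent = Get-Content .\MyTestManifest.psd1 -Raw
# Parse the text into an AST; tokens and parse errors are discarded here
$psdAst = [System.Management.Automation.Language.Parser]::ParseInput($psdContent, [ref]$null, [ref]$null)
# Find the first hashtable and convert it to data without executing any code
$psdData = ($psdAst.Find({ $args[0] -is [System.Management.Automation.Language.HashtableAst] }, $false)).SafeGetValue()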

Walking through the steps above, the first thing we do is get the raw content of the file. As always when dealing with AST, it has to be the raw content so that the parser can interpret it properly. If you get the content via Get-Content without the -Raw switch, you get back an array of strings, one per line, which is not what the parser expects. The parser needs the entirety of the content in a single string object.
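If you want to see the difference for yourself, a quick check like the one below shows what each form of Get-Content returns. The three-line file it writes is a throwaway example, not part of my module.

```powershell
# Write a small three-line hypothetical data file to the temp directory.
$tempPath = Join-Path ([System.IO.Path]::GetTempPath()) ('{0}.psd1' -f [guid]::NewGuid())
Set-Content -LiteralPath $tempPath -Value '@{', "    ModuleVersion = '1.0.0'", '}'

try {
    $asLines = Get-Content -LiteralPath $tempPath        # one string per line
    $asText  = Get-Content -LiteralPath $tempPath -Raw   # the whole file as one string
}
finally {
    Remove-Item -LiteralPath $tempPath -ErrorAction SilentlyContinue
}

$asLines.GetType().Name   # an array type
$asLines.Count            # number of lines read
$asText.GetType().Name    # String
```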

Once we have our content, we feed it into the parser to create the AST object and allow further processing. If you are doing this in a script, function, or module, you can shorten the amount of text required by adding ‘using namespace System.Management.Automation.Language’ to the very top of your file (it has to be the first line, or lines, before even a function definition). Doing this will enable you to use just [Parser] or [HashtableAst] instead of the full class path.
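Here is a sketch of the same three steps with the namespace shortcut applied. To keep it self-contained it parses a made-up manifest held in a here-string instead of reading a file; the keys and values are purely illustrative.

```powershell
using namespace System.Management.Automation.Language

# Hypothetical manifest text already held in memory.
$psdContent = @'
@{
    ModuleVersion     = '1.0.0'
    RootModule        = 'MyExtension.psm1'
    FunctionsToExport = @('Invoke-MyExtension')
}
'@

# Same steps as before, with the short type names.
$psdAst = [Parser]::ParseInput($psdContent, [ref]$null, [ref]$null)
$hashtableAst = $psdAst.Find({ $args[0] -is [HashtableAst] }, $false)
$psdData = $hashtableAst.SafeGetValue()

$psdData.ModuleVersion
$psdData.FunctionsToExport
```

Because the input is just a string, the same code works whether the text came from a file on disk, an entry inside a Zip, or anywhere else.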

The last step is what transforms our string into an object in a manner that is supposed to be ‘safe’, meaning that no code in the file gets executed. This is a key thing, as there is technically another method of ingesting a PSD1, which is to use Invoke-Expression (please do not do this). The downside there is that Invoke-Expression doesn’t really care what’s in the file and will run any commands it encounters. If all you have is a simple module manifest, then you technically end up with an object like you would expect, but if someone slipped something into the PSD1, you could get yourself into trouble. The other reason to use this method is that it’s what the ‘Import-PowerShellDataFile’ cmdlet itself uses during imports. You can check this out for yourself if you like, by looking at around line 70 in the source code found here.
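To see the ‘safe’ part in action, the sketch below runs the same conversion against one plain manifest and one that has a command slipped into it. ‘ConvertFrom-PsdText’ is a hypothetical helper name I’m using here for illustration, and the embedded command is a harmless Get-Date standing in for something nastier.

```powershell
# Hypothetical helper wrapping the parse / find / SafeGetValue steps.
function ConvertFrom-PsdText {
    param([string]$Text)

    $ast = [System.Management.Automation.Language.Parser]::ParseInput(
        $Text, [ref]$null, [ref]$null)
    $hashtableAst = $ast.Find(
        { $args[0] -is [System.Management.Automation.Language.HashtableAst] }, $false)
    $hashtableAst.SafeGetValue()
}

$safeText   = "@{ ModuleVersion = '1.0.0' }"
$unsafeText = "@{ ModuleVersion = (Get-Date) }"   # contains a command call

# Plain data converts without complaint.
$safeData = ConvertFrom-PsdText -Text $safeText

# Data containing a command is refused rather than executed.
$rejected = $false
try {
    $null = ConvertFrom-PsdText -Text $unsafeText
}
catch {
    $rejected = $true
}

$safeData.ModuleVersion
$rejected
```

An install-time validation step can treat that exception as a signal that the manifest contains something other than plain data.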

That’s all for this round, and probably makes some sort of a record for my shortest post yet. Until next time, stay fresh cheese bags!!