The Trouble with AST in PowerShell

OK, if you are new to PowerShell, you aren’t going to have a clue what this ‘AST’ thing even is. AST has been covered extensively by others in the past, so I won’t go into details here (just Google PowerShell and AST). In brief, AST stands for Abstract Syntax Tree, and its purpose is to group ‘tokens’ into meaningful structures within your code. The easiest way to think of it is sort of like a module manifest, but for your code. It’s the trick that Microsoft leverages with a lot of their earlier cmdlets focused on verifying your structure, and it’s part of how the underlying PowerShell engine figures out how to break down your scripts and functions to know what’s a parameter versus a begin/process/end block. It’s also how the Get-Help cmdlet figures out what your help info is when you use Comment-Based Help (CBH).
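To make the token-versus-tree distinction a little more concrete, here is a minimal sketch that parses a one-line snippet and lists both the flat tokens and the AST nodes built on top of them. The snippet and the variable names are made up purely for illustration.

```powershell
# Hypothetical one-line snippet, used only to illustrate tokens vs. AST nodes.
$snippet = '$total = 1 + 2'

# ParseInput fills these two variables through its [ref] parameters.
$tokens = $null
$parseErrors = $null
$ast = [System.Management.Automation.Language.Parser]::ParseInput($snippet, [ref]$tokens, [ref]$parseErrors)

# The flat list of token kinds produced by the tokenizer.
$tokenKinds = $tokens | ForEach-Object { $_.Kind }

# Every node in the tree; the predicate { $true } matches all nodes.
$nodeTypes = $ast.FindAll({ $true }, $true) | ForEach-Object { $_.GetType().Name }

$tokenKinds
$nodeTypes
```

The first list is just a sequence of pieces, while the second shows those pieces grouped into structures such as an assignment statement wrapping an expression.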

So, if I’m not going to be writing an extensive post about AST, you’re probably wondering what this post is all about then. Well, just hold your horses cause I’m getting to it.

If you’ve spent any amount of time building your own PowerShell scripts, then you’ve hopefully at some point also graduated to developing your own script-based modules, and if you haven’t, why not? Assuming that you have, you’ve also hopefully adopted a real IDE tool, like VSCode (please, PLEASE, stop using ISE for any serious work). I myself have recently embarked on the creation of yet another module, and this one may even get published to the Gallery. The purpose of the module isn’t really important here, but some of the things I’m trying to do are how I discovered the problem I’ll be talking about here.

There are essentially two ways to create a script module: monolithic and separated. In the former model, all of your functions are defined within the .PSM1 file for the module, whereas in the latter, each function is defined in a separate file with a .PS1 extension. As part of your development, you are hopefully also providing a robust set of Help information, which of course you are, because you are a good and upstanding community member. Without good help info, others can’t use the tools we make properly, if at all, without having to dig in and look at the code. Even if you are just writing a script, you should be providing in-line help, in my opinion. The potential gotcha, however, is providing that help in a way that it can be queried properly via the Get-Help cmdlet.

In my current endeavor, I’m using the separated approach. Each individual .PS1 has a full set of Comment-Based Help. I’ve also started (finally) really ramping up on Pester. Part of the plan for this new module is that I will write a set of ‘core’ functionalities, and then others will be able to extend those functionalities in a prescribed manner. I’m modeling the module a bit after the Microsoft.PowerShell.SecretManagement module, which by itself does nothing but define a framework that others can then leverage via Extension modules to talk to individual vaulting solutions. In my case however, I don’t want anyone to necessarily have to write and release their own full module, so I’m providing an install kind of framework. Of course, that means I have to validate what’s coming into the tool, to ensure it has what I need, which means that, in addition to my Pester tests, I also need to perform ad-hoc validations…which is where the trouble comes in.

Sure, I could just use a regular expression to pull the whole comment block, but then I’d have to parse that block as string data. Again, I could do that, but this is PowerShell we’re talking about here, not Bash or even Python (sorry, I couldn’t help myself), so we should be working smarter, not harder, and that means objects. In theory, AST should make getting the help info and serializing it into an object a simple task. To illustrate the point, below is one of my utility functions, a simple function with some CBH. Before anyone comments on the fact that I could just use ‘-join’, you are correct. I originally wrote the function because I had unwittingly been using a PSCX cmdlet in a LOT of places within a fairly substantial module, and then wondered why it wouldn’t work on any other machines. The utility function saved me from having to go back and modify every one of those calls.

function Join-String {
<#
    .SYNOPSIS
        Joins two strings together using a specified separator

    .DESCRIPTION
        Joins two strings together using a specified separator

    .PARAMETER Strings
        Two or more strings to be joined passed in as an array

    .PARAMETER Separator
        The value or character that will be used to join the strings

    .EXAMPLE
        $array = Join-String -Strings 'value1','value2','value3' -Separator ','

        The above creates a single string from the three provided values, separated by a comma

    .INPUTS
        Inputs to this cmdlet (if any)

    .OUTPUTS
        Output from this cmdlet (if any)

    .NOTES
        KEYWORDS: PowerShell, Cmdlet

        Author: Topher Whitfield

        VERSIONS HISTORY
        0.1.0 - 2023-04-13 - New private function

    .LINK
        https://deloitte.com
#>
    [CmdletBinding()]
    [OutputType([String])]
    Param (
        [Parameter(Mandatory=$true,Position=0,ValueFromPipeline=$true,ValueFromPipelineByPropertyName=$true)]
        [string[]]
        $Strings,

        [Parameter()]
        [string]
        $Separator
    )

    begin {
    }

    process {
        if($Separator){
            $result = $Strings -join $Separator
        }else{
            $result = -join $Strings
        }
    }

    end {
        return $result
    }
}

There are a few ways to turn this into an AST. The [System.Management.Automation.Language.Parser] class offers the ParseFile and ParseInput methods, depending on whether you want to supply just a file path or the content itself. The below code demonstrates each approach, assuming the function above is stored in a file called ‘join-string.ps1’ in D:\tmp. I also show an alternative third method, which I personally prefer, as there are fewer ambiguous elements.

# ParseFile
$joinAST = [System.Management.Automation.Language.Parser]::ParseFile('D:\tmp\join-string.ps1', [ref]$null, [ref]$null)

# ParseInput
$joinAST = [System.Management.Automation.Language.Parser]::ParseInput($(Get-Content D:\tmp\join-string.ps1 -raw), [ref]$null, [ref]$null)

# ScriptBlock
$joinAST = [scriptblock]::Create((Get-Content D:\tmp\join-string.ps1 -raw)).Ast
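The examples above pass [ref]$null for the tokens and errors arguments, which throws that information away. As a minimal sketch, assuming you want to check incoming code for syntax problems before going any further, you can pass real variables instead. The broken source text and the variable names here are hypothetical.

```powershell
# Hypothetical source with a deliberate syntax error (missing closing brace).
$brokenSource = 'function Test-Broken {'

# Real variables instead of [ref]$null, so the parser can hand results back.
$tokens = $null
$parseErrors = $null
$null = [System.Management.Automation.Language.Parser]::ParseInput($brokenSource, [ref]$tokens, [ref]$parseErrors)

# $parseErrors holds ParseError objects; an empty collection means a clean parse.
if ($parseErrors.Count -gt 0) {
    $parseErrors | ForEach-Object { "Line $($_.Extent.StartLineNumber): $($_.Message)" }
}
```

By contrast, [scriptblock]::Create() throws when the text doesn’t parse, so with that approach you would need a try/catch rather than an error collection.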

Once we have our object, as one typically would with new things in PowerShell, we should be passing the object to ‘Get-Member’ to interrogate it. We can, of course, simply dump it to ‘Format-List *’ to see the property values, but then we would miss out on discovering the methods, which matters down the road.

$joinast | gm


   TypeName: System.Management.Automation.Language.ScriptBlockAst

Name               MemberType Definition
----               ---------- ----------
Copy               Method     System.Management.Automation.Language.Ast Copy()
Equals             Method     bool Equals(System.Object obj)
Find               Method     System.Management.Automation.Language.Ast Find(System.Func[System.Management.Auto...
FindAll            Method     System.Collections.Generic.IEnumerable[System.Management.Automation.Language.Ast]...
GetHashCode        Method     int GetHashCode()
GetHelpContent     Method     System.Management.Automation.Language.CommentHelpInfo GetHelpContent()
GetScriptBlock     Method     scriptblock GetScriptBlock()
GetType            Method     type GetType()
SafeGetValue       Method     System.Object SafeGetValue()
ToString           Method     string ToString()
Visit              Method     System.Object Visit(System.Management.Automation.Language.ICustomAstVisitor astVi...
Attributes         Property   System.Collections.ObjectModel.ReadOnlyCollection[System.Management.Automation.La...
BeginBlock         Property   System.Management.Automation.Language.NamedBlockAst BeginBlock {get;}
DynamicParamBlock  Property   System.Management.Automation.Language.NamedBlockAst DynamicParamBlock {get;}
EndBlock           Property   System.Management.Automation.Language.NamedBlockAst EndBlock {get;}
Extent             Property   System.Management.Automation.Language.IScriptExtent Extent {get;}
ParamBlock         Property   System.Management.Automation.Language.ParamBlockAst ParamBlock {get;}
Parent             Property   System.Management.Automation.Language.Ast Parent {get;}
ProcessBlock       Property   System.Management.Automation.Language.NamedBlockAst ProcessBlock {get;}
ScriptRequirements Property   System.Management.Automation.Language.ScriptRequirements ScriptRequirements {get;}
UsingStatements    Property   System.Collections.ObjectModel.ReadOnlyCollection[System.Management.Automation.La...

As you can clearly see, there are all kinds of juicy bits we should be able to use to verify elements of our function, but this is where things start to go wrong. As you test each item, you’ll notice that pretty much everything except Extent and EndBlock is empty. Hrmm…well, perhaps we should start with the help content, since there’s a method for that. Unfortunately, executing the method returns nothing, even though we have properly formatted CBH. It took quite a bit of digging, because it seems that absolutely no one has ever blogged about this topic, but the answer can sort of be found in the about_Comment_Based_Help topic, which I reviewed in desperation trying to ensure I had the correct syntax. Below is the same example that they provide in the help as the first entry in the syntax section.

function Get-Function
{
<#
.<help keyword>
<help content>
#>

  # function logic
}

As you can quickly see, this is the format that I have followed, so my syntax is correct for a function, and I have all of the required keywords present. If you scroll down to the next section however, regarding syntax for scripts, you’ll see a different syntax.

<#
.<help keyword>
<help content>
#>

function Get-Function { }

Unlike the function example, note that here the CBH sits outside of the function definition rather than right after the opening brace. The reason our values are not parsed correctly comes down to how, in this context, Microsoft identifies something as a ‘script’ or a ‘function’. When you have a monolithic module, with a bunch of functions defined inside of a .PSM1, everything is viewed as a function. When you parse a .PS1 file on the other hand, or feed in the raw content, the top-level object you get back is a ‘Script’ (the ScriptBlockAst shown in the Get-Member output above), and its ‘GetHelpContent’ method only recognizes help written in the script position. The ‘function join-string {’ line and the closing ‘}’ push our CBH down into the function body, where the script-level method doesn’t look. If we remove those two lines and try again, we find that we get a properly parsed file, and we are able to use the ‘GetHelpContent’ method to correctly serialize the help content. Obviously this isn’t something I would be interested in doing by hand for all of my files, just to perform the validations I want to perform, or even just for the purposes of executing Pester tests. Fortunately PowerShell provides, though with a few additional steps, as demonstrated below.

$funcContent = (Get-Content D:\tmp\join-string.ps1).Where({$_ -ne "" -and $_ -notlike "using namespace*"})
$funcTrim = $funcContent | Select-Object -Index (1..($funcContent.Count - 2)) | Out-String
$funcAST = ([scriptblock]::Create($funcTrim)).Ast

In the first line, you’ll notice that we are not using the ‘-Raw’ switch. When you run ‘Get-Content’ normally, each line becomes an individual string object. If you had a $content variable and did $content.Count, you’d get back a count of the lines in the file. When you add the ‘-Raw’ switch however, you instead pull back a single string object, which can be seen by again adding ‘.Count’ and seeing that the value is ‘1’, no matter how many lines we have. Since I don’t want to parse a giant string, I leave it as individual objects, then leverage the ‘.Where()’ method to filter out any blank lines or ‘using’ statements, which are required to be at the top of a file.
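The difference between the two modes of ‘Get-Content’ is easy to check for yourself. Here is a small sketch that writes a throwaway three-line file to the temp folder (the content is made up) and reads it back both ways.

```powershell
# Write a three-line sample to a temporary file (hypothetical content).
$tempFile = [System.IO.Path]::GetTempFileName()
Set-Content -Path $tempFile -Value 'line one', 'line two', 'line three'

$asLines  = Get-Content -Path $tempFile        # array with one string per line
$asString = Get-Content -Path $tempFile -Raw   # one string containing the newlines

$asLines.Count    # 3
$asString.Count   # 1

# Clean up the throwaway file.
Remove-Item -Path $tempFile
```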

For the second line, provided you have ‘function <name> {’ on a single line at the top and the closing ‘}’ on its own line at the bottom, what we are doing is selecting everything from line two down to the second-to-last line. As an aside, a little-known related trick is that you can access the last item in an array by using ‘[-1]’. The last bit, passing the results to ‘Out-String’, joins the remaining lines back into a single string. This is a required step, as the parser expects a single string. If you pass it an array of string objects, it will throw up on you.
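The index arithmetic is easier to follow on a tiny array. This is just a sketch with a made-up four-line function standing in for the real file content.

```powershell
# Hypothetical four-line function, one array element per line.
$lines = 'function Get-Sample {', '    param($Name)', '    "Hello, $Name"', '}'

$lines[0]     # the 'function <name> {' wrapper line
$lines[-1]    # the closing brace, via the negative index

# Indexes 1 through Count - 2 skip the first and last lines.
$body = $lines | Select-Object -Index (1..($lines.Count - 2))

# Out-String turns the remaining lines into one multi-line string.
$bodyText = $body | Out-String
$bodyText -is [string]
```

One thing to watch for: if the content only has two lines, the range becomes 1..0, which PowerShell counts downward rather than treating as empty, so this approach assumes there is an actual body between the wrapper lines.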

Obviously in the last line, we are doing the same thing we did before, by creating a scriptblock and then accessing the AST property. If you were to proceed to the next logical step and run the ‘GetHelpContent’ method, you would find that you now have an object with properties representing each section of the help. This allows you to then check the values in a specific section, such as verifying the Synopsis has a length greater than 50, for example, without having to use a bunch of complicated regex…well, complicated for me anyway.
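As a rough sketch of what such a check could look like, the example below parses a small made-up script that already has its help in the script position, then tests a few sections of the resulting help object. The script text, the rules, and the $problems variable are all illustrative assumptions, not the actual checks from my module.

```powershell
# Hypothetical script text with help in the 'script' position (no function wrapper).
$scriptText = @'
<#
    .SYNOPSIS
        Joins two or more strings together using a specified separator

    .PARAMETER Strings
        Two or more strings to be joined passed in as an array

    .EXAMPLE
        Join-String -Strings 'a','b' -Separator ','
#>
param (
    [string[]]$Strings,
    [string]$Separator
)
$Strings -join $Separator
'@

$help = [scriptblock]::Create($scriptText).Ast.GetHelpContent()

# Collect every problem instead of stopping at the first one.
$problems = @()
if ([string]::IsNullOrWhiteSpace($help.Synopsis)) {
    $problems += 'Synopsis is missing'
} elseif ($help.Synopsis.Trim().Length -le 50) {
    $problems += 'Synopsis is 50 characters or fewer'
}
if ($help.Examples.Count -lt 1) {
    $problems += 'No examples found'
}
# -notcontains is case-insensitive, so the casing of the stored key doesn't matter.
if ($help.Parameters.Keys -notcontains 'Strings') {
    $problems += 'Parameter Strings is undocumented'
}

$problems
```

An empty $problems array means the help passed every rule, which drops neatly into a Pester assertion.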

As a possibly interesting side note, the Microsoft ‘Get-Help’ cmdlet clearly has some fallback behaviors of some sort, which I also discovered during this little side quest. If you just wanted to see the help, you could run ‘Get-Help D:\tmp\join-string.ps1’ and it would parse the content without issue and return properly formatted help, and it still supports all of the normal switches (-Examples, -Full, -Parameter, etc.). At some point I may try looking at the source code to try and figure out what they are doing, but for now I have enough to run my Pester validation tests to ensure I don’t leave out any help details, and that’s enough for me.
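For completeness, here is a sketch of a different route to the same kind of help object that leaves the function wrapper in place. It relies on the ‘Find’ method from the Get-Member output to locate the nested FunctionDefinitionAst, which has a ‘GetHelpContent’ method of its own. The sample function is made up, and this is not a claim about what ‘Get-Help’ does internally.

```powershell
# Hypothetical function with its help inside the body, as in the separated .PS1 layout.
$functionText = @'
function Get-Sample {
<#
    .SYNOPSIS
        Returns a sample greeting

    .PARAMETER Name
        The name to greet
#>
    param ([string]$Name)
    "Hello, $Name"
}
'@

$ast = [scriptblock]::Create($functionText).Ast

# The script itself has no param block; that block belongs to the function.
$null -eq $ast.ParamBlock

# Find the first function definition nested anywhere in the script.
$functionAst = $ast.Find({
    param($node)
    $node -is [System.Management.Automation.Language.FunctionDefinitionAst]
}, $true)

# Ask the function node, rather than the script node, for its help.
$functionHelp = $functionAst.GetHelpContent()
$functionHelp.Synopsis
```

If a file could contain more than one function, ‘FindAll’ with the same predicate would return all of them so each can be checked in turn.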