SIMPLEXpress

Regular expressions simplified.

Regular expressions are an incredibly powerful tool, but let's face it: they're not the easiest to write or read, and sometimes they're too powerful for what we need.

Simple.

SIMPLEXpress uses a visible and predictable syntax to make writing expressions easier. The syntax design trades some brevity for huge gains in readability. Nearly everything is literal, except within a "unit".

Units always begin with one of two characters, '^' for a match-only unit or '~' for a snag unit, and end with a '/'. Only three symbols are always reserved: '^' and '~' to start units, and '%' to escape them. This ability to easily "switch" between literal and symbolic modes means less escaping, less memorizing reserved symbols, and less referencing online documentation and tutorials.
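
To make the unit rules above concrete, here is a minimal Python sketch that splits a pattern into literal segments, match-only units, and snag units. It is not part of SIMPLEXpress: the function name split_units and the segment labels are hypothetical, and the sketch assumes a unit ends at the first unescaped '/', so it does not handle units nested inside groups.

```python
from typing import List, Tuple


def split_units(pattern: str) -> List[Tuple[str, str]]:
    """Split a pattern into (kind, text) pairs.

    kind is 'literal', 'match' (a '^.../' unit) or 'snag' (a '~.../' unit).
    Hypothetical helper for illustration only; nested units are not handled.
    """
    segments: List[Tuple[str, str]] = []
    literal: List[str] = []
    i = 0
    while i < len(pattern):
        ch = pattern[i]
        if ch == "%" and i + 1 < len(pattern) and pattern[i + 1] in "^~":
            # Outside a unit, '%' escapes a following '^' or '~'.
            literal.append(pattern[i + 1])
            i += 2
        elif ch in "^~":
            # Flush any pending literal text before the unit starts.
            if literal:
                segments.append(("literal", "".join(literal)))
                literal = []
            kind = "match" if ch == "^" else "snag"
            body: List[str] = []
            i += 1
            while i < len(pattern) and pattern[i] != "/":
                if pattern[i] == "%" and i + 1 < len(pattern):
                    # Inside a unit, '%' escapes the next character,
                    # so an escaped '/' does not close the unit.
                    body.append(pattern[i:i + 2])
                    i += 2
                else:
                    body.append(pattern[i])
                    i += 1
            if i >= len(pattern):
                raise ValueError("unterminated unit: missing closing '/'")
            segments.append((kind, "".join(body)))
            i += 1  # skip the closing '/'
        else:
            literal.append(ch)
            i += 1
    if literal:
        segments.append(("literal", "".join(literal)))
    return segments
```

For example, split_units("abc^(123)?/") yields one literal segment 'abc' followed by one match-only unit '(123)?', and split_units("~{(abc)}/123") yields a snag unit followed by the literal '123'.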

Expanded.

In regular expressions, you often have to write your own complicated character sets, and in some engines Unicode support is only available through third-party libraries.

On the other hand, SIMPLEXpress has dozens of character classes, called "specifiers". Most also have sub-classes for more fine-grained matching. All of these specifiers automatically kick in within a unit. For example, you could use 'o' to match all common math operators, 'lu' to match all uppercase letters, and 'al' to match any alphanumeric character that isn't uppercase.

You can even match a range of Unicode characters with the specifier "u123-456"!

Express.

SIMPLEXpress was designed specifically for lexing and parsing, a task that regular expressions are infamously ill-suited for. Because of this goal, SIMPLEXpress is designed to be fast and efficient.

That second reserved symbol, '~', is used to "snag" units and arbitrary literal segments, which can then be returned on demand. This, paired with efficiency, makes SIMPLEXpress ideal for language parsing.

Syntax

Below is the basic syntax for SIMPLEXpress, according to the most recent specification draft. (Subject to change.)

Operators

Only the '^' and '~' symbols are hard-reserved. All other symbols work as described only within a unit ('^.../' or '~.../'), except '%', which also works outside units when it precedes a hard-reserved character.

Symbol | Usage | Example
^ | Start unit. | ^.../
~ | Snag, aka capture group. | ~.../
[ ] | Set: Match any one of the unit values within. Space delimited. | ^[(abc) (123)]/ matches 'abc' or '123'.
< > | Literal Set: Any literal character within. | ^<abc>/ matches 'a', 'b', or 'c'.
( ) | Group: Allows for literal characters, strings, and further nested units within a unit. | ^(a)(bc)?/ matches either 'a' or 'abc'.
% | Escape following character (literal). Also works outside of units when preceding '^' or '~'. | ^<12>/^%*?/ matches '1', '2', '1*', or '2*'.
{ } | Exclusion: Anything within is checked but is not returned as part of the result. Parallels regex "lookahead" and "lookbehind". | ~{(abc)}/123 matches 'abc123', but returns only '123'.
. | Matches any character. | ^./23 matches 'z23' and anything else with a single character followed by '23'.
+ | Multiple. | ^(abc)+/ matches 'abc', 'abcabc', and so forth.
? | Optional. | abc^(123)?/ matches 'abc' and 'abc123'.
* | Optional multiple. | abc^(123)*/ matches 'abc', 'abc123', 'abc123123', and so forth.
#1, #2-3 (etc) | Exact number or range of matches. | abc^(123)#2-3/ only matches 'abc123123' or 'abc123123123'.
! | NOT operator. | ^!<abc>/ only matches a single character that is NOT 'a', 'b', or 'c'.
$ | Line boundary (line beginning/end). | ^$/abc^$/ only matches 'abc' if it is the entirety of the line.
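
The repetition operators above ('+', '?', '*', and '#') have close counterparts in conventional regular expressions. The following Python sketch shows one way the correspondence could be written down. The function name quantifier_to_regex is hypothetical, and the mapping is inferred only from the examples in the table, not from the SIMPLEXpress implementation.

```python
import re


def quantifier_to_regex(q: str) -> str:
    """Translate a repetition suffix ('+', '?', '*', '#2', '#2-3')
    into the equivalent Python regex quantifier (hypothetical helper)."""
    if q in ("+", "?", "*"):
        # Per the table, these mean the same thing in both notations.
        return q
    m = re.fullmatch(r"#(\d+)(?:-(\d+))?", q)
    if m is None:
        raise ValueError(f"not a repetition suffix: {q!r}")
    low, high = m.group(1), m.group(2)
    # '#2' becomes '{2}', '#2-3' becomes '{2,3}'.
    return f"{{{low}}}" if high is None else f"{{{low},{high}}}"


# Regex counterpart of the table's example abc^(123)#2-3/
pattern = "abc(?:123)" + quantifier_to_regex("#2-3")
print(pattern)  # abc(?:123){2,3}
print(bool(re.fullmatch(pattern, "abc123123")))  # True
print(bool(re.fullmatch(pattern, "abc123")))     # False
```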

Specifiers

All specifiers start with a single letter, and only function within a unit. Lowercase is a match, uppercase inverts the logic (A = NOT alphanumeric).

Specifier | Usage
a | alphanumeric
c | classification (Reserved for later expanded character classes, such as 'c_hangal' for Hangul characters).
d | digit
e | extended Latin
g | Greek/Coptic
i | IPA (International Phonetic Alphabet)
l | Latin letter
n | newline ('\n')
o | math operator
p | punctuation
r | carriage return ('\r')
s | literal space
t | tab
u# | unicode (accepts 'u78' or 'u57-78')
w | whitespace

Most specifiers can also include 'u' or 'l' after the first character to indicate uppercase or lowercase. For example, '^au/' indicates alphanumeric uppercase, while '^gl/' indicates Greek/Coptic lowercase. The suffix is ignored, without error, if case doesn't apply.
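
As a rough illustration of how specifiers, case suffixes, and inversion fit together, the Python sketch below turns a specifier string into a one-character test. Everything in it is an assumption made for illustration: the names BASE and compile_specifier are hypothetical, only a few specifiers are covered, 'a', 'd', and 'l' are treated as ASCII-only, and the numbers after 'u' are read as hexadecimal code points, which the draft does not specify. The case suffix is modelled as "not the opposite case", so characters without case (such as digits) pass through unchanged.

```python
import re
from typing import Callable, Dict

Predicate = Callable[[str], bool]

# Assumed meanings for a subset of specifiers (ASCII-only where noted).
BASE: Dict[str, Predicate] = {
    "a": lambda c: c.isascii() and c.isalnum(),  # alphanumeric (ASCII assumed)
    "d": lambda c: c.isascii() and c.isdigit(),  # digit (ASCII assumed)
    "l": lambda c: c.isascii() and c.isalpha(),  # Latin letter (ASCII assumed)
    "n": lambda c: c == "\n",
    "r": lambda c: c == "\r",
    "s": lambda c: c == " ",
    "t": lambda c: c == "\t",
    "w": lambda c: c.isspace(),
}


def compile_specifier(spec: str) -> Predicate:
    """Build a predicate for a specifier such as 'lu', 'al', 'A' or 'u41-5a'.

    Hypothetical helper; not the SIMPLEXpress API.
    """
    if not spec:
        raise ValueError("empty specifier")
    head, rest = spec[0], spec[1:]
    inverted = head.isupper()  # an uppercase first letter inverts the logic
    key = head.lower()
    if key == "u":
        # Assumption: code points are written in hexadecimal.
        m = re.fullmatch(r"([0-9a-fA-F]+)(?:-([0-9a-fA-F]+))?", rest)
        if m is None:
            raise ValueError(f"bad unicode specifier: {spec!r}")
        low = int(m.group(1), 16)
        high = int(m.group(2) or m.group(1), 16)
        test: Predicate = lambda c: low <= ord(c) <= high
    elif key in BASE and rest in ("", "u", "l"):
        base = BASE[key]
        if rest == "u":
            test = lambda c: base(c) and not c.islower()
        elif rest == "l":
            test = lambda c: base(c) and not c.isupper()
        else:
            test = base
    else:
        raise ValueError(f"unsupported specifier in this sketch: {spec!r}")
    return (lambda c: not test(c)) if inverted else test
```

Under these assumptions, compile_specifier("lu") accepts 'Q' but rejects 'q', and compile_specifier("al") accepts both 'q' and '5' but rejects 'Q'.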

FAQs: SIMPLEXpress


Why are you building SIMPLEXpress?

SIMPLEXpress is being created primarily as the keystone for Ratscript's language parser, as well as to aid in controlling content loading in Trailcrest.

Will SIMPLEXpress completely replace regular expressions?

Definitely not. Each has its place, and while SIMPLEXpress is designed to be better suited to language parsing and most common cases, regular expressions can still be useful for some more complicated and advanced cases.

What is SIMPLEXpress's license?

SIMPLEXpress is licensed under the BSD-3-Clause license, allowing you to use it freely in any project.

When will SIMPLEXpress be ready?

SIMPLEXpress is in active development. It is presently a high-priority task, as we will need it for both Ratscript and Trailcrest.

Can I help with development?

We'd love your help! Check out our Developers page for details on how you can get involved.