Build and maintain strgroup, a high-performance Stata plugin for fuzzy string matching using Levenshtein edit distance with C plugin optimization.
Expert guidance for working with the **strgroup** Stata plugin, a high-performance fuzzy string matching tool authored by Julian Reif (University of Illinois).
**strgroup** groups similar strings based on Levenshtein edit distance with a user-specified similarity threshold. The project uses a two-layer architecture:
1. **Stata ADO layer** (`src/strgroup.ado`, `src/levenshtein.ado`) — handles syntax parsing, validation, platform detection, and plugin loading
2. **C plugin layer** (`src/c/strgroup.c`) — implements the Levenshtein algorithm and transitive-closure grouping with Union-Find optimization
Platform-specific compiled plugins (`.plugin` files) are stored in `src/` for Windows 32/64, Unix, and macOS.
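To make the C layer's job concrete, here is a minimal sketch of the two ideas named above: a Levenshtein distance and a Union-Find pass that merges every pair of strings whose normalized distance is within the threshold. It is not the code in `src/c/strgroup.c`. The function names (`levenshtein`, `normalized_distance`, `group_strings`), the all-pairs loop, and the choice to normalize by the shorter string's length are assumptions made for illustration.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Levenshtein edit distance with two rolling rows (O(strlen(b)) memory).
   Returns SIZE_MAX if memory cannot be allocated. */
size_t levenshtein(const char *a, const char *b)
{
    size_t la = strlen(a), lb = strlen(b);
    size_t *prev = malloc((lb + 1) * sizeof *prev);
    size_t *curr = malloc((lb + 1) * sizeof *curr);
    if (!prev || !curr) { free(prev); free(curr); return SIZE_MAX; }

    for (size_t j = 0; j <= lb; j++) prev[j] = j;   /* "" -> b[0..j) */
    for (size_t i = 1; i <= la; i++) {
        curr[0] = i;                                /* a[0..i) -> "" */
        for (size_t j = 1; j <= lb; j++) {
            size_t best = prev[j - 1] + (a[i - 1] != b[j - 1]); /* substitute or match */
            if (prev[j] + 1 < best) best = prev[j] + 1;         /* delete from a */
            if (curr[j - 1] + 1 < best) best = curr[j - 1] + 1; /* insert into a */
            curr[j] = best;
        }
        size_t *tmp = prev; prev = curr; curr = tmp;
    }
    size_t d = prev[lb];
    free(prev);
    free(curr);
    return d;
}

/* ASSUMPTION: distance is divided by the shorter string's length.
   If one string is empty, any non-empty partner counts as maximally distant. */
double normalized_distance(const char *a, const char *b)
{
    size_t la = strlen(a), lb = strlen(b);
    size_t shorter = la < lb ? la : lb;
    size_t d = levenshtein(a, b);
    if (shorter == 0) return d == 0 ? 0.0 : 1.0;
    return (double)d / (double)shorter;
}

/* Union-Find: find the root of x, halving the path along the way. */
int uf_find(int *parent, int x)
{
    while (parent[x] != x) {
        parent[x] = parent[parent[x]];
        x = parent[x];
    }
    return x;
}

/* Merge two sets; the smaller root index becomes the representative. */
void uf_union(int *parent, int x, int y)
{
    int rx = uf_find(parent, x), ry = uf_find(parent, y);
    if (rx < ry) parent[ry] = rx;
    else if (ry < rx) parent[rx] = ry;
}

/* Writes a group id (index of the group's first member) into group[i].
   Matches are chained transitively: a~b and b~c put a, b, c together. */
void group_strings(const char **strings, int n, double threshold, int *group)
{
    assert(strings != NULL && group != NULL);
    for (int i = 0; i < n; i++) group[i] = i;
    for (int i = 0; i < n; i++)
        for (int j = i + 1; j < n; j++)
            if (normalized_distance(strings[i], strings[j]) <= threshold)
                uf_union(group, i, j);
    for (int i = 0; i < n; i++) group[i] = uf_find(group, i);
}
```

With a threshold of 0.25, the strings `abcd`, `abce`, and `abfe` end up in one group even though `abcd` and `abfe` differ by two edits, because each is within one edit of `abce`. Union-Find makes that chaining cheap, since merging two groups only repoints one root.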
All compilation is done from the `src/c/` directory. After building, copy the compiled `.plugin` file to `src/`.
To build the 64-bit Windows plugin with Cygwin from Claude Code on Windows, use:
```bash
"C:\\cygwin64\\bin\\bash.exe" -l -c "cd '<project_root>/src/c' && x86_64-w64-mingw32-gcc -shared stplugin.c strgroup.c -O3 -funroll-loops -o strgroup.windows64.plugin"
```
Then copy to `src/`:
```bash
"C:\\cygwin64\\bin\\bash.exe" -l -c "cd '<project_root>' && cp src/c/strgroup.windows64.plugin src/strgroup.windows64.plugin"
```
To build the 32-bit Windows plugin, use the command below. Recent Cygwin GCC releases may no longer accept `-mno-cygwin`; if so, a MinGW cross-compiler such as `i686-w64-mingw32-gcc` is the usual substitute.
```bash
"C:\\cygwin64\\bin\\bash.exe" -l -c "cd '<project_root>/src/c' && gcc -shared -mno-cygwin stplugin.c strgroup.c -O3 -funroll-loops -o strgroup.windows32.plugin"
```
To build the Unix plugin, run from `src/c/`:
```bash
gcc -shared -fPIC -DSYSTEM=OPUNIX stplugin.c strgroup.c -O3 -funroll-loops -o strgroup.unix.plugin
```
To build the universal macOS plugin (x86_64 and arm64), run from `src/c/`:
```bash
clang -bundle -o strgroup.macosx.x86_64 stplugin.c strgroup.c -DSYSTEM=APPLEMAC -target x86_64-apple-macos10.11
clang -bundle -o strgroup.macosx.arm64 stplugin.c strgroup.c -DSYSTEM=APPLEMAC -target arm64-apple-macos11
lipo -create -output strgroup.macosx.plugin strgroup.macosx.x86_64 strgroup.macosx.arm64
```
On Windows, Stata is typically installed at `C:\Program Files\Stata19\StataMP-64.exe`. Run the tests in batch mode via PowerShell:
```bash
powershell.exe -Command "Start-Process -FilePath 'C:\Program Files\Stata19\StataMP-64.exe' -ArgumentList '/e do test/examples.do' -WorkingDirectory '<project_root>/test' -Wait -NoNewWindow"
```
Timing comparison (threshold=0.3, random strings of 5–15 characters):
| N     | v1.0.4, without `first` | v1.0.5, without `first` | Speedup |
|-------|-------------------------|-------------------------|---------|
| 10000 | 4.09s                   | 0.96s                   | 4.3x    |
| 20000 | 16.51s                  | 3.72s                   | 4.4x    |
| 30000 | 37.86s                  | 8.41s                   | 4.5x    |
With the `first=yes` option, the speedup reaches 6.8x–10.5x.
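The figures in the table can be checked with a little arithmetic. The sketch below recomputes a speedup from two timings and compares how runtime grows with N against a quadratic prediction. The helper names are hypothetical, and reading the growth as a sign of all-pairs comparison is an inference from the numbers, not something the benchmark states.

```c
#include <assert.h>

/* Ratio of the old runtime to the new runtime, both in seconds. */
double speedup(double old_seconds, double new_seconds)
{
    assert(new_seconds > 0.0);
    return old_seconds / new_seconds;
}

/* Observed runtime growth between two runs divided by the growth that
   t ~ N^2 would predict. Values close to 1.0 suggest quadratic scaling. */
double observed_over_quadratic(double n1, double t1, double n2, double t2)
{
    assert(n1 > 0.0 && t1 > 0.0);
    double observed = t2 / t1;
    double predicted = (n2 / n1) * (n2 / n1);
    return observed / predicted;
}
```

For example, `speedup(4.09, 0.96)` is about 4.26, matching the 4.3x in the first row, and `observed_over_quadratic(10000, 4.09, 30000, 37.86)` is about 1.03. In other words, tripling N multiplies the v1.0.4 runtime by roughly nine, so both versions appear to scale about quadratically and v1.0.5 mainly lowers the constant factor.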
Users install via:
```stata
net install strgroup, from("https://raw.githubusercontent.com/reifjulian/strgroup/master") replace
```
The package manifest is defined in `stata.toc` and `strgroup.pkg`.

A typical change follows these steps:
1. **Modify the C source** in `src/c/strgroup.c` or `src/c/stplugin.c`
2. **Compile** for the target platform(s) using the commands above
3. **Copy** the compiled `.plugin` file to `src/`
4. **Test** via `do test/examples.do` in Stata
5. **Compare** output against `test/compare/*.dta` files