Work

Flattening 8,000-file Android APKs into a single Parquet table

A research tool that lets non-engineers grep across years of Android apps in seconds. AppInspect, built for the SFB 1187 platform-studies group at Siegen.

2021–2023 ·Software engineer · University of Siegen ·Security, Data

Notes from building AppInspect.

Between 2021 and 2023, at the SFB 1187 “Media of Cooperation” (University of Siegen), I built AppInspect for media- and platform-studies researchers. The target question: “how have TikTok’s permissions evolved across versions in China vs. the international market?” — answered in seconds, not days.

The engineering problem was a small-file-I/O problem, handled by flattening the decoded APK corpus into a single columnar table.

The code lives at jason-chao/appinspect-worker.

What’s inside an APK, and why it matters

Before the engineering: a one-screen tour of what’s in an Android .apk, because every later decision turns on it.

An APK is a zip with four interesting groups of contents: manifest, bytecode, resources, signature
An Android APK, grouped by the four kinds of question one might ask of it. Diagram by Jason Chao.

Four groups, four kinds of question. AppInspect ingests all four — the trick was making them queryable at corpus scale.

The problem nobody warns you about

An Android APK is just a zip file. So is a .docx, so is a .jar. Open one with unzip and you get a polite handful of files: a manifest, some resources, one or two classes.dex. Easy.

The problem starts when you actually want to look inside the bytecode. The standard move is to disassemble it with apktool, which expands the DEX bytecode into Smali — a human-readable assembly dialect — and unpacks every embedded resource. One real example from the AppInspect corpus:

A 7 MB APK explodes into 8,027 files and 44.6 MB on disk
A 7.09 MB APK becomes 8,027 files after disassembly with apktool. Screenshot from AppInspect corpus, University of Siegen.

A 7.09 MB ChatGPT-clone APK becomes 8,027 files across 804 folders, weighing 44.6 MB on disk. Of those 8,027 files, the vast majority are tiny Smali files — one per Java/Kotlin class — many under a kilobyte.

Now multiply that by hundreds of apps and dozens of historical versions per app. You’re suddenly running a corpus of millions of small files, and any filesystem you point at it will protest. On the ext4 volumes I was using, scanning the disassembled code of a single APK with naive grep -r took 3 to 5 minutes. Researchers wanted to scan hundreds of apps per query. The arithmetic was hostile.

This wasn’t a CPU problem. It was an I/O problem. Anybody who has tried to rsync a node_modules directory has met its smaller cousin: the per-file overhead — open, stat, seek, close — dominates the actual reading. Parallelism doesn’t help much when the bottleneck is the inode lookup.

What researchers actually wanted to ask

Before describing the fix, here’s the shape of the questions the tool needed to answer. They came from collaborators in the Digital Methods Initiative Winter School 2022 and from the SFB 1187 group:

  • Which third-party trackers ship inside this app, and how has the list changed across versions?
  • Does this app declare a SensorEventListener? (i.e. can it read the gyroscope or accelerometer without asking for a permission?)
  • Which hard-coded URLs and SDK class names appear in this version that weren’t in the previous one?
  • Is the certificate signing this APK the same certificate that signed the version published two years ago?

Each question is, at its core, a substring search across a labelled corpus of decoded files, with version and app-ID as the labels. None of them needs a call graph or a taint analysis. (More on that limitation later.)

So the engineering job was: make substring search across the corpus feel cheap.

The fix: collapse the filesystem into a Parquet table

The key move is embarrassingly simple in retrospect. Instead of letting decoded APKs sprawl across the filesystem, I flattened them into a single Apache Parquet table. One row per file inside one version of one APK. Columns capture the metadata; the file contents themselves go into either a text column or a blob column.

The schema:

ColumnTypeWhat it holds
app_idstringApp identifier from AndroidManifest.xml
version_codeintegerVersion code from AndroidManifest.xml
version_namestringHuman-readable version name
apk_sha256stringSHA-256 of the whole APK (catches tampering with same code)
relative_pathstringFile’s path inside the decoded APK
filenamestringBase filename
extensionstringFile extension (smali, xml, png, …)
content_textstringText content (Smali, XML, JSON, …) — NULL if binary
content_blobbytesBinary content (PNG, font, native .so) — NULL if text
content_modestring"text" or "blob" — used as the partition key
file_sha256stringSHA-256 of this individual file

Five things this schema buys you, none of which is obvious until you try the alternatives:

  1. Columnar reads. A Smali substring scan only needs app_id, version_code, relative_path, and content_text. Parquet reads only those columns from disk. The content_blob column — usually the heaviest — never enters memory.
  2. Compression where it matters most. Disassembled Smali is enormously repetitive: every method begins with the same tokens (.method, .locals, invoke-virtual, Landroid/...). Parquet’s per-column dictionary encoding plus Snappy collapses that down to a fraction of the on-disk size.
  3. Longitudinal queries are free. Because every version of every app shares one table, “show me how SensorEventListener references changed across TikTok versions from 2017 to 2022” is a WHERE app_id = 'com.zhiliaoapp.musically' AND content_text LIKE '%SensorEventListener%' away.
  4. Tamper detection by construction. I keep both the per-APK SHA-256 and the per-file SHA-256. Two APKs published with the same version_code but different bytes (which happens — including for malicious reasons) become distinguishable rows, not silently merged blobs.
  5. De-duplication, almost for free. Most of the files inside successive minor versions of an app are byte-identical. If you key your downstream analyses on file_sha256 you skip the duplicates without having to think about it.

I partition the table on content_mode so the text-column workload (the common case) doesn’t have to skip past binary blobs:

Parquet table partitioned by content_mode = blob / text
Parquet partitions keep the common text workload off the binary blob column. Diagram by Jason Chao.

…and inside each partition, Parquet does the rest:

A single Parquet file holding the text content of one app's history
One 9.6 MB Parquet file replaces the thousands of Smali files for an app's entire history. Diagram by Jason Chao.

A single 9.6 MB Parquet file replaces the thousands of Smali files for one app’s entire history. Order-of-magnitude figures only — I never built a proper benchmark harness, and that’s a gap I would close if I were doing this again — but in practice, scanning a hundred decoded APKs went from “go make coffee” to a few seconds on a 5-node PySpark cluster of small VMs.

The pipeline, end to end

AppInspect pipeline: researchers query a web UI; AppInspect retrieves APKs, decodes them, and analyses them via PySpark on the Parquet table
The pipeline: decode once, query many times. Architecture diagram by Jason Chao.

Three deliberate choices inside this pipeline are worth naming:

Decoupling decode from query. Decoding with apktool is slow and single-threaded per APK. Querying via Spark is fast and embarrassingly parallel. Keeping them on opposite sides of the Parquet table means we decode each APK exactly once, ever, and amortise the cost across every future query.

Horizontal over vertical. It would have been cheaper, short-term, to ask the IT department for a single beefy machine. But scaling out to a small cluster of cheap VMs meant we could pull more nodes in for a workshop week and shrink back afterwards. For a research group that runs in bursts (Winter School, paper deadline, Winter School again), that elasticity matters more than peak per-node throughput.

Two tiers of UI. The web frontend exposes both analysis templates (a one-click Permission Scan, Tracker Scan, Code Scan, Signature Comparison) and a lower-level custom text search with extension filters and version pickers. Templates lower the floor for researchers who don’t write code; custom search raises the ceiling for those who do. The same Spark queries underpin both.

What it enables — and a caveat

Three example results from the DMI 2022 cohort, all of which fall out of the same WHERE content_text LIKE '%…%' GROUP BY app_id, version_code shape:

TikTok’s permission profile diverges between markets. Looking at the historical APKs of TikTok’s Chinese version (Douyin) and the international version, the Chinese variant shows a sharp rise in manufacturer-custom permissions starting around 2019 — visible as the white “other permissions” band:

Permission categories over time, TikTok / WeChat / VK / Messenger
Permission categories declared by TikTok, WeChat, VK and Messenger over time. Visualisation by Andrea Benedetti, DensityDesign. AppInspect collaboration, SFB 1187, University of Siegen.

App size grows in step. Plotting APK filesize over time across China and international variants shows the two builds diverging in ways that are hard to explain by features alone:

Filesize over time, TikTok / WeChat / VK
APK filesize over time across TikTok (China vs. international), WeChat and VK. Visualisation by Andrea Benedetti, DensityDesign. AppInspect collaboration, SFB 1187.

The capability to read sensors changes hands across markets. Scanning Smali for SensorEventListener references over time shows the Chinese TikTok variant retaining sensor-reading code throughout, while the international variant trims it. (On Android, accessing sensors at under 50 Hz does not require a declared permission — so a permissions scan alone misses this.)

SensorEventListener references over time, by market
SensorEventListener references in the Chinese and international TikTok variants over time. Visualisation by Andrea Benedetti, DensityDesign. AppInspect collaboration, SFB 1187.

Third-party SDK divergence is immediately visible. A tracker-scan across versions of each app, laid out side by side, shows which SDKs ship in each market:

Side-by-side table of third-party SDK declarations across TikTok versions, China vs. international
Third-party SDK declarations per TikTok version, China vs. international. Output of a single AppInspect query, grouped by version. AppInspect query output, SFB 1187.

The underlying query is a few lines of PySpark over the Parquet table:

A fragment of PySpark code that extracts URLs from Smali const-string tokens and groups the result by app and version
The analysis-template layer is PySpark over the Parquet table; this fragment extracts URLs from Smali const-string tokens and groups the result by app and version. Source: AppInspect analysis template.

A specific caveat worth naming: AppInspect is large-scale text search over decompiled bytecode, not rigorous static analysis. No call graph, no data-flow, no taint tracking. Detecting a SensorEventListener reference is detecting capability, not use. Smali strings can be obfuscated, dead-stripped, or loaded via reflection — substring search will miss those.

The distinction matters. “App X contains code that can read the gyroscope” is falsifiable. “App X spies on you with the gyroscope” is not something the tool can show.

What I’d change

I stopped working on AppInspect when I left the SFB in 2024. With hindsight:

  • Open-source and containerise from day one. The code is on GitHub, but as a transparency dump rather than a runnable artefact. A Docker Compose stack and a make demo target would have multiplied the tool’s reach.
  • Build a proper benchmark harness. “A few seconds on a hundred APKs” is the kind of claim a reviewer rolls their eyes at. A small script that varies corpus size, partition strategy, and column selection would have produced numbers I could defend.
  • Add a thin call-graph layer above the substring index. Even a coarse Androguard-style cross-reference graph, stored as a second table keyed on (app_id, version_code, class_name), would let researchers ask follow-up questions ("who calls this?") without leaving the tool.
  • Treat security as an explicit requirement. Running apktool on user-uploaded APKs is a sandbox question. We never had an incident, but we never had a real defence either. A gVisor or rootless-Docker per-decode wrapper would have been the right move.
  • Take the iOS question seriously. A reviewer at RSECon23 asked what it would take to do the same for iOS. The Parquet schema ports cleanly. The decode toolchain (Mach-O, ARM64 disassembly, Bitcode where it survives) does not. It’s a real piece of work, not a porting exercise.

Publication

The design and evaluation of AppInspect and the associated digital-methods workflow are documented in:

Chao, J., van Geenen, D., Gerlitz, C. & van der Vlist, F. N. (2024). Digital methods for sensory media research: Toolmaking as a critical technical practice. Convergence: The International Journal of Research into New Media Technologies, 30(1), 236–263. https://doi.org/10.1177/13548565241226791


AppInspect was built at the SFB 1187 “Media of Cooperation” at the University of Siegen, with collaborators Anne Helmond, Esther Weltevrede, Michael Dieter, and participants of the Digital Methods Initiative Winter School 2022. The TikTok / WeChat / VK / Messenger visualisations were designed by Andrea Benedetti at DensityDesign. Source: jason-chao/appinspect-worker.