Navigate the fragmented landscape of Hebrew and Yiddish ML datasets and models. Covers ivrit.ai (22K+ hours of Hebrew audio, whisper-large-v3 ASR variants, Yiddish models), Dicta (DictaLM 3.0 LLM family, DictaBERT variants, HeQ reading comprehension), the Israeli National NLP Program / NNLP-IL (HebrewSentiment, HebNLI), AlephBERT, and Knesset Plenums. Helps researchers and ML engineers pick the right dataset for a task by use case, license (commercial vs research), Hebrew register coverage, and model-dataset pairing. Use when choosing training data for a Hebrew NLP or ASR project, verifying license compatibility for a commercial product, finding a baseline model for a Hebrew downstream task, or exploring Yiddish ML resources. Do NOT use for Arabic NLP, general HuggingFace dataset discovery, or Hebrew OCR dataset selection (use hebrew-ocr-forms).
Trust score 88/100 (Trusted) · 7+ installs · 3 GitHub contributors · MIT license
The Israeli ML community punches above its weight, but the datasets and models are scattered. ivrit.ai publishes world-class Hebrew speech corpora on one HuggingFace org, Dicta publishes Hebrew LLMs and BERT variants on another, and the Israeli National NLP Program maintains benchmarks under HebArabNlpProject. Licenses range from fully commercial-friendly to research-only. A researcher trying to pick the right combination for fine-tuning a Hebrew sentiment classifier on customer support chat for a commercial product has to hunt across five orgs and read every dataset card.
npx skills-il add skills-il/developer-tools --skill hebrew-ml-datasets-navigator -a claude-code

I want to train a sentiment classifier on Hebrew customer support chat for a commercial SaaS product. Which dataset should I use, which starting model, and what does the license say about attribution?
I am building a Hebrew podcast transcription product. What does ivrit.ai offer, which ASR model should I use in production with low latency, and how do I handle multiple speakers?
I need a Hebrew LLM that runs on consumer hardware (16GB VRAM max) for a Hebrew product. What does Dicta offer, what are the size differences, and what are the upstream licenses?
I am researching Yiddish and looking for datasets and models for speech recognition and text processing. What is available in 2026 and what are the licenses?
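Questions like the ones above boil down to filtering a small catalog of resources by task before checking each card's license by hand. A minimal sketch of that lookup, using only dataset names mentioned on this page (the task labels are my own shorthand, and license status is deliberately left out because it must be verified on each dataset card):

```python
# Illustrative catalog of Hebrew resources named on this page.
# Task tags are informal shorthand, not official metadata; licenses
# vary per resource and must be checked on each HuggingFace card.
CATALOG = [
    {"name": "HebrewSentiment", "task": "sentiment"},
    {"name": "HebNLI", "task": "nli"},
    {"name": "HeQ", "task": "reading-comprehension"},
    {"name": "ivrit.ai audio corpora", "task": "asr"},
    {"name": "Knesset Plenums", "task": "asr"},
]

def candidates(task: str) -> list[str]:
    """Return dataset names matching a task tag.

    This only narrows the search; license compatibility for a
    commercial product still requires reading each dataset card.
    """
    return [d["name"] for d in CATALOG if d["task"] == task]

print(candidates("asr"))  # → ['ivrit.ai audio corpora', 'Knesset Plenums']
```

In practice the skill answers these questions with richer metadata (register coverage, model pairings); the sketch just shows the shape of the task-first, license-second workflow.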