@InterfaceAudience.Public
@InterfaceStability.Stable
public abstract class FileInputFormat<K,V> extends InputFormat<K,V>

A base class for file-based InputFormats.

FileInputFormat is the base class for all file-based InputFormats. This provides a generic implementation of getSplits(JobContext).

Implementations of FileInputFormat can also override the isSplitable(JobContext, Path) method to prevent input files from being split-up in certain situations. Implementations that may deal with non-splittable files must override this method, since the default implementation assumes splitting is always possible.
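For illustration, here is a minimal sketch of such an override. The class name WholeFileInputFormat and the key/value types are illustrative choices, not part of this API; a real subclass would also supply a working record reader.

```java
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

// Hypothetical subclass that always processes whole files: each mapper
// sees exactly one file because isSplitable(...) returns false.
public class WholeFileInputFormat extends FileInputFormat<LongWritable, Text> {

  @Override
  protected boolean isSplitable(JobContext context, Path filename) {
    // Never split input files, regardless of their size or block layout.
    return false;
  }

  @Override
  public RecordReader<LongWritable, Text> createRecordReader(
      InputSplit split, TaskAttemptContext context) {
    // A real implementation would return a reader that emits the file's
    // contents; omitted here to keep the sketch focused on isSplitable.
    throw new UnsupportedOperationException("record reader not implemented in this sketch");
  }
}
```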
Modifier and Type | Field and Description |
---|---|
static int | DEFAULT_LIST_STATUS_NUM_THREADS |
static String | INPUT_DIR |
static String | INPUT_DIR_NONRECURSIVE_IGNORE_SUBDIRS |
static String | INPUT_DIR_RECURSIVE |
static String | LIST_STATUS_NUM_THREADS |
static String | NUM_INPUT_FILES |
static String | PATHFILTER_CLASS |
static String | SPLIT_MAXSIZE |
static String | SPLIT_MINSIZE |
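These constants name the configuration keys written by the static setters below. As a hedged sketch, a driver can read the settings back through the constants rather than hard-coded property strings; the fallback defaults chosen here are illustrative and not guaranteed to match FileInputFormat's internal defaults.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

public class InputConfInspection {
  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance();
    Configuration conf = job.getConfiguration();

    // Read back settings through the public key constants instead of
    // hard-coded property strings.
    String inputDirs = conf.get(FileInputFormat.INPUT_DIR, "");
    long maxSplit = conf.getLong(FileInputFormat.SPLIT_MAXSIZE, Long.MAX_VALUE);
    int listThreads = conf.getInt(FileInputFormat.LIST_STATUS_NUM_THREADS,
        FileInputFormat.DEFAULT_LIST_STATUS_NUM_THREADS);

    System.out.println("input dirs: " + inputDirs);
    System.out.println("max split size: " + maxSplit);
    System.out.println("listStatus threads: " + listThreads);
  }
}
```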
Constructor and Description |
---|
FileInputFormat() |
Modifier and Type | Method and Description |
---|---|
static void | addInputPath(Job job, Path path) - Add a Path to the list of inputs for the map-reduce job. |
protected void | addInputPathRecursively(List<FileStatus> result, FileSystem fs, Path path, PathFilter inputFilter) - Add files in the input path recursively into the results. |
static void | addInputPaths(Job job, String commaSeparatedPaths) - Add the given comma separated paths to the list of inputs for the map-reduce job. |
protected long | computeSplitSize(long blockSize, long minSize, long maxSize) |
protected int | getBlockIndex(BlockLocation[] blkLocations, long offset) |
protected long | getFormatMinSplitSize() - Get the lower bound on split size imposed by the format. |
static boolean | getInputDirRecursive(JobContext job) |
static PathFilter | getInputPathFilter(JobContext context) - Get a PathFilter instance of the filter set for the input paths. |
static Path[] | getInputPaths(JobContext context) - Get the list of input Paths for the map-reduce job. |
static long | getMaxSplitSize(JobContext context) - Get the maximum split size. |
static long | getMinSplitSize(JobContext job) - Get the minimum split size. |
List<InputSplit> | getSplits(JobContext job) - Generate the list of files and make them into FileSplits. |
protected boolean | isSplitable(JobContext context, Path filename) - Is the given filename splittable? Usually true, but if the file is stream compressed, it will not be. |
protected List<FileStatus> | listStatus(JobContext job) - List input directories. |
protected FileSplit | makeSplit(Path file, long start, long length, String[] hosts) - A factory that makes the split for this class. |
protected FileSplit | makeSplit(Path file, long start, long length, String[] hosts, String[] inMemoryHosts) - A factory that makes the split for this class. |
static void | setInputDirRecursive(Job job, boolean inputDirRecursive) |
static void | setInputPathFilter(Job job, Class<? extends PathFilter> filter) - Set a PathFilter to be applied to the input paths for the map-reduce job. |
static void | setInputPaths(Job job, Path... inputPaths) - Set the array of Paths as the list of inputs for the map-reduce job. |
static void | setInputPaths(Job job, String commaSeparatedPaths) - Sets the given comma separated paths as the list of inputs for the map-reduce job. |
static void | setMaxInputSplitSize(Job job, long size) - Set the maximum split size. |
static void | setMinInputSplitSize(Job job, long size) - Set the minimum input split size. |
static FileStatus | shrinkStatus(FileStatus origStat) - The HdfsBlockLocation includes a LocatedBlock which contains messages for issuing more detailed queries to datanodes about a block, but these messages are useless during job submission currently. |
Methods inherited from class InputFormat: createRecordReader
public static final String INPUT_DIR
public static final String SPLIT_MAXSIZE
public static final String SPLIT_MINSIZE
public static final String PATHFILTER_CLASS
public static final String NUM_INPUT_FILES
public static final String INPUT_DIR_RECURSIVE
public static final String INPUT_DIR_NONRECURSIVE_IGNORE_SUBDIRS
public static final String LIST_STATUS_NUM_THREADS
public static final int DEFAULT_LIST_STATUS_NUM_THREADS
public static void setInputDirRecursive(Job job, boolean inputDirRecursive)
Parameters:
  job - the job to modify
  inputDirRecursive -

public static boolean getInputDirRecursive(JobContext job)
Parameters:
  job - the job to look at.
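As a hedged illustration of setInputDirRecursive and getInputDirRecursive, the snippet below enables recursive scanning for a job. The path /data/logs is a placeholder, and TextInputFormat is just an arbitrary concrete FileInputFormat.

```java
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

public class RecursiveInputExample {
  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance();
    job.setInputFormatClass(TextInputFormat.class);

    // "/data/logs" is a placeholder path; point it at a directory tree
    // whose nested subdirectories should all be scanned for input files.
    FileInputFormat.addInputPath(job, new Path("/data/logs"));
    FileInputFormat.setInputDirRecursive(job, true);

    // The setting can be read back from the job context later.
    boolean recursive = FileInputFormat.getInputDirRecursive(job);
    System.out.println("recursive input scan: " + recursive);
  }
}
```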
protected long getFormatMinSplitSize()
Get the lower bound on split size imposed by the format.

protected boolean isSplitable(JobContext context, Path filename)
Is the given filename splittable? Usually true, but if the file is stream compressed, it will not be.
FileInputFormat always returns true. Implementations that may deal with non-splittable files must override this method.
FileInputFormat implementations can override this and return false to ensure that individual input files are never split-up so that Mappers process entire files.
Parameters:
  context - the job context
  filename - the file name to check
public static void setInputPathFilter(Job job, Class<? extends PathFilter> filter)
Set a PathFilter to be applied to the input paths for the map-reduce job.
Parameters:
  job - the job to modify
  filter - the PathFilter class used for filtering the input paths.
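A small, hypothetical usage sketch of setInputPathFilter: the SkipTmpFilesFilter class and the /data/events path are made up for illustration. The filter class is instantiated by the framework, so it should have a public no-argument constructor.

```java
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.PathFilter;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

public class FilteredInputExample {

  // Hypothetical filter that skips temporary files left behind by an upstream job.
  public static class SkipTmpFilesFilter implements PathFilter {
    @Override
    public boolean accept(Path path) {
      return !path.getName().endsWith(".tmp");
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance();
    // "/data/events" is a placeholder input directory.
    FileInputFormat.addInputPath(job, new Path("/data/events"));
    FileInputFormat.setInputPathFilter(job, SkipTmpFilesFilter.class);
  }
}
```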
public static void setMinInputSplitSize(Job job, long size)
Set the minimum input split size.
Parameters:
  job - the job to modify
  size - the minimum size

public static long getMinSplitSize(JobContext job)
Get the minimum split size.
Parameters:
  job - the job

public static void setMaxInputSplitSize(Job job, long size)
Set the maximum split size.
Parameters:
  job - the job to modify
  size - the maximum split size

public static long getMaxSplitSize(JobContext context)
Get the maximum split size.
Parameters:
  context - the job to look at.
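As a brief, hedged sketch of the split-size setters and getters: the byte values below are arbitrary examples, chosen only to show splits being bounded between a configured minimum and maximum. The /data/input path is a placeholder.

```java
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

public class SplitSizeExample {
  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance();
    FileInputFormat.addInputPath(job, new Path("/data/input")); // placeholder path

    // Constrain split sizes: at least 32 MB, at most 256 MB per split.
    FileInputFormat.setMinInputSplitSize(job, 32L * 1024 * 1024);
    FileInputFormat.setMaxInputSplitSize(job, 256L * 1024 * 1024);

    // The getters read the values back from the job configuration.
    System.out.println("min split size: " + FileInputFormat.getMinSplitSize(job));
    System.out.println("max split size: " + FileInputFormat.getMaxSplitSize(job));
  }
}
```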
public static PathFilter getInputPathFilter(JobContext context)
Get a PathFilter instance of the filter set for the input paths.

protected List<FileStatus> listStatus(JobContext job) throws IOException
List input directories.
Parameters:
  job - the job to list input paths for and attach tokens to.
Throws:
  IOException - if zero items.

protected void addInputPathRecursively(List<FileStatus> result, FileSystem fs, Path path, PathFilter inputFilter) throws IOException
Add files in the input path recursively into the results.
Parameters:
  result - The List to store all files.
  fs - The FileSystem.
  path - The input path.
  inputFilter - The input filter that can be used to filter files/dirs.
Throws:
  IOException
public static FileStatus shrinkStatus(FileStatus origStat)
The HdfsBlockLocation includes a LocatedBlock which contains messages for issuing more detailed queries to datanodes about a block, but these messages are useless during job submission currently. Shrinking the status allows listStatus(JobContext) to scan more files with less memory footprint.
Parameters:
  origStat - The fat FileStatus.
See Also:
  BlockLocation, HdfsBlockLocation
protected FileSplit makeSplit(Path file, long start, long length, String[] hosts)
protected FileSplit makeSplit(Path file, long start, long length, String[] hosts, String[] inMemoryHosts)
public List<InputSplit> getSplits(JobContext job) throws IOException
Generate the list of files and make them into FileSplits.
Specified by:
  getSplits in class InputFormat<K,V>
Parameters:
  job - the job context
Returns:
  the InputSplits for the job.
Throws:
  IOException
protected long computeSplitSize(long blockSize, long minSize, long maxSize)
protected int getBlockIndex(BlockLocation[] blkLocations, long offset)
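For orientation, here is a hedged sketch of how a split size is typically derived from these inputs: in current Hadoop releases computeSplitSize clamps the file's block size between the configured minimum and maximum split sizes, but consult the source of your version for the authoritative logic. The numbers below are illustrative only.

```java
public class SplitSizeSketch {
  // Mirrors the usual clamping behaviour: the block size, bounded below by
  // minSize and above by maxSize.
  static long computeSplitSize(long blockSize, long minSize, long maxSize) {
    return Math.max(minSize, Math.min(maxSize, blockSize));
  }

  public static void main(String[] args) {
    long blockSize = 128L * 1024 * 1024; // a common HDFS block size
    long minSize = 1L;                   // illustrative minimum
    long maxSize = 64L * 1024 * 1024;    // illustrative maximum
    // With these numbers the split size is capped at 64 MB.
    System.out.println(computeSplitSize(blockSize, minSize, maxSize));
  }
}
```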
public static void setInputPaths(Job job, String commaSeparatedPaths) throws IOException
Sets the given comma separated paths as the list of inputs for the map-reduce job.
Parameters:
  job - the job
  commaSeparatedPaths - Comma separated paths to be set as the list of inputs for the map-reduce job.
Throws:
  IOException

public static void addInputPaths(Job job, String commaSeparatedPaths) throws IOException
Add the given comma separated paths to the list of inputs for the map-reduce job.
Parameters:
  job - The job to modify
  commaSeparatedPaths - Comma separated paths to be added to the list of inputs for the map-reduce job.
Throws:
  IOException
public static void setInputPaths(Job job, Path... inputPaths) throws IOException
Set the array of Paths as the list of inputs for the map-reduce job.
Parameters:
  job - The job to modify
  inputPaths - the Paths of the input directories/files for the map-reduce job.
Throws:
  IOException
public static void addInputPath(Job job, Path path) throws IOException
Add a Path to the list of inputs for the map-reduce job.
Parameters:
  job - The Job to modify
  path - Path to be added to the list of inputs for the map-reduce job.
Throws:
  IOException
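Putting the path-related methods together, a hedged end-to-end sketch of driver-side input configuration; all paths are placeholders invented for this example.

```java
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

public class InputPathSetup {
  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance();

    // Replace the input list wholesale (all paths here are placeholders)...
    FileInputFormat.setInputPaths(job, new Path("/data/2023"), new Path("/data/2024"));

    // ...then append further inputs, individually or as a comma separated string.
    FileInputFormat.addInputPath(job, new Path("/data/extra"));
    FileInputFormat.addInputPaths(job, "/data/a,/data/b");

    // Inspect what ended up configured.
    for (Path p : FileInputFormat.getInputPaths(job)) {
      System.out.println("input: " + p);
    }
  }
}
```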
public static Path[] getInputPaths(JobContext context)
Get the list of input Paths for the map-reduce job.
Parameters:
  context - The job
Returns:
  the list of input Paths for the map-reduce job.

Copyright © 2023 Apache Software Foundation. All rights reserved.